Mini Project Report
“Jnana Sangama”, Belagavi – 590 018
                              A Mini Project Report on
                          “IMAGE CAPTION GENERATOR”
Submitted in partial fulfillment of the requirements for the award of the degree of
                    BACHELOR OF ENGINEERING
                               IN
                COMPUTER SCIENCE AND ENGINEERING
                  (Accredited by NBA, New Delhi, validity up to 30.06.2026)
                                   SUBMITTED BY
                       Sahana S.H                     4JD22CS406
                                      2023 - 2024
ABSTRACT
                                          ACKNOWLEDGEMENT
Although a single sentence hardly suffices, we would like to thank Almighty God for blessing us with His grace and taking our endeavor to a successful culmination.

We extend our sense of gratitude to Prof. Sameer B., Internship Co-ordinator, Dept. of CS&E, JIT, Davangere, for extending support and cooperation which helped us in the completion of the internship.

We extend our sense of gratitude to Dr. Mouneshachari S, Professor & Head, Department of CS&E, JIT, Davangere, for extending support and cooperation which helped us in the completion of the project.

We express our sincere thanks to Dr. Ganesh D B, Principal and Director, J.I.T, Davangere, for extending support and cooperation which helped us in the completion of the project.

We would like to extend our gratitude to all the staff of the Department of Computer Science and Engineering for the help and support rendered to us. We have benefited a lot from their feedback and suggestions.

We would also like to extend our gratitude to all our family members and friends, especially for their advice and moral support.
CHANKRIKA S (4JD21CS015)
DRASHAN P K (4JD21CS016)
SAHANA S H (4JD22CS406)
                                  CONTENTS
                                                                      Page No.
ABSTRACT                                             i
ACKNOWLEDGEMENT                                      ii
CONTENTS                                             iii
CHAPTER 1: INTRODUCTION                              1-2
           1.1 Overview of the Project
           1.2 Objectives
           1.3 Scope
CHAPTER 2: LITERATURE SURVEY                         3
CHAPTER 3: METHODOLOGY                               4-10
           3.1 Introduction
           3.2 System Requirement Specification
           3.3 Working Explanation
           3.4 Algorithms
           3.5 KNN Algorithm
           3.6 Logistic
           3.7 Overview of Dataset
           3.8 Methodology
CHAPTER 4: IMPLEMENTATION                            11-21
CHAPTER 5: RESULT                                    22
SNAPSHOTS                                            23-24
CONCLUSION                                           25
REFERENCES                                           26
CHAPTER 1:
                                       INTRODUCTION
Every day, we are surrounded by photos in our environment, on social media, and in the news. Humans can recognize and interpret photographs without any assigned captions, but machines must first be trained on images. The encoder-decoder architecture of Image Caption Generator models uses input vectors to generate valid and acceptable captions. This paradigm connects the worlds of natural language processing and computer vision: the task is to recognize and evaluate the image's context and then describe it in a natural language such as English.
Our approach is based on two basic models: a CNN (Convolutional Neural Network) and an LSTM (Long Short-Term Memory) network. The CNN is used as an encoder to extract features from the snapshot or image, and the LSTM is used as a decoder to organize the words and generate the caption. Image captioning can help in a variety of ways, such as assisting visually impaired users with text-to-speech through real-time feedback about a scene from a camera feed, and enriching social media engagement by generating captions for photos in social feeds as well as spoken messages.
Assisting children in recognizing objects is a step toward learning the language. Captions for every photograph on the internet can lead to faster and more accurate image search and indexing. Image captioning is used in a variety of sectors, including biology, business, and the internet, and in applications such as self-driving cars, where it could describe the scene around the car, and CCTV cameras, where alarms could be raised if any malicious activity is observed. The main purpose of this report is to gain a basic understanding of deep learning methodologies.
1.1 Overview of the Project
The project focuses on developing an Image Caption Generator using deep learning methodologies,
specifically employing Convolutional Neural Networks (CNNs) as encoders and Long Short-Term
Memory networks (LSTMs) as decoders. This approach aims to enable machines to interpret visual
content and generate descriptive captions in natural language. By extracting features from images
with CNNs and generating coherent captions with LSTMs, the model bridges the gap between
computer vision and natural language processing. Applications range from enhancing accessibility
for visually impaired individuals through real-time image description to improving social media
engagement by automatically captioning and indexing visual content. Moreover, the technology
finds practical uses in sectors such as biology, business, and automotive industries, demonstrating
its broad impact on enhancing human-computer interaction and advancing technological capabilities
in understanding and processing visual information.
1.2 Objectives
1. To describe a photograph in simple English sentences using Deep Learning (DL).
2. To apply CNN and LSTM models to the image-captioning task.
1.3 Scope
1. Model Development: Creating a robust deep learning architecture using CNNs for image
    feature extraction and LSTMs for caption generation, optimizing for accuracy and efficiency.
2. Application: Deploying the model in practical scenarios such as aiding visually impaired
    individuals, automating social media content, and integrating into industries like autonomous
    vehicles and surveillance systems.
3. Ethical Considerations: Addressing issues like data privacy, bias mitigation in captioning, and
    ensuring ethical deployment of AI technologies to promote fairness and societal benefit.
 CHAPTER 2:
                                 LITERATURE SURVEY
SL. NO.: 2
Paper Details: Experimental Assessment of Beam Search Algorithm for Improvement in Image Caption Generation. Chirani Lal Chowdhary, Aman Goyal, Bhavesh Kumar Vasnani. 01-12-2019.
Methodology Used: Case study.
Results Obtained: The paper is aimed at a beam search algorithm for improvement in image caption generation.
Future Work / Conclusion: The model generates the basic caption through the aid of the LSTM and RNN implementation with InceptionV3.
CHAPTER 3:
                                         METHODOLOGY
3.1 Introduction
The image caption generation project is built on CNN and LSTM models, which together act as the platform to generate sentences from a plain image. The approach can be applied to a wide range of applications.
3.4 Algorithms
A Convolutional Neural Network (CNN) is a type of deep learning model for processing data that has a grid pattern, such as images.
  • To train and test deep-learning CNN models, each input image is passed through a series of convolution layers with filters (kernels), pooling layers, and fully connected (FC) layers; a softmax function then classifies the object with probabilistic values between 0 and 1.
  • CNNs have unique layers called convolutional layers, which set them apart from RNNs and other neural networks.
  • Within a convolutional layer, the input is transformed before being passed to the next layer; a CNN transforms the data by using filters. A minimal sketch of such a network is given after this list.
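As an illustration of these layers, the following is a minimal CNN classifier sketch in Keras; the input size, filter counts, and the assumed 10 output classes are illustrative choices, not the settings used in this project.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),               # assumed input image size
    layers.Conv2D(32, (3, 3), activation='relu'),    # convolution layer with 3x3 kernels
    layers.MaxPooling2D((2, 2)),                     # pooling layer
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),            # fully connected (FC) layer
    layers.Dense(10, activation='softmax'),          # softmax over an assumed 10 classes
])
model.summary()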
2. Features:
  • Depth: VGG16 is one of the first deep CNNs, deeper than previous models like AlexNet.
  • Simplicity: It follows a simple and uniform architecture, using the same filter size (3x3)
    throughout.
  • Performance: Achieved state-of-the-art results on ImageNet in 2014, showcasing the
    effectiveness of deep networks for image classification.
3. Usage:
  • Pre-Trained Model: Often used for transfer learning, leveraging pretrained weights on
    ImageNet.
  • Limitations: High computational cost due to its depth and parameter count, not ideal for
    resource-constrained devices.
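As a rough sketch of this transfer-learning usage, the code below loads VGG16 with ImageNet weights and truncates it at the penultimate (fc2) layer so it returns a 4096-dimensional feature vector; the image path is a placeholder, not a file from the project.

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

base = VGG16()                                             # weights pretrained on ImageNet
feature_extractor = Model(inputs=base.inputs,
                          outputs=base.layers[-2].output)  # drop the classification layer

img = load_img('example.jpg', target_size=(224, 224))      # 'example.jpg' is a placeholder path
x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
features = feature_extractor.predict(x)                    # shape (1, 4096)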
LSTM networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems. This behavior is required in complex problem domains such as machine translation and speech recognition. LSTMs are a complex area of deep learning.
The CNN-LSTM architecture uses Convolutional Neural Network (CNN) layers for feature extraction on the input data, combined with LSTMs to support sequence prediction. CNN-LSTMs were developed for visual time-series prediction problems and for the application of generating textual descriptions from sequences of images (e.g., videos); a minimal sketch follows.
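The sketch below shows this combination in Keras: a small CNN wrapped in TimeDistributed extracts features from each frame of an assumed 10-frame image sequence, and an LSTM models the sequence. All shapes, layer sizes, and the binary output head are illustrative assumptions, not the project's configuration.

from tensorflow.keras import layers, models

cnn_lstm = models.Sequential([
    layers.Input(shape=(10, 64, 64, 3)),                            # assumed: 10 frames of 64x64 RGB
    layers.TimeDistributed(layers.Conv2D(16, (3, 3), activation='relu')),
    layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
    layers.TimeDistributed(layers.Flatten()),
    layers.LSTM(64),                                                 # sequence model over per-frame features
    layers.Dense(1, activation='sigmoid'),                           # assumed binary prediction head
])
cnn_lstm.summary()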
3.9 Methodology
   1. Import libraries.
   2. Load the Flickr8k dataset (data preprocessing).
   3. Apply a CNN to identify the objects in the image.
   4. Preprocess and tokenize the captions.
   5. Use an LSTM to predict the next word of the sentence.
   6. Build a data generator.
   7. View images with their captions.
The system architecture for the image captioning model comprises several key components.
Initially, it extracts image features using a pre-trained VGG16 model, which outputs high-level
image representations. These features are then used as inputs for a custom captioning model. The
captioning model consists of a dense layer with dropout to process image features, followed by an
LSTM-based sequence processing layer for the caption input. An attention mechanism is employed
to focus on relevant parts of the image features while generating captions. The model is trained
using categorical cross-entropy loss and the Adam optimizer, with data provided through a
generator that batches image features and tokenized captions.
After training, the model predicts captions for new images by generating one word at a time based
on the image features and previously generated words. The quality of generated captions is
evaluated using BLEU scores, which measure how closely the predicted captions match the ground
truth. The system also includes functions for visualizing the results by displaying the images along
with actual and predicted captions. This end-to-end pipeline enables automated generation and
evaluation of image descriptions.
The image captioning system begins with data collection and preparation, where images and
captions are gathered and preprocessed. Image features are extracted using a pre-trained VGG16
model, and captions are tokenized and cleaned. During model training, a data generator creates
batches of features and caption sequences for the LSTM-based captioning model, which includes an
attention mechanism. The trained model is then used to generate captions for new images, with
performance evaluated using BLEU scores. Finally, results are visualized by displaying images with
their actual and predicted captions, and the trained model along with the tokenizer is saved for
future use.
CHAPTER 4:
                                    IMPLEMENTATION
The code implements an image captioning system using a combination of deep learning techniques
and natural language processing.
1. Setup
The code begins by importing necessary libraries for image processing (`PIL`, `numpy`, `matplotlib`), deep learning (`tensorflow`), and text processing (`nltk`). It sets up directories for input (`INPUT_DIR`) and output (`OUTPUT_DIR`) data.
2. Image Feature Extraction
  • VGG16 Model: Utilizes the VGG16 model pretrained on ImageNet for extracting image features. The model's classification layer is removed to access the penultimate layer's output, which serves as image features.
  • `extract_image_features` Function: Loads each image from the directory (`flickr8k/Images`), preprocesses it to fit the VGG16 model requirements (`224x224` pixels, `preprocess_input`), and extracts features using the truncated VGG16 model. Extracted features are stored in a dictionary (`image_features`) with image IDs as keys (see the sketch after this list).
  • Storage: Extracted image features are stored using `pickle` in `img_features.pkl` for later use.
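A hedged sketch of this feature-extraction step is given below. The directory and output file names follow the report (`flickr8k/Images`, `img_features.pkl`); the exact structure of the original `extract_image_features` function is an assumption.

import os
import pickle
import numpy as np
from tqdm import tqdm
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

def extract_image_features(image_dir):
    base = VGG16()
    model = Model(inputs=base.inputs, outputs=base.layers[-2].output)   # drop classifier
    image_features = {}
    for name in tqdm(os.listdir(image_dir)):
        img = load_img(os.path.join(image_dir, name), target_size=(224, 224))
        x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
        image_features[name.split('.')[0]] = model.predict(x, verbose=0)  # image ID as key
    return image_features

image_features = extract_image_features('flickr8k/Images')
with open('img_features.pkl', 'wb') as f:
    pickle.dump(image_features, f)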
3. Caption Preprocessing
  • Captions Loading: Reads captions from `captions.txt`, where each line contains an image ID and its associated captions.
  • Cleaning Captions: Preprocesses captions by converting to lowercase, removing non-alphabetical characters, trimming extra spaces, and adding start (`startseq`) and end (`endseq`) tokens. Cleaned captions are stored in a dictionary (`image_to_captions_mapping`).
4. Tokenization
  • Tokenizer: Tokenizes cleaned captions to convert words into integer tokens and build a
    vocabulary (`tokenizer`). The tokenizer is saved in `tokenizer.pkl` for later use.
  • Vocabulary Size: Determines the size of the vocabulary (`vocab_size`) and calculates the
    maximum caption length (`max_caption_length`) among all captions.
5. Data Splitting
  • Training and Testing Sets: Splits image IDs into training (`train_ids`) and testing (`test_ids`)
    sets for model evaluation. By default, 90% of the data is used for training.
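A minimal sketch of the tokenization and 90/10 split described above follows; it assumes the cleaned `image_to_captions_mapping` dictionary from the caption-preprocessing step, and the variable names mirror the report.

from tensorflow.keras.preprocessing.text import Tokenizer

all_captions = [cap for caps in image_to_captions_mapping.values() for cap in caps]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)
vocab_size = len(tokenizer.word_index) + 1                        # +1 for the padding index
max_caption_length = max(len(c.split()) for c in all_captions)    # longest caption in words

image_ids = list(image_to_captions_mapping.keys())
split = int(len(image_ids) * 0.90)                                # 90% of the images for training
train_ids, test_ids = image_ids[:split], image_ids[split:]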
6. Data Generator
  • ‘data_generator’ Function: Generates batches of data for model training and validation. It
    iterates through image IDs, extracts image features and corresponding token sequences, and
    yields batches (`X1_batch`, `X2_batch`, `y_batch`) for training the model.
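The following is a hedged sketch of such a generator, consistent with the fragment shown later in this chapter; the padding, teacher-forcing, and one-hot encoding details are assumptions.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(ids, mapping, features, tokenizer, max_caption_length, vocab_size, batch_size):
    X1_batch, X2_batch, y_batch, batch_count = [], [], [], 0
    while True:                                        # loop forever; Keras draws batches on demand
        for image_id in ids:
            for caption in mapping[image_id]:
                seq = tokenizer.texts_to_sequences([caption])[0]
                for i in range(1, len(seq)):
                    in_seq = pad_sequences([seq[:i]], maxlen=max_caption_length)[0]
                    out_seq = to_categorical([seq[i]], num_classes=vocab_size)[0]
                    X1_batch.append(features[image_id][0])   # image feature vector
                    X2_batch.append(in_seq)                  # partial caption so far
                    y_batch.append(out_seq)                  # next word (one-hot)
                    batch_count += 1
                    if batch_count == batch_size:
                        yield ([np.array(X1_batch), np.array(X2_batch)], np.array(y_batch))
                        X1_batch, X2_batch, y_batch, batch_count = [], [], [], 0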
7. Model Architecture
  • Encoder-Decoder Architecture: Defines a deep learning model using Keras functional API:
    - Encoder: Processes image features (`inputs1`) through dense and LSTM layers to extract
       context (`fe2_projected`).
    - Decoder: Takes token sequences (`inputs2`) through embedding and LSTM layers to
       generate captions.
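Below is a simplified sketch of the encoder-decoder using the Keras functional API. For brevity it omits the Bidirectional LSTM and attention mechanism mentioned earlier and uses the classic merge-style design; the 4096-dimensional feature input and the 256-unit layer sizes are assumptions, with `vocab_size` and `max_caption_length` taken from the tokenization step.

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

inputs1 = Input(shape=(4096,))                       # image feature vector (encoder input)
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)             # projected image context

inputs2 = Input(shape=(max_caption_length,))         # token sequence (decoder input)
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

decoder1 = add([fe2, se3])                           # merge image context and text state
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')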
8. Model Training
  • Training Loop: Trains the defined model over multiple epochs (`epochs`) using
    `data_generator` for both training and validation data. It compiles the model with categorical
    cross-entropy loss and Adam optimizer.
  • Model Saving: Saves the trained model (`mymodel.h5`) in the `output_dir`.
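A minimal training-loop sketch consistent with this description is shown below; the epoch count, batch size, and save path are assumptions, and it reuses the `data_generator`, `train_ids`, and model objects sketched earlier.

epochs = 20
batch_size = 32
steps_per_epoch = len(train_ids) // batch_size

for _ in range(epochs):
    generator = data_generator(train_ids, image_to_captions_mapping, image_features,
                               tokenizer, max_caption_length, vocab_size, batch_size)
    model.fit(generator, epochs=1, steps_per_epoch=steps_per_epoch, verbose=1)

model.save('mymodel.h5')   # saved to the output directory in the report's code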
9. Model Evaluation
  • BLEU Score Calculation: Evaluates model performance on test data using BLEU-1 and
    BLEU-2 scores (`corpus_bleu` from `nltk`). Compares actual vs. predicted captions to assess
    captioning quality.
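A minimal example of this scoring with nltk's `corpus_bleu` follows; the reference and hypothesis lists here are placeholders, since in the real evaluation they are built from the test-set captions (each reference entry is a list of tokenized ground-truth captions for one image).

from nltk.translate.bleu_score import corpus_bleu

actual_captions = [[['two', 'dogs', 'run', 'on', 'grass']]]            # placeholder references
predicted_captions = [['two', 'dogs', 'running', 'through', 'grass']]  # placeholder hypotheses

print('BLEU-1: %f' % corpus_bleu(actual_captions, predicted_captions, weights=(1.0, 0, 0, 0)))
print('BLEU-2: %f' % corpus_bleu(actual_captions, predicted_captions, weights=(0.5, 0.5, 0, 0)))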
  • Example: Demonstrates how to use `generate_caption` to generate and display captions for a
    sample image (“101669240_b2d3e7f17b.jpg”) after training the model.
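The word-by-word decoding inside `generate_caption` can be sketched as follows; the helper below is a hypothetical reconstruction of the greedy prediction loop, not the exact code used in the project.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def predict_caption(model, feature, tokenizer, max_caption_length):
    index_to_word = {idx: w for w, idx in tokenizer.word_index.items()}
    caption = 'startseq'
    for _ in range(max_caption_length):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_caption_length)
        yhat = np.argmax(model.predict([feature, seq], verbose=0))   # most probable next word
        word = index_to_word.get(yhat)
        if word is None or word == 'endseq':
            break
        caption += ' ' + word
    return caption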
# Basic libraries
import os
import pickle
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import warnings
warnings.filterwarnings('ignore')
from math import ceil
from collections import defaultdict
from tqdm.notebook import tqdm # Progress bar library for Jupyter Notebook
# Load captions: each line of captions.txt is "image_name.jpg,caption"
image_to_captions_mapping = defaultdict(list)
with open(os.path.join(INPUT_DIR, 'captions.txt'), 'r') as f:
  next(f)  # skip the header line (assumed, as in the Kaggle Flickr8k captions file)
  for line in f:
    tokens = line.strip().split(',')
    if len(tokens) < 2:
       continue
    image_id, *captions = tokens
    image_id = image_id.split('.')[0]       # drop the file extension to get the image ID
    caption = " ".join(captions)            # rejoin caption text that contained commas
    image_to_captions_mapping[image_id].append(caption)
# Preprocess captions
def clean_captions(mapping):
  for key, captions in mapping.items():
     for i in range(len(captions)):
       caption = captions[i]
       caption = caption.lower()
       caption = ''.join(char for char in caption if char.isalpha() or char.isspace())
        caption = ' '.join(caption.split())  # collapse repeated whitespace (str.replace does not handle regex)
       caption = 'startseq ' + ' '.join([word for word in caption.split() if len(word) > 1]) + ' endseq'
       captions[i] = caption
  return mapping
# Clean captions
clean_captions(image_to_captions_mapping)
# Tokenize captions
from tensorflow.keras.preprocessing.text import Tokenizer  # shown here for completeness; may appear earlier in the original notebook

all_captions = [caption for captions in image_to_captions_mapping.values() for caption in captions]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)

# Save tokenizer
with open('tokenizer.pkl', 'wb') as tokenizer_file:
  pickle.dump(tokenizer, tokenizer_file)
               # (fragment) inner batching logic of the data generator: emit a batch once it is full
               y_batch.append(out_seq)
               batch_count += 1
               if batch_count == batch_size:
                  yield ([np.array(X1_batch), np.array(X2_batch)], np.array(y_batch))
                  X1_batch, X2_batch, y_batch = [], [], []
                  batch_count = 0
       except Exception as e:
          print(f"Exception occurred: {e}")

# (fragment) decoder input: embed the token sequence and run it through a bidirectional LSTM
inputs2 = Input(shape=(max_caption_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = Bidirectional(LSTM(256, return_sequences=True))(se2)
  # (fragment) tail of generate_caption: print the predicted caption and display the image
  print(y_pred)
  plt.imshow(image)
  plt.axis('off')
  plt.show()

# Example usage
generate_caption("101669240_b2d3e7f17b.jpg")
CHAPTER 5:
                                             RESULT
The project aims to develop an Image Caption Generator using Convolutional Neural Networks
(CNNs) for image feature extraction and Long Short-Term Memory networks (LSTMs) for
generating descriptive captions. This fusion of computer vision and natural language processing
enables machines to interpret visual content and produce coherent sentences in natural language.
Applications include enhancing accessibility for visually impaired individuals through real-time
image description, automating social media content by automatically captioning visual posts, and
supporting industries such as autonomous vehicles and surveillance systems.
The methodology involves utilizing CNNs to extract meaningful features from images and
employing LSTMs to predict and generate captions. The CNN layers process images to identify
objects, while the LSTM sequences these features into understandable sentences. The project's scope
covers model development, application deployment, and ethical considerations like data privacy
and bias mitigation in AI technologies. By leveraging deep learning techniques like CNNs and
LSTMs, the project aims to advance human-computer interaction capabilities in understanding and
processing visual information effectively.
                                  SNAPSHOTS
             The output with Actual and Predicted Captions is as follows:
Figure 1: Output 1
Figure 2: Output 2
Figure 3: Output 3
Figure 4: Output 4
CONCLUSION
The integration of CNNs and LSTMs in Image Caption Generator models marks a significant
advancement at the intersection of computer vision and natural language processing. CNNs excel in
extracting detailed features from images, while LSTMs effectively organize these features into
coherent sentences, enabling machines to describe visual content accurately in natural language.
This technology offers practical benefits across various fields. It enhances accessibility by
providing real-time image descriptions for visually impaired individuals and automates social media
interactions through meaningful image captions. Industries such as autonomous vehicles and
surveillance systems also leverage its ability to interpret and articulate visual scenes with precision.
However, ethical considerations, such as safeguarding data privacy, mitigating biases in captioning,
and ensuring responsible AI deployment, are critical for fostering trust and ensuring equitable
outcomes in society.
In summary, this project highlights the transformative potential of deep learning in improving
human-computer interaction and advancing capabilities in visual information processing. Future
research should focus on enhancing model efficiency, refining performance metrics, and addressing
ethical implications to broaden the application and societal acceptance of AI technologies.
REFERENCES
[1] R. Subash (November 2019): Automatic Image Captioning Using Convolution Neural Networks and LSTM.
[2] Seung-Ho Han, Ho-Jin Choi (2020): Domain-Specific Image Caption Generator with Semantic Ontology.
[3] Pranay Mathur, Aman Gill, Aayush Yadav, Anurag Mishra and Nand Kumar Bansode (2017): Camera2Caption: A Real-Time Image Caption Generator.
[4] Simao Herdade, Armin Kappeler, Kofi Boakye, Joao Soares (June 2019): Image Captioning: Transforming Objects into Words.
[5] Manish Raypurkar, Abhishek Supe, Pratik Bhumkar, Pravin Borse, Dr. Shabnam Sayyad (March 2021): Deep Learning-based Image Caption Generator.
[6] Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan (2015): Show and Tell: A Neural Image Caption Generator.
[7] Jianhui Chen, Wenqiang Dong, Minchen Li (2015): Image Caption Generator Based on Deep Neural Networks.
[8] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang (2017): Bottom-Up and Top-Down Attention for Image Captioning.
[9] Jyoti Aneja, Aditya Deshpande, and Alexander G. Schwing (2018): Convolutional Image Captioning.
[10] Shuang Bai and Shan An (2018): A Survey on Automatic Image Caption Generation.