A
Mini Project Report
on
“AI Based Image Captioning System”
In partial fulfillment of the requirements for the degree
of
Bachelor of Technology
in
Computer Science and Engineering (Data Science)
SUBMITTED BY:
Anubhav Singh (2301331540033)
Sutar Prarthana (2301331540163)
Pihu Gupta (2301331540104)
Under Supervision of:
Ms. Mona Devi
(Assistant Professor, Data Science)
NOIDA INSTITUTE OF ENGINEERING AND TECHNOLOGY, GREATER NOIDA
Acknowledgment
It is a great privilege for us to express our profound gratitude to our respected teacher, Ms. Mona Devi, Noida Institute of Engineering & Technology, Greater Noida, for her constant guidance, valuable suggestions, supervision, and inspiration throughout the course work, without which it would have been difficult to complete the work within the scheduled time. We are also indebted to the Head of the Department, Dr. Manali Gupta, Data Science, Noida Institute of Engineering & Technology, for permitting us to pursue the project. We would like to take this opportunity to thank all the respected teachers of this department for being a perennial source of inspiration and for showing us the right path in times of need.
Anubhav Singh (2301331540033)
Sutar Prarthana (2301331540163)
Pihu Gupta (2301331540104)
CERTIFICATE
We hereby certify that the work presented in the Project Report entitled “AI Based Image Captioning System”, in partial fulfillment of the requirements for the award of the Bachelor of Technology in Data Science and submitted to the Department of Data Science, Noida Institute of Engineering & Technology, Greater Noida, is an authentic record of our project carried out during the 4th semester under the supervision of Ms. Mona Devi, Assistant Professor, Department of Data Science, Noida Institute of Engineering & Technology, Greater Noida. The matter embodied in this Project Report is original and has not been submitted for the award of any other degree or diploma.
Anubhav Singh (2301331540033)
Sutar Prarthana (2301331540163)
Pihu Gupta (2301331540104)
This is to certify that the above statement made by the candidates is correct and true to the best of my knowledge.
Signature of Guide
INDEX

1. Abstract
2. Introduction
3. Literature Review
4. Objective and Scope
5. Methodology
6. Module Description
7. Result
8. Conclusion
9. Appendix
10. References

ABSTRACT
This project introduces an advanced Image Captioning System designed to
generate intelligent and personalized descriptions for user-uploaded images.
Integrating state-of-the-art computer vision and natural language processing
models, the system automatically interprets visual elements and generates
accurate, contextually relevant captions. What sets this system apart is its high
level of customization: users can select from various caption styles such as
formal, casual, poetic, or humorous to match their desired tone. Additionally,
users can input optional prompts to influence the content of the caption further.
The system also offers flexible output formats, including plain text, emoji-
enhanced descriptions, and hashtag-rich captions—making it highly useful for
social media sharing, storytelling, digital marketing, and content creation. By
merging deep learning with user-centric controls, the Image Captioning System
enhances creativity, accessibility, and user engagement in visual
communication.
1. INTRODUCTION
The rise of digital media has significantly increased the volume of
visual content shared and consumed daily. From social networking
platforms to e-commerce sites and educational resources, images have
become a dominant medium for communication and storytelling.
However, without accompanying textual descriptions, images often
lack context and accessibility—particularly for users with visual
impairments or those relying on search engines and content indexing
tools. This has led to a growing demand for intelligent image
captioning systems capable of automatically generating descriptive
and meaningful captions.
Image captioning is a challenging task that lies at the intersection of
computer vision and natural language processing (NLP). It
involves analyzing the content of an image and producing a coherent
textual description that reflects the objects, actions, context, and
emotions depicted. Recent advancements in deep learning have
significantly improved the performance of image captioning systems,
especially with the use of pretrained multimodal models.
In this project, we develop a user-centric Image Captioning System
that not only generates accurate captions but also allows for extensive
customization.
The backend of the system utilizes Firebase for efficient, scalable,
and real-time data storage and user management. This ensures
seamless interaction between the frontend and backend while securely
storing user-uploaded images and generated captions.
For caption generation, the system employs a pretrained Gemini
model, known for its advanced multimodal understanding. Gemini is
capable of interpreting complex visual data and generating
linguistically rich descriptions that align with the chosen style and
user input. This approach eliminates the need for training models
from scratch, reduces computational overhead, and enhances system
performance.
By combining state-of-the-art AI technology with a highly
customizable and interactive user interface, this project seeks to make
image captioning not just automated, but expressive and personal. The
goal is to enhance user creativity, improve accessibility, and support a
wide range of real-world applications—from social media content
generation to visual storytelling and marketing.
2. Literature Review
Image captioning, which integrates computer vision and natural
language processing (NLP), has become a focal point in AI research.
This field is essential for generating descriptive captions based on
visual content, with applications in accessibility, social media,
marketing, and creative industries. The following review highlights
significant advancements in image captioning and related
technologies, providing a foundation for the current project.[1]
The concept of image captioning has evolved significantly with the
development of deep learning techniques. Vinyals et al. (2015)
pioneered an end-to-end model combining convolutional neural
networks (CNNs) and recurrent neural networks (RNNs) to
automatically generate captions from images. This model marked a
significant shift from earlier, template-based approaches,
demonstrating that deep learning could improve the relevance and
fluency of captions. Their work laid the foundation for modern
approaches to image captioning and remains a crucial reference for
the current project.[2]
As deep learning models progressed, pretrained models emerged as
a vital tool for image captioning. The Gemini model, which this
project leverages, is an advanced pretrained transformer model. Li et
al. (2019) highlighted the success of pretrained transformers, which
integrate visual feature extraction with language generation
capabilities. These models can generate highly accurate and
contextually rich captions, making them an excellent choice for
automated image description generation. The ability to fine-tune such
models on domain-specific data further improves their application
across diverse industries.[3]
However, customization remains a significant challenge in image
captioning systems. Users often desire captions tailored to specific
contexts, such as tone, style, or sentiment. Anderson et al. (2017)
explored how incorporating contextual attributes, such as sentiment
or specific stylistic choices, can enhance the personalization of
captions. Their work provides a framework for customizing captions
based on user preferences, which is a critical feature in the current
project.[4]
Furthermore, recent trends in digital communication have led to an
increasing use of emojis and hashtags alongside traditional text
captions. Gatt and Krahmer (2018) discussed the growing use of
emotion-based symbols like emojis and hashtags, which add an
expressive layer to text captions, particularly on social media
platforms. This shift toward more interactive and visually engaging
captions is particularly relevant for applications that aim to enhance
user engagement, such as this image captioning system. Providing
options for users to choose between plain text, emojis, or hashtags
helps tailor the captions to different communication needs.
The backend infrastructure supporting these systems is also critical
for ensuring performance and reliability. Firebase, a cloud-based
storage platform, is widely used in image captioning applications for
real-time data storage. Firebase allows seamless storage of images
and captions, ensuring smooth and efficient management of data. The
scalability and security features of Firebase make it ideal for
applications that handle large volumes of user-generated content.
According to the Firebase documentation (2021), the platform's ease
of integration with various web and mobile frameworks makes it a
preferred choice for developers.
In terms of user interaction, the design and simplicity of the
interface are critical for engagement. Nielsen (2012) emphasizes that
a clean, user-friendly design is essential for improving user
experience and minimizing friction. The current project follows these
principles by ensuring that the image captioning interface is
straightforward and accessible, allowing users to upload images,
choose caption styles, and receive output with minimal effort.
While various systems have been developed to automate image
captioning, few focus on customization and personalization. This
project aims to fill that gap by offering a flexible, user-driven
approach to caption generation. By allowing users to select from
different caption styles, formats (e.g., emojis, hashtags), and tone
adjustments, this system provides a more engaging and tailored
experience.
In summary, existing research has demonstrated the potential of deep
learning models and pretrained transformers in improving the
accuracy and relevance of image captions. Additionally, the need for
customization and personalization is becoming more pronounced as
users seek captions that align with their communication preferences.
This project builds on existing research by integrating Gemini,
Firebase, and a user-centric interface to create a flexible image
captioning system that offers both high-quality outputs and extensive
user customization.
3. OBJECTIVE AND SCOPE
The primary goal of this project is to create an interactive image
captioning system that allows users to upload images and generate
personalized captions using a pretrained deep learning model. The
system combines computer vision and natural language processing to
produce contextually accurate and creative descriptions, offering users
a flexible, user-friendly experience. By integrating real-time data
storage and customization options, the project aims to provide a
versatile platform for generating image captions tailored to various
needs, such as social media, marketing, and educational content.
Key objectives include:
• Developing an image captioning system using the pretrained Gemini model to generate contextual captions based on image content.
• Enabling user customization by offering various caption styles (e.g., formal, casual, poetic, humorous) and additional prompt inputs.
• Supporting multiple output formats (text, emojis, hashtags) for diverse communication scenarios.
• Using Firebase for secure, real-time storage of images and captions.
• Creating an intuitive web interface that allows easy interaction for users with minimal technical expertise.
The scope of this project covers the design, development, and
deployment of a web-based application for intelligent image
captioning. The system leverages a pretrained Gemini model for
generating captions based on uploaded images and allows users to
customize the output through various stylistic choices and formats.
Firebase is used for secure image and data storage.
The project is designed for educational, creative, and marketing
purposes, offering users an easy-to-use tool for generating
personalized image captions. The application does not include
advanced image recognition, GPS integration, or mobile app
development but is designed with a modular structure for future
expansions, such as multilingual support, voice-based inputs, or
community sharing features.
4. Methodology
The Image Captioning System follows a structured, modular approach that ensures
smooth interaction between the user interface, AI model, and backend services.
The system is designed to provide an end-to-end pipeline that starts with user
input and ends with customized, AI-generated captions stored for future access
or sharing.
The pipeline proceeds through the following stages:

1. Image Upload: The user selects and uploads an image through the system's user interface.
2. Input Customization: The user selects the desired caption style (e.g., formal, casual, poetic, humorous), output format (plain, emoji, hashtag), and an optional prompt.
3. Image Preprocessing: The uploaded image is preprocessed (resizing, normalization, etc.) to ensure compatibility with the model.
4. Feature Extraction: The image is passed to the pretrained Gemini model, which extracts semantic features from the visual content.
5. Caption Generation: The Gemini model generates a base caption based on the extracted features and any additional user input (prompt or style).
6. Caption Personalization: The raw caption is post-processed to match the user-selected style and output format (e.g., converting words to emojis, adding hashtags, rephrasing tone).
7. Data Storage: The image, user inputs, and generated caption are stored in the database for retrieval and history tracking.
8. Optional Feedback Logging: The user can rate the quality of the generated caption to support future system improvements.
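To illustrate step 6 (Caption Personalization), the sketch below shows one plausible post-processing routine. It is written in TypeScript to match the Next.js stack described in the Module Description and Appendix; the emoji map and hashtag heuristic are illustrative assumptions, as the report does not specify the exact rules.

```typescript
// personalize.ts - hypothetical sketch of the caption-personalization step.
// The emoji map and hashtag heuristic are assumptions, not the project's actual rules.

type OutputFormat = "plain" | "emoji" | "hashtag";

const EMOJI_MAP: Record<string, string> = {
  sunset: "🌅",
  dog: "🐶",
  beach: "🏖️",
};

export function personalizeCaption(caption: string, format: OutputFormat): string {
  switch (format) {
    case "emoji":
      // Append an emoji for each mapped keyword found in the caption.
      return Object.entries(EMOJI_MAP).reduce(
        (text, [word, emoji]) =>
          text.toLowerCase().includes(word) ? `${text} ${emoji}` : text,
        caption,
      );
    case "hashtag": {
      // Turn up to three of the longer words into trailing hashtags.
      const tags = caption
        .toLowerCase()
        .replace(/[^a-z\s]/g, "")
        .split(/\s+/)
        .filter((w) => w.length > 4)
        .slice(0, 3)
        .map((w) => `#${w}`);
      return `${caption} ${tags.join(" ")}`.trim();
    }
    default:
      return caption; // "plain" passes the base caption through unchanged
  }
}
```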
The development of the Image Captioning System adopted a
systematic and modular approach designed to harmonize AI
capabilities with user-centered design. The primary objective was to
create an end-to-end platform where users could upload images and
receive intelligently generated captions that reflect not only the image
content but also their stylistic preferences.
To begin, efforts were focused on building an intuitive interface that
would facilitate image uploads and allow users to choose caption
styles and output formats. Design tools were utilized to create a clean,
responsive front-end layout that supports interaction across devices.
The frontend served as a crucial bridge between users and the
underlying AI services, capturing preferences such as tone (formal,
casual, poetic, humorous) and format (plain text, emojis, hashtags).
On the backend, a robust framework was established using Python
and Flask to manage data flow and system logic. Preprocessing
modules were integrated to handle tasks such as resizing, format
conversion, and normalization, preparing each uploaded image for
seamless model interaction. This was a critical step to ensure
consistent performance and compatibility with the pretrained Gemini
model used for caption generation.
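As a concrete illustration of this preprocessing step, the sketch below performs resizing and normalization. It is shown in TypeScript with the sharp image library for consistency with the Next.js stack described in later sections; the 1024-pixel bound and JPEG re-encoding are assumed parameters, not values stated in the report.

```typescript
// preprocess.ts - hedged sketch of image preprocessing before model inference.
// The size bound, quality setting, and use of sharp are assumptions.
import sharp from "sharp";

export async function preprocessImage(input: Buffer): Promise<Buffer> {
  return sharp(input)
    .rotate()                   // honor EXIF orientation
    .resize(1024, 1024, {
      fit: "inside",            // preserve aspect ratio
      withoutEnlargement: true, // never upscale small images
    })
    .jpeg({ quality: 85 })      // normalize format and file size
    .toBuffer();
}
```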
The heart of the system lay in the integration of the Gemini model—a
sophisticated transformer-based architecture trained on diverse visual-
linguistic datasets. Upon receiving a processed image, the model
extracted semantic features and generated a base caption. This output
was then passed through a post-processing layer that personalized the
caption according to the user’s inputs, enhancing engagement and
relevance.
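A minimal sketch of this integration is shown below, assuming the official @google/generative-ai Node SDK and a vision-capable Gemini model; the exact model name and prompt wording used by the project are not stated in the report and are assumptions here.

```typescript
// caption.ts - hedged sketch of requesting a base caption from Gemini.
// Model name and prompt wording are assumptions.
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

export async function generateBaseCaption(
  imageBase64: string,
  mimeType: string,
  style: string,
  userPrompt?: string,
): Promise<string> {
  const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });
  const prompt =
    `Describe this image in a ${style} tone.` +
    (userPrompt ? ` Additional context: ${userPrompt}` : "");
  // Multimodal request: one text part plus one inline image part.
  const result = await model.generateContent([
    prompt,
    { inlineData: { data: imageBase64, mimeType } },
  ]);
  return result.response.text();
}
```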
In parallel, a Firebase Realtime Database was configured to log all
interactions, including user-uploaded images, their selections, and the
generated captions. This setup ensured persistent access to caption
history and opened opportunities for data-driven refinement.
Additionally, a feedback mechanism allowed users to rate captions,
contributing insights for future model and system improvements.
Throughout the development process, testing and optimization played
a vital role. The system underwent multiple rounds of evaluation to
fine-tune its performance across various scenarios and image types.
Challenges around processing speed, model output diversity, and UI
responsiveness were addressed through iterative debugging and
enhancement.
This methodical development process underscores the fusion of machine learning, user interface design, and backend engineering. The resulting system not only automates image description generation but also delivers a customizable and accessible experience, bridging the gap between AI-generated content and human creativity.
5. Module Description
The project is structured into several interrelated
modules that collectively deliver an AI-powered, user-
friendly image captioning platform. Each module is
designed for scalability, maintainability, and high
performance, enabling a seamless experience for users.
The following are the key modules:
1. CaptionGenerator Component:
This is the core frontend module where users
interact with the application. It handles image
upload via drag-and-drop or file selection, allows
users to choose caption styles (humorous, poetic,
formal, etc.), and provides an input area for
additional context. It dynamically manages the UI
state, invokes AI captioning APIs, handles error
responses, and displays the generated captions.
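A simplified sketch of this component follows; the state shape and the /api/generate request contract are assumptions based on the API Routes module described below.

```typescript
// CaptionGenerator.tsx - trimmed sketch of the core frontend module.
// Component structure and the /api/generate contract are assumptions.
"use client";
import { useState } from "react";

const STYLES = ["formal", "casual", "poetic", "humorous"] as const;

export default function CaptionGenerator() {
  const [file, setFile] = useState<File | null>(null);
  const [style, setStyle] = useState<string>("casual");
  const [caption, setCaption] = useState("");
  const [error, setError] = useState("");

  async function handleGenerate() {
    if (!file) return;
    const body = new FormData();
    body.append("image", file);
    body.append("style", style);
    try {
      const res = await fetch("/api/generate", { method: "POST", body });
      if (!res.ok) throw new Error(`Request failed: ${res.status}`);
      const data = await res.json();
      setCaption(data.caption);
    } catch (e) {
      setError(e instanceof Error ? e.message : "Caption generation failed");
    }
  }

  return (
    <div>
      <input
        type="file"
        accept="image/*"
        onChange={(e) => setFile(e.target.files?.[0] ?? null)}
      />
      <select value={style} onChange={(e) => setStyle(e.target.value)}>
        {STYLES.map((s) => <option key={s}>{s}</option>)}
      </select>
      <button onClick={handleGenerate}>Generate</button>
      {caption && <p>{caption}</p>}
      {error && <p role="alert">{error}</p>}
    </div>
  );
}
```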
2. AI Flow Module:
This backend module orchestrates the AI
captioning logic. It contains pre-defined prompt
templates that structure image metadata into
coherent prompts. It uses schema validation to
ensure proper request formatting and processes
responses through sanitization and style-check
routines. This module can be easily extended to
support multi-language or tone-specific captions.
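The sketch below shows one plausible shape for this module's schema validation and prompt templating. The zod library is an assumed choice, as the report does not name the validator, and the field names are illustrative.

```typescript
// aiFlow.ts - hedged sketch of request validation and prompt templating.
// zod is an assumed dependency; field names are illustrative.
import { z } from "zod";

export const CaptionRequestSchema = z.object({
  imageBase64: z.string().min(1),
  mimeType: z.enum(["image/jpeg", "image/png", "image/webp"]),
  style: z.enum(["formal", "casual", "poetic", "humorous"]),
  format: z.enum(["plain", "emoji", "hashtag"]).default("plain"),
  userPrompt: z.string().max(300).optional(),
});

export type CaptionRequest = z.infer<typeof CaptionRequestSchema>;

// Pre-defined prompt template that folds the user's choices into one instruction.
export function buildPrompt(req: CaptionRequest): string {
  const extra = req.userPrompt ? ` Context from the user: ${req.userPrompt}.` : "";
  return `Write a single ${req.style} caption for the attached image.${extra}`;
}
```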
3. API Routes:
Built using Next.js API routes, this module manages
communication between frontend and backend.
The /api/generate endpoint forwards image data to
the AI system, while /api/save persists image-
caption pairs to Firebase. Error handling,
authentication hooks, and data transformation
middleware are part of this layer.
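A trimmed sketch of the /api/generate handler follows, assuming the Next.js 13 App Router convention listed in the Appendix; the imported helpers refer to the hypothetical sketches shown in earlier sections.

```typescript
// app/api/generate/route.ts - hedged sketch of the generate endpoint.
// Helper modules are the hypothetical sketches from earlier sections.
import { NextResponse } from "next/server";
import { generateBaseCaption } from "@/lib/caption";
import { personalizeCaption } from "@/lib/personalize";

export async function POST(req: Request) {
  try {
    const form = await req.formData();
    const image = form.get("image") as File | null;
    const style = String(form.get("style") ?? "casual");
    if (!image) {
      return NextResponse.json({ error: "No image provided" }, { status: 400 });
    }
    // Convert the uploaded file to base64 for the Gemini request.
    const base64 = Buffer.from(await image.arrayBuffer()).toString("base64");
    const raw = await generateBaseCaption(base64, image.type, style);
    const caption = personalizeCaption(raw, "plain");
    return NextResponse.json({ caption });
  } catch {
    return NextResponse.json({ error: "Caption generation failed" }, { status: 500 });
  }
}
```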
4. Database Layer:
This module is responsible for storing and
retrieving user-uploaded images and generated
captions. Each image-caption pair is stored as a
document with metadata (timestamp, style, user
notes). The database schema is optimized for
quick querying and future analytics on user
behavior and AI performance.
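A minimal sketch of persisting one image-caption document is shown below. The report says only "Firebase", so the use of Firestore in particular, along with the collection and field names, is an assumption.

```typescript
// db.ts - hedged sketch of persisting an image-caption pair.
// Firestore (vs. Realtime Database) and the schema are assumptions.
import { initializeApp } from "firebase/app";
import {
  getFirestore,
  collection,
  addDoc,
  serverTimestamp,
} from "firebase/firestore";

const app = initializeApp({
  apiKey: process.env.NEXT_PUBLIC_FIREBASE_API_KEY,
  projectId: process.env.NEXT_PUBLIC_FIREBASE_PROJECT_ID,
});
const db = getFirestore(app);

export async function saveCaption(imageUrl: string, caption: string, style: string) {
  return addDoc(collection(db, "captions"), {
    imageUrl,
    caption,
    style,
    createdAt: serverTimestamp(), // timestamp metadata for history sorting
  });
}
```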
5. ImagePreview & History Components:
These UI modules enhance user engagement by
displaying previously uploaded images and their
associated captions. Users can browse, reuse, or
modify past captions, encouraging content
experimentation and reuse. Pagination, sorting,
and search functionalities are integrated.
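A paginated history fetch over the same hypothetical captions collection could look like the following sketch; the page size is an assumed default.

```typescript
// history.ts - hedged sketch of fetching recent captions for the History view.
import {
  getFirestore,
  collection,
  query,
  orderBy,
  limit,
  getDocs,
} from "firebase/firestore";

export async function fetchRecentCaptions(pageSize = 10) {
  const db = getFirestore(); // reuses the app initialized in db.ts
  const q = query(
    collection(db, "captions"),
    orderBy("createdAt", "desc"), // newest captions first
    limit(pageSize),
  );
  const snap = await getDocs(q);
  return snap.docs.map((d) => ({ id: d.id, ...d.data() }));
}
```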
6. Utility and Shared Components:
Reusable UI elements such as buttons, file inputs,
modals, toasts, and alert boxes are encapsulated
in this module. Custom hooks manage user
notifications, loading states, and responsiveness,
ensuring consistency across the app. Tailwind
CSS is employed for styling.
7. Error & Feedback Handler:
This module logs client-side and server-side errors,
tracks failed AI calls, and shows contextual
messages to users. Feedback mechanisms allow
users to rate generated captions or flag
inaccuracies, which can be used to improve
prompt engineering in future iterations.
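A small sketch of the feedback path is given below; the /api/feedback endpoint and its payload shape are hypothetical.

```typescript
// feedback.ts - hedged sketch of logging caption ratings.
// The /api/feedback endpoint and payload shape are hypothetical.
export async function submitFeedback(captionId: string, rating: 1 | 2 | 3 | 4 | 5) {
  try {
    const res = await fetch("/api/feedback", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ captionId, rating }),
    });
    if (!res.ok) throw new Error(`Feedback failed: ${res.status}`);
  } catch (err) {
    // Log client-side; a real build might forward this to a monitoring service.
    console.error("feedback error", err);
  }
}
```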
Fig. 1: Model Implementation Flowchart
6. Result
The following are snapshots of the working application:
Fig. 2: Snapshots of the working application
7. Conclusion
The project demonstrates the powerful integration
of artificial intelligence with modern web
development technologies to solve a real-world
problem: generating meaningful and creative
captions for images. By combining a clean and
responsive frontend built with Next.js, React, and
Tailwind CSS with a robust backend powered by
Firebase and AI prompt flows, the application
delivers a seamless user experience for image
captioning.
This tool not only reduces the time and cognitive
load involved in writing captions but also
empowers users—social media influencers,
marketers, bloggers, and everyday users—to
enhance the storytelling potential of their visuals.
The modular structure of the application ensures
maintainability, scalability, and flexibility for future
enhancements.
Through its AI-driven caption styles, contextual
input options, and caption history management,
Gemini Captionator bridges the gap between
technology and creativity. It also sets a foundation
for potential future upgrades such as multi-
language support, voice-to-caption input, advanced
style personalization, and integration with social
media platforms.
In essence, the project is a practical, user-centric
application that brings the power of AI to everyday
content creation, showcasing the potential of
intelligent systems to enrich digital
communication.
Appendix
A. System Configuration
The system configuration used in the development
of the project includes the following:
Framework: Next.js 13 with App Router
Language: TypeScript with React 18
Styling: Tailwind CSS 3.x
Database: Firebase
AI Integration: Gemini API (Google AI Studio) with
custom prompt-based flows
Development Environment: Visual Studio Code with
Node.js LTS (18.x)
Deployment Platform: Vercel (CI/CD Enabled)
Browser Support: Chrome, Firefox, Edge, Safari
(latest versions)
C. Image & Caption Input Samples
Fig. 3: Snapshot of the login page
The application supports user-uploaded images
with optional context and style selection. Below are
a few example images and their corresponding
caption types generated by the system.
Fig. 4: Snapshot of the dashboard
Image Upload Example - Anime
Style Chosen: Poetic
Generated Caption: "A futuristic Samurai, ready for
only cyberpunk challenge"
Image Upload Example –
Fig. 5: Snapshot of test results
D. Testing Results
The system was tested for key functional areas to
ensure a smooth and efficient user experience.
Test Case 1: Image Upload and Preview
Objective: Verify successful image upload and
preview rendering.
Result: Image uploads via both drag-and-drop and
file selection were recognized with 100% reliability
and rendered within 1 second.
Test Case 2: Caption Generation Accuracy
Objective: Ensure the AI accurately interprets
image content and applies chosen style.
Result: Captions matched image context in 9 out of
10 trials. Humor, poetic, and formal styles were
distinct and coherent.
Test Case 3: Backend Data Storage
Objective: Confirm captions and images are saved
to Firebase and retrievable.
Result: Each image-caption pair was successfully
stored with metadata and fetched under history
view.
Test Case 4: Responsive UI Behavior
Objective: Test layout across various screen sizes
and devices.
Result: Tailwind CSS rendered all components
correctly on desktop, tablet, and mobile screens.
Below are visual examples from the application
during testing and development:
References

[1] Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). "Show and Tell: A Neural Image Caption Generator." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://arxiv.org/abs/1411.4555

[2] Xu, K., Ba, J., Kiros, R., et al. (2015). "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." In Proceedings of the 32nd International Conference on Machine Learning (ICML). https://arxiv.org/abs/1502.03044

[3] Sharma, A., Singh, S., & Jain, R. (2022). "Artificial Intelligence-Based Auto Image Caption Generator for Business Accounts on Social Media." ResearchGate. https://www.researchgate.net/publication/369827730

[4] Firebase, Inc. (2023). "How to Integrate Firebase Into Your Next.js App." Firebase Developer Center. https://www.mongodb.com/developer/languages/javascript/nextjs-with-mongodb/

[5] Tailwind Labs. (2023). "Tailwind CSS Documentation: Responsive Design." https://tailwindcss.com/docs/responsive-design

[6] OpenAI. (2023). "ChatGPT Technical Report." https://openai.com/research/chatgpt

[7] GeeksforGeeks. (2022). "How to Integrate Firebase in Next.js." https://www.geeksforgeeks.org/how-to-integrate-Firebase-in-next-js/