www.ijcspub.org © 2025 IJCSPUB | Volume 15, Issue 1 March 2025 | ISSN: 2250-1770
Automated Image Captioning And Voice
Generation Using Deep Learning Technologies
1Shrikrushna Deore, 2Prachi Lalage, 3Shreyas Ghodchore, 4Prof. Prachi Waghmare
1,2,3Student, 4Assistant Professor,
Department of Computer Engineering,
Nutan Maharashtra Institute of Engineering and Technology, Pune, India.
Abstract: Automated image captioning and voice generation have emerged as transformative technologies,
enabling machines to interpret visual content and generate human-like descriptions. This research examines
the incorporation of deep learning models, focusing on CNNs for image processing and Recurrent Neural
Networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, for producing descriptive text.
The research further investigates the role of text-to-speech (TTS) systems in converting these generated
captions into natural-sounding speech. These technologies are crucial for improving accessibility, particularly
for visually impaired individuals, and enhancing user engagement across multimedia platforms. The study
highlights the impact of automated image captioning and voice generation in content creation, education, and
accessibility. Challenges such as dataset availability, model accuracy, and computational complexity are
discussed, with a focus on potential solutions and future research directions. Ultimately, the findings
underscore the potential of these technologies to foster more inclusive, interactive, and engaging digital
experiences.
Index Terms - Automated Image Captioning, Deep Learning, Convolutional Neural Networks, Recurrent
Neural Networks.
I. INTRODUCTION
The development of automated image captioning and voice generation systems has transformed how
artificial intelligence interprets and communicates visual content. These technologies enable machines to
analyze images, describe them in natural language, and convert the text into human-like speech. Their
applications are particularly valuable for enhancing accessibility for visually impaired users, improving
multimedia interactions, and automating content generation for various platforms. Deep learning methods have
significantly contributed to the advancement of these systems. Convolutional Neural Networks (CNNs) are
commonly utilized for extracting relevant features from images, whereas Recurrent Neural Networks (RNNs),
particularly Long Short-Term Memory (LSTM) networks, excel in generating sequential text. Additionally,
attention mechanisms and Transformer-based architectures further enhance caption accuracy by
dynamically emphasizing relevant image regions.
Text-to-speech (TTS) synthesis has also seen notable improvements with deep learning. Traditional speech
synthesis techniques have largely been replaced by models such as WaveNet and Tacotron 2, which generate
natural and expressive speech using neural networks. By integrating image processing, text generation, and
voice synthesis, a cohesive and efficient system can be developed to transform visual data into spoken language
in real time. Although significant advancements have been made, challenges such as dataset limitations, model
efficiency, and computational complexity persist. This research focuses on optimizing automated image
captioning and voice generation by addressing these challenges and evaluating their practical implementation.
II. SYSTEM ARCHITECTURE
1. Image Processing Module: The system begins with the image processing module, which utilizes
Convolutional Neural Networks (CNNs) such as ResNet or VGGNet. These networks extract high-level
features from input images. Feature extraction is crucial for understanding the content and context of the image.
Preprocessing techniques such as normalization, resizing, and noise reduction are applied to ensure higher
accuracy and consistency in feature extraction.
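The snippet below is a minimal sketch of this feature-extraction step, assuming PyTorch and torchvision; the choice of ResNet-50, the 224×224 input size, and the ImageNet normalization constants are illustrative rather than prescribed by this work.
```python
# Minimal sketch of CNN feature extraction with a pre-trained ResNet (PyTorch/torchvision).
# Backbone, input size, and normalization constants are illustrative assumptions.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Standard ImageNet-style preprocessing: resize and normalize the input image.
preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
# Drop the final classification layer so the network outputs a feature vector.
encoder = torch.nn.Sequential(*list(resnet.children())[:-1])
encoder.eval()

def extract_features(image_path: str) -> torch.Tensor:
    """Return a 2048-dimensional feature vector for one image."""
    image = Image.open(image_path).convert("RGB")
    x = preprocess(image).unsqueeze(0)   # shape: (1, 3, 224, 224)
    with torch.no_grad():
        f = encoder(x)                   # shape: (1, 2048, 1, 1)
    return f.flatten(1)                  # shape: (1, 2048)
```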
2. Caption Generation Module: After extracting image features, they are fed into the caption generation
module, which leverages Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units.
This module follows an encoder-decoder architecture, where the CNN-based encoder transforms the image
into a feature vector, and the LSTM-powered decoder sequentially generates captions. To enhance contextual
accuracy, the decoder employs an attention mechanism, allowing it to dynamically focus on different regions
of the image while producing descriptive text. The model is optimized using beam search and reinforcement
learning techniques to enhance caption coherence and diversity.
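As an illustration of this decoder, the following sketch implements a single decoding step of an LSTM with additive (Bahdanau-style) attention in PyTorch; the layer dimensions, module names, and attention form are assumptions, not the exact configuration of the system described here.
```python
# One decoding step of an LSTM decoder with additive attention over CNN feature-map
# regions. All dimensions and names are illustrative.
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.lstm_cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, features, h, c):
        # features: (batch, regions, feat_dim) spatial CNN features.
        # Attention weights decide how much each region contributes to the next word.
        scores = self.att_score(torch.tanh(
            self.att_feat(features) + self.att_hidden(h).unsqueeze(1)))  # (B, R, 1)
        alpha = torch.softmax(scores, dim=1)
        context = (alpha * features).sum(dim=1)        # (B, feat_dim) attended context
        # LSTM step on [previous word embedding ; attended image context].
        h, c = self.lstm_cell(torch.cat([self.embed(prev_word), context], dim=1), (h, c))
        logits = self.out(h)                           # unnormalized scores over the vocabulary
        return logits, h, c, alpha
```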
3. Text-to-Speech (TTS) Module: The generated textual caption is then fed into the TTS module, which
converts it into human-like speech. This module employs advanced neural vocoders such as WaveNet,
Tacotron 2, or FastSpeech. The process begins with linguistic processing, where the text is tokenized and
converted into phonemes. Next, prosody modeling is applied to add stress, intonation, and rhythm, making the
speech sound more natural. Finally, the neural vocoder synthesizes high-quality audio output, ensuring clarity
and expressiveness in generated speech.
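The skeleton below mirrors these three TTS stages; the helper functions are hypothetical placeholders standing in for a real Tacotron 2 or WaveNet pipeline and return dummy arrays so the control flow stays runnable.
```python
# Skeleton of the three TTS stages (linguistic processing, prosody/acoustic modeling,
# neural vocoding). The helpers are hypothetical stand-ins, not a real TTS library API.
import numpy as np

def text_to_phonemes(text: str) -> list[str]:
    """Placeholder: tokenize text and map it to a phoneme sequence."""
    return list(text.lower())                          # stand-in: character-level "phonemes"

def acoustic_model(phonemes: list[str]) -> np.ndarray:
    """Placeholder: predict a mel-spectrogram (frames x mel bins) with prosody."""
    return np.zeros((len(phonemes) * 5, 80), dtype=np.float32)

def neural_vocoder(mel: np.ndarray, sample_rate: int = 22050) -> np.ndarray:
    """Placeholder: synthesize a waveform from the mel-spectrogram."""
    return np.zeros(mel.shape[0] * 256, dtype=np.float32)

def caption_to_speech(caption: str) -> np.ndarray:
    phonemes = text_to_phonemes(caption)               # 1. linguistic processing
    mel = acoustic_model(phonemes)                     # 2. prosody/acoustic modeling
    return neural_vocoder(mel)                         # 3. waveform synthesis
```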
Fig. 1. System Architecture
III. MATHEMATICAL MODEL
The proposed system for automated image captioning and voice generation is composed of multiple
components, each of which can be mathematically formulated. Below are the key elements of the model:
1. Image Feature Extraction
An input image is expressed as a matrix of pixel values $I \in \mathbb{R}^{H \times W \times C}$, where:
$H$ denotes the image height,
$W$ represents the image width,
$C$ indicates the number of color channels.
A Convolutional Neural Network (CNN) processes the image to extract meaningful features, resulting in a
feature vector:
$$f = \mathrm{CNN}(I; \theta_c), \qquad f \in \mathbb{R}^{d}$$
where:
$\mathrm{CNN}(\cdot)$ is the CNN function,
$\theta_c$ denotes the parameters of the CNN,
$d$ represents the dimensionality of the feature vector.
2. Caption Generation
The extracted feature vector $f$ serves as input to a sequence generation model, such as a Recurrent Neural
Network (RNN) or Long Short-Term Memory (LSTM) network. The goal is to generate a sequence of words
forming a caption:
$$Y = (y_1, y_2, \ldots, y_T)$$
where $Y$ consists of $T$ words.
Caption generation follows a probabilistic sequence model:
$$P(Y \mid f) = \prod_{t=1}^{T} P(y_t \mid y_1, \ldots, y_{t-1}, f)$$
Each word is generated using a softmax function:
$$P(y_t \mid y_1, \ldots, y_{t-1}, f) = \mathrm{softmax}(W_h h_t)$$
where:
$h_t$ represents the hidden state of the RNN/LSTM at time step $t$,
$W_h$ denotes the weight matrix linked to the hidden state,
$V$ signifies the vocabulary size (the dimension of the softmax output).
3. Loss Function for Caption Generation
The caption generation model is trained using cross-entropy loss:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \log P\!\left(y_t^{i} \mid y_1^{i}, \ldots, y_{t-1}^{i}, f^{i}\right)$$
where:
$N$ represents the total number of training samples,
$y_t^{i}$ denotes the target word at time step $t$ for the $i$-th training sample.
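For concreteness, a hedged PyTorch example of this objective is shown below; the batch size, sequence length, and vocabulary size are illustrative.
```python
# Cross-entropy loss over a batch of caption sequences (illustrative shapes).
import torch
import torch.nn as nn

vocab_size, batch, steps = 10000, 2, 5
logits = torch.randn(batch, steps, vocab_size, requires_grad=True)  # decoder outputs W_h h_t
targets = torch.randint(0, vocab_size, (batch, steps))              # ground-truth word indices y_t^i

criterion = nn.CrossEntropyLoss()                  # applies log-softmax internally
# CrossEntropyLoss expects the class (vocabulary) axis second, so permute the logits.
loss = criterion(logits.permute(0, 2, 1), targets)
loss.backward()                                    # gradients flow back into the captioning model
```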
4. Text-to-Speech Conversion
Given the generated caption $Y$, the corresponding audio output $A$ is produced by a text-to-speech (TTS)
model:
$$A = \mathrm{TTS}(Y; \theta_s)$$
where:
$\mathrm{TTS}(\cdot)$ represents the TTS model function,
$\theta_s$ denotes the model parameters.
Speech quality can be assessed using subjective evaluation metrics like the Mean Opinion Score (MOS).
5. Evaluation Metrics
The performance of the caption generation model is assessed using various evaluation metrics, including:
BLEU Score
The BLEU (Bilingual Evaluation Understudy) score evaluates how closely the generated captions match the
reference captions by measuring their similarity:
$$\mathrm{BLEU} = BP \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$
where:
$r$ represents the length of the reference caption (and $c$ that of the generated caption),
$BP$ is the brevity penalty,
$p_n$ denotes the precision of $n$-grams, weighted by $w_n$.
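A small example of computing BLEU for a single generated caption with NLTK is given below; the captions themselves are made up for illustration.
```python
# Sentence-level BLEU for one generated caption using NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "dog", "runs", "across", "the", "grass"]]   # tokenized reference caption(s)
candidate = ["a", "dog", "is", "running", "on", "grass"]        # tokenized generated caption

# Smoothing avoids zero scores when some higher-order n-grams have no matches.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```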
CIDEr Score
The CIDEr (Consensus-based Image Description Evaluation) score measures the level of agreement between
generated captions and reference captions by assessing their relevance and consistency:
$$\mathrm{CIDEr}_n(c_i, S_i) = \frac{1}{M} \sum_{j=1}^{M} \frac{g^{n}(c_i) \cdot g^{n}(s_{ij})}{\lVert g^{n}(c_i) \rVert \, \lVert g^{n}(s_{ij}) \rVert}$$
where $M$ is the total number of reference captions per image, $c_i$ is the generated caption, $s_{ij}$ are the
reference captions, and the $n$-gram vectors $g^{n}(\cdot)$ use TF-IDF (Inverse Document Frequency) weighting to
adjust the importance of words.
IV. ALGORITHMS
1. Image Feature Extraction using Convolutional Neural Networks (CNNs)
Algorithm: ResNet/VGGNet
ResNet (Residual Networks) and VGGNet (Visual Geometry Group Network) are deep CNN architectures
used for extracting high-level image features.
Process:
1. Preprocessing the Input Image:
o The input image is resized to a fixed dimension to ensure consistency.
o Pixel values are normalized to improve model performance.
o Data augmentation techniques may be applied during training for better generalization.
2. Feature Map Extraction Using CNN Layers:
o The image passes through multiple convolutional layers, activation functions, and pooling
layers.
o Lower layers detect basic patterns like edges, while deeper layers extract high-level features
representing objects and context.
3. Convert Features into a Structured Vector Representation:
o The final convolutional layer outputs a feature map that encodes spatially important
information.
o The last fully connected layer is removed, and the extracted feature vector is used as input for
the next stage.
2. Caption Generation using LSTM with Attention
Algorithm: Encoder-Decoder LSTM with Attention
The model generates meaningful captions from image features by combining a CNN encoder with an LSTM
decoder and an attention mechanism.
Process:
1. Encode CNN Features and Initialize the LSTM Decoder:
o The extracted feature vector is processed through an LSTM encoder, which converts it into a
hidden state.
o This hidden state serves as the initial context for generating captions.
2. Apply an Attention Mechanism:
o The attention mechanism dynamically selects relevant parts of the image while generating each
word in the caption.
o Attention weights determine which regions contribute most to the next word in the sequence.
3. Generate a Text Sequence using the Softmax Function:
o The decoder LSTM predicts words sequentially, assigning probabilities to possible words using
the softmax function.
o The most probable word is selected at each step.
4. Optimize Caption Fluency with Beam Search:
o Beam search maintains multiple possible sequences to improve the quality of generated
captions.
o This results in more coherent and contextually relevant captions.
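The following is a simplified beam-search sketch corresponding to step 4; the decode_step callable is a hypothetical stand-in for the attention decoder and is assumed to return log-probabilities over the vocabulary for a partial caption.
```python
# Simplified beam search over partial captions (illustrative sketch).
import torch

def beam_search(decode_step, start_token, end_token, beam_width=3, max_len=20):
    # Each beam entry: (accumulated log-probability, list of token ids).
    beams = [(0.0, [start_token])]
    for _ in range(max_len):
        candidates = []
        for log_prob, seq in beams:
            if seq[-1] == end_token:            # finished captions are kept as-is
                candidates.append((log_prob, seq))
                continue
            log_probs = decode_step(seq)        # (vocab_size,) log P(next word | seq)
            top_lp, top_ids = torch.topk(log_probs, beam_width)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((log_prob + lp, seq + [idx]))
        # Keep only the best `beam_width` partial captions at each step.
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam_width]
    return max(beams, key=lambda b: b[0])[1]    # highest-scoring sequence
```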
3. Text-to-Speech (TTS) Conversion using Neural TTS (WaveNet/Tacotron 2)
Algorithm: WaveNet / Tacotron 2
Neural TTS models convert generated text captions into natural-sounding speech.
Process:
1. Convert Text into Phoneme Sequences:
o The input text is tokenized and converted into phonemes, ensuring correct pronunciation.
2. Apply Prosody Modeling for Pitch and Rhythm:
o The model predicts speech characteristics such as pitch, duration, and rhythm to enhance
naturalness.
3. Synthesize Speech Waveforms using a Neural Vocoder:
o A mel-spectrogram is generated, representing the frequency distribution over time.
o A neural vocoder processes the spectrogram to produce a high-quality speech waveform.
4. Perform Post-processing to Refine Speech Quality:
o The final waveform is refined through denoising and filtering techniques.
o Adjustments ensure smooth, natural-sounding speech output.
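As an illustration of the mel-spectrogram intermediate representation mentioned in step 3, the snippet below computes one with librosa; the file name, sample rate, and mel settings are assumptions.
```python
# Mel-spectrogram computation with librosa (illustrative parameters).
import librosa
import numpy as np

y, sr = librosa.load("speech_sample.wav", sr=22050)       # waveform and sample rate
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)             # log scale, as used by neural vocoders
print(log_mel.shape)   # (80 mel bins, time frames)
```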
V. METHODOLOGY
The approach to automated image captioning and voice generation is systematically designed, incorporating
deep learning models for image processing, text generation, and speech synthesis. The fundamental steps in
this methodology are as follows:
1. Data Collection and Preprocessing
• Large-scale datasets such as MS COCO, Flickr8k, and ImageNet are utilized to train the image
captioning model. These datasets contain annotated images with human-generated captions, which serve
as ground truth for supervised learning.
• The collected images are preprocessed by resizing, normalizing, and removing noise to enhance
feature extraction. Augmentation techniques like rotation, flipping, and brightness adjustment are
applied to improve model generalization.
• Text captions undergo tokenization and padding to standardize input for the language model.
Stopword removal and stemming techniques are used to clean the text and reduce redundancy.
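A minimal sketch of the caption tokenization and padding step is shown below; the special tokens, vocabulary construction, and maximum caption length are illustrative assumptions rather than the exact preprocessing used here.
```python
# Caption tokenization and padding to a fixed length (illustrative sketch).
MAX_LEN = 20
SPECIALS = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}

def build_vocab(captions):
    vocab = dict(SPECIALS)
    for caption in captions:
        for word in caption.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(caption, vocab):
    ids = [vocab["<start>"]]
    ids += [vocab.get(w, vocab["<unk>"]) for w in caption.lower().split()]
    ids.append(vocab["<end>"])
    ids = ids[:MAX_LEN]
    return ids + [vocab["<pad>"]] * (MAX_LEN - len(ids))   # pad to a fixed length

captions = ["A dog runs across the grass", "Two children play football"]
vocab = build_vocab(captions)
print(encode(captions[0], vocab))
```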
2. Image Feature Extraction
• A pre-trained Convolutional Neural Network (CNN), such as ResNet or VGGNet, is utilized to extract
high-dimensional feature representations from input images.
• The extracted feature vectors are then processed through a fully connected layer to compress them into
a suitable format for the language model input.
• These features serve as input to the caption generation model, providing a structured representation of
image content.
3. Caption Generation Using LSTM-Based Model
• The caption generation module adopts an encoder-decoder framework, where the CNN functions as
the encoder, extracting image features, while an LSTM-based Recurrent Neural Network (RNN)
operates as the decoder, generating descriptive captions.
• The encoded image features are fed into the LSTM model, which sequentially generates words to
form meaningful captions.
• Attention mechanisms are integrated into the model to allow dynamic focus on specific image regions
while generating descriptions, ensuring contextual relevance.
• A beam search strategy is used to refine the generated captions by evaluating multiple candidate
sequences and selecting the most probable one.
4. Text-to-Speech (TTS) Conversion
• The generated captions are processed by a text-to-speech synthesis module to convert them into
human-like speech.
• The text is first tokenized into phonemes using a linguistic model.
• Prosody modeling techniques are applied to adjust pitch, rhythm, and intonation, making the speech
output more natural and expressive.
• A deep learning-based neural vocoder such as WaveNet or Tacotron 2 is employed to generate high-
quality speech waveforms from the processed phonemes.
• The synthesized speech is refined using post-processing techniques to enhance clarity and
intelligibility.
5. Model Training and Optimization
• The system is trained using a combination of supervised learning and reinforcement learning
techniques.
• The image captioning model is optimized with loss functions like cross-entropy loss and evaluated
using the BLEU score to enhance accuracy.
• The TTS system is fine-tuned using mean opinion score (MOS) ratings and perceptual quality metrics
to enhance speech naturalness.
• Transfer learning is utilized by fine-tuning pre-trained models on domain-specific datasets, enabling
them to adapt to various applications and improve performance in specific use cases.
• Model quantization and pruning techniques are used to reduce computational overhead and enable
real-time deployment on resource-constrained devices.
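As an example of the model-compression options mentioned above, the snippet below applies PyTorch post-training dynamic quantization to a toy captioning model; the model itself is illustrative and stands in for the trained system.
```python
# Post-training dynamic quantization of LSTM and Linear layers (illustrative model).
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))
        return self.out(h)

model = TinyCaptioner()
# Dynamic quantization stores LSTM/Linear weights as int8, reducing model size
# and speeding up CPU inference for deployment on constrained devices.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.LSTM, nn.Linear}, dtype=torch.qint8)
print(quantized)
```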
6. System Integration and Deployment
• The trained models are integrated into a unified framework that seamlessly connects image analysis,
caption generation, and speech synthesis.
• APIs and middleware components are developed to enable easy integration with multimedia
applications, assistive technologies, and real-time systems.
• The final system is deployed in cloud-based and edge computing environments, allowing for efficient
processing with minimal latency.
• Continuous monitoring and user feedback mechanisms are implemented to refine the model and
improve user experience over time.
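One possible integration surface is a small HTTP API; the sketch below assumes Flask, and generate_caption and synthesize_speech are hypothetical stand-ins for the trained captioning and TTS models.
```python
# Minimal HTTP endpoint wrapping the image-to-speech pipeline (Flask sketch).
from flask import Flask, request, jsonify

app = Flask(__name__)

def generate_caption(image_bytes: bytes) -> str:
    return "a placeholder caption"          # stand-in for CNN + LSTM inference

def synthesize_speech(caption: str) -> bytes:
    return b""                              # stand-in for the TTS vocoder output

@app.route("/describe", methods=["POST"])
def describe():
    image = request.files["image"].read()   # uploaded image file
    caption = generate_caption(image)
    audio = synthesize_speech(caption)
    return jsonify({"caption": caption, "audio_bytes": len(audio)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```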
VI. RESULTS
Fig. 2. Home Page
Fig. 3. Captioning
Fig. 4. Audio Output
The study has demonstrated significant progress in the field of automated image captioning and voice
generation, with notable improvements in both accuracy and naturalness.
Key results include:
1. Image Captioning Performance
• Accuracy Improvements: The integration of CNNs for feature extraction and RNNs or Transformer
models for language generation has led to substantial improvements in caption quality. The captions
produced by these systems are more contextually relevant and accurate compared to older, template-
based methods.
• Contextual Relevance: Transformer-based models, such as UNITER, and others incorporating
attention mechanisms have made significant strides in improving the alignment between visual
content and the generated captions. These systems have outperformed traditional RNN-based models,
particularly in generating detailed descriptions of complex images.
• Multimodal Systems: The combination of image captioning and voice generation has proven
effective, enabling seamless conversion of captions into speech. Deep learning-based text-to-speech
(TTS) models, such as WaveNet and Tacotron 2, have greatly improved the naturalness and
expressiveness of generated speech, making it more fluid and intelligible.
2. Text-to-Speech (TTS) Performance
• Naturalness and Fluidity: TTS models, especially WaveNet and Tacotron 2, have demonstrated a
marked improvement in speech synthesis. These systems produce speech that is far more natural and
expressive than earlier systems based on concatenative synthesis.
• Real-Time Challenges: Despite the high quality of speech, latency issues still exist in real-time
applications, especially when using models that require significant computational resources. For
instance, WaveNet is known for its high computational cost, making it challenging for real-time use
in resource-constrained environments.
3. Applications in Accessibility
• Visually Impaired Accessibility: The combination of automated image captioning and TTS has
shown significant potential in enhancing the accessibility of digital content for visually impaired
individuals. Systems that generate descriptive captions and convert them into speech are making a
meaningful impact in areas such as web browsing, social media, and educational content.
• Engagement in Content Creation: The automatic generation of captions and voice outputs for
images and videos has been effectively implemented in content creation tools for platforms like social
media and marketing. These systems increase engagement by automating the captioning process,
allowing creators to focus on other aspects of content production.
VII. CONCLUSION
Automated image captioning and voice generation have advanced AI’s role in accessibility and human-
computer interaction. By combining CNNs for image processing and LSTMs for text generation, these
systems produce more accurate and context-aware captions. TTS models like WaveNet and Tacotron 2 further
enhance natural and expressive voice synthesis, improving accessibility for visually impaired users and
enriching multimedia experiences.
Despite progress, challenges remain in dataset availability, model optimization, and computational
efficiency. Future improvements should focus on refining speech synthesis for more expressive output,
reducing computational overhead, and incorporating adaptive learning for personalized experiences.
Advancing these technologies will make digital interactions more inclusive, efficient, and engaging.
VIII. FUTURE SCOPE
Future research should prioritize real-time optimization to minimize processing delays, ensuring a smoother
and more efficient image-to-speech conversion. Enhanced personalization is another key area, where speech
generation could be tailored to individual preferences through adaptive learning mechanisms. Zero-shot
learning techniques should also be explored, enabling models to generate captions for unseen images without
extensive retraining. The deployment of lightweight models for mobile and IoT applications is another
important direction, ensuring that captioning and TTS capabilities can be effectively utilized in low-power
environments. Additionally, improving TTS models to generate more context-aware and emotionally
expressive speech will further enhance user engagement and accessibility.
IX. REFERENCES
1. Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and Tell: A Neural Image Caption
Generator. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
3156–3164. https://doi.org/10.1109/CVPR.2015.7298935
2. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., & Bengio, Y. (2015). Show, Attend
and Tell: Neural Image Caption Generation with Visual Attention. Proceedings of the 32nd International
Conference on Machine Learning (ICML), 2048-2057. https://arxiv.org/abs/1502.03044
3. Tacotron: Towards End-to-End Speech Synthesis. (2017). arXiv preprint arXiv:1703.10135.
https://arxiv.org/abs/1703.10135
4. Shen, J., Ping, W., & Xu, Y. (2018). Tacotron 2: Generating Human-like Speech from Text. Proceedings
of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 4779-
4783. https://doi.org/10.1109/ICASSP.2018.8461310
5. van den Oord, A., Vinyals, O., & Schuster, M. (2016). WaveNet: A Generative Model for Raw Audio.
arXiv preprint arXiv:1609.03499. https://arxiv.org/abs/1609.03499
6. Li, J., Zhang, H., & Liu, Z. (2019). UNITER: Learning Universal Image-Text Representations.
Proceedings of the European Conference on Computer Vision (ECCV), 1045–1062.
https://arxiv.org/abs/1909.11740
7. Huang, Z., & Chan, W. (2017). Attention-based Models for Speech Recognition. Proceedings of the
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 5370-5374.
https://doi.org/10.1109/ICASSP.2017.7953020
8. Zhang, X., & Cheng, Y. (2020). Deep Learning for Image Captioning and Text-to-Speech Synthesis: A
Survey. Journal of Computer Science and Technology, 35(2), 389-413. https://doi.org/10.1007/s11390-
020-0132-5
9. Agarwal, A., & Schwing, A. G. (2017). Learning to Describe Scenes with Generative Models. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 39(10), 2042–2051.
https://doi.org/10.1109/TPAMI.2016.2585070
10. Desai, A., & Jain, A. (2022). Enhancing Multimodal Models for Real-World Applications. Journal of
Artificial Intelligence Research, 71(4), 560-576. https://doi.org/10.1613/jair.1.12145