
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 11 Issue: 06 | Jun 2024 www.irjet.net p-ISSN: 2395-0072

Text2Video: AI-driven Video Synthesis from Text Prompts


Shankar Tejasvi¹, Merin Meleet¹

¹Department of Information Science and Engineering, RV College of Engineering, Bengaluru, Karnataka, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - The emerging discipline of text-to-video synthesis combines computer vision and natural language understanding to create coherent, realistic videos from written descriptions. This research aims to bridge the fields of computer vision and natural language processing through a robust text-to-video production system. The system's main goal is to convert text prompts into visually appealing videos using pre-trained models and style transfer techniques, providing a fresh approach to content creation. The method demonstrates flexibility and effectiveness by building on well-known libraries such as PyTorch, PyTorch Lightning, and OpenCV. Through rigorous experimentation, the work highlights the potential of style transfer to boost the creative quality of visual outputs by producing videos with distinct styles. The outcomes illustrate how linguistic cues and artistic aesthetics can be successfully combined, as well as the system's implications for media production, entertainment, and communication. This study adds to the rapidly evolving field of text-to-video synthesis and exemplifies the opportunities that arise from the fusion of artificial intelligence and multimedia content production.

Key Words: Text to Video, Pre-Trained Models, Style Transfer, Multimedia Content Creation, Natural Language Processing

1. INTRODUCTION

Natural language processing (NLP) and computer vision have recently come together to revolutionise the way multimedia material is produced. A fascinating area of this confluence is text-to-video generation, which creates visual stories out of written prompts. Due to its potential applications in a variety of industries, including entertainment, education, advertising, and communication, this developing topic has attracted significant attention. Text-to-video generation offers a cutting-edge method of information sharing by enabling the transformation of written descriptions into compelling visual content.

This work explores the complexity of text-to-video production, with a focus on pre-trained models and style transfer methods. Its goal is to make it easier to convert textual cues into dynamic video sequences by utilising the strength of well-known frameworks such as PyTorch, PyTorch Lightning, and OpenCV. Contextual information is extracted from the input text and then converted into visual components.

The main goal of the work is to investigate how linguistic and visual cues can be combined to produce videos that accurately convey textual material while also displaying stylistic details. Style transfer, a key component of this system, enables existing visual styles to be adopted onto the produced videos, yielding visually striking results with a creative aesthetic. The system aims to demonstrate the effectiveness of its methodology by producing videos in a variety of styles, showcasing the possibilities for innovation and customization.

As artificial intelligence and multimedia continue to converge, this work contributes to the changing landscape of content creation by providing insights into the opportunities made possible by the interaction between language and vision. The research highlights the transformative possibilities of AI-driven multimedia synthesis by showcasing text-to-video production combined with style transfer.

2. LITERATURE REVIEW

The method for zero-shot image classification suggested in this study uses human gaze as auxiliary data. A data-collection paradigm built around a discrimination task is proposed to increase the information content of the gaze data. The paper also proposes three gaze embedding algorithms that exploit spatial layout, location, duration, sequential ordering, and the user's concentration characteristics to extract discriminative descriptors from gaze data. The technique is implemented on the CUB-VW dataset, and several experiments are conducted to evaluate its effectiveness. The results show that human gaze discriminates between classes better than mouse-click data and expert-annotated characteristics. The authors acknowledge that although their approach is generalizable to other areas, finer-grained datasets would benefit from different data-collection methodologies. Overall, the suggested strategy provides a more precise and organic way to identify class membership in zero-shot learning contexts. [1]

The study introduces Consistent Generative Query Networks (CGQN), a novel model that can construct upcoming frames in a video sequence without requiring consecutive input and output frames. To sample frames that are temporally consistent at all time steps, the model first generates a latent representation from any set of frames. The CGQN consumes input frames and samples output frames entirely in parallel, enforcing consistency of the sampled frames. This is accomplished by training on several correlated targets, sampling a global latent, and using a deterministic rendering network, in contrast to earlier video prediction models. The study also offers strong experimental evidence, in the form of stochastic 3D reconstruction and jumpy video forecasts, to back up the methodology. [2]

Using the technique presented in this study, motion blur is produced from a pair of unblurred images using a neural network architecture with a differentiable "line prediction" layer. The authors developed a synthetic dataset of motion-blurred images using frame interpolation techniques and then evaluated their model on a real test dataset. The approach is faster than frame interpolation and better suited to synthesising training data online for deep learning. [3]

TiVGAN, or Text-to-Image-to-Video Generative Adversarial Network, is a framework that produces videos from text descriptions. A single image is first produced using the framework's incremental evolutionary generator, which then gradually turns that image into a video clip of the appropriate length. The generator stabilizes the training process while conditioning on the input text using a number of techniques, including a two-stage training procedure, a progressive growth strategy, and a feature matching loss. The network is trained using a combination of adversarial loss, feature matching loss, and perceptual loss. The precise organizational structure of the network is provided in the supplementary material. [4]

This study reviews in depth the most recent video Generative Adversarial Network (GAN) models. The paper begins by recapping earlier reviews of GANs, identifying gaps in the research, and providing an overview of the main advancements made by GAN models and their variations. After dividing video GAN models into unconditional and conditional models, the research reviews each category. The work concludes with a discussion of probable future directions for video GAN research. Overall, this work is a valuable resource for anyone interested in the most recent developments in video GAN models and their potential uses in a range of industries. [5]

Creating a domain-specific ontology and preparing a corpus of physics questions and answers are steps in the process of putting PhysNLU into practice. PhysNLU provides multiple-choice physics questions to test a model's command of natural language, and it evaluates the coherence of generated explanations with automatic metrics and human annotations. A platform for crowdsourcing expert annotations is built into the resource. PhysNLU is a helpful tool for assessing how well NLP models understand physics and generate logical explanations; it enhances NLU evaluation in the physics domain and increases the robustness of NLP systems in this particular subject. [6]

This paper presents a novel method for producing videos that makes use of diffusion probabilistic models. Based on the concepts of diffusion processes, the authors propose a method that iteratively transforms a noise distribution to resemble the target distribution. They provide a full study of the techniques currently employed for video generation and emphasise how poorly those techniques capture complex temporal dynamics. The suggested diffusion probabilistic model effectively models video data by utilising the temporal links between frames. The authors demonstrate the effectiveness of their method on numerous video datasets and compare it to state-of-the-art methods. The results show that the diffusion model generates high-quality videos with better temporal coherence and realism. The paper advances the field of video generation with a potent and successful strategy based on diffusion probabilistic modeling. [7]

The paper introduces a novel video generation technique based on video diffusion models. The authors propose extending diffusion probabilistic models to handle sequential data such as videos. They discuss the technical details of their strategy, which entails iteratively applying a diffusion process to create each frame of a video. The authors demonstrate the effectiveness of their approach by training video diffusion models on various video datasets and displaying the generated videos. They also compare their strategy to other cutting-edge video generation methods, indicating that it performs better in terms of capturing complex temporal dynamics and visual quality. The study advances the field of video generation by providing a fresh and effective approach built on video diffusion models. [8]

CogVideo, a comprehensive pretraining technique for transformer-based text-to-video generation, is introduced in the paper. The authors propose a novel architecture built on a two-stage pretraining procedure that captures both textual and visual data: a language model is first pretrained on a sizable text corpus and then refined using a huge video dataset. Additionally, they create a new pretraining objective known as Cross-modal Generative Matching (CGM) in order to align the text and video representations. The authors evaluate the performance of CogVideo on several text-to-video generation tasks in comparison to other methods. The field of text-to-video generation advances with this strong and effective transformer-based pretraining technique. [9]
This work describes a way to generate videos from text without paired text-video data. The authors suggest a novel framework called Make-A-Video that combines a text-to-image model with a video prediction model. The text-to-image model initially builds a static image representation based on the input text, and then the video prediction model gradually creates future frames using motion information. By training these models in a self-supervised manner, the authors demonstrate that it is possible to create videos from text descriptions without the need for coupled text-video data. The proposed approach is tested on several datasets, confirming its effectiveness in creating engaging and diverse videos. The study contributes to the field of text-to-video creation by outlining a promising technique that does away with the need for text-video training pairs. [10]

This study introduces Imagen Video, a technique for producing high-definition videos using diffusion models. The authors propose an extension of the diffusion probabilistic model designed specifically for producing high-quality videos. They create a hierarchical diffusion approach to capture both spatial and temporal dependencies within the video frames. After being trained at a variety of noise levels, the model creates each frame by iteratively denoising. The authors demonstrate Imagen Video's effectiveness on various video datasets, emphasising its ability to create high-resolution videos with improved visual clarity and coherence through an efficient generation procedure. [11]

The article provides a brief introduction to DiffusionDB, a massive prompt gallery dataset designed for building and testing text-to-image generative models. The authors address the issue of limited diversity and quality in existing prompt databases by gathering a large collection of varied and aesthetically appealing prompts. Each of the text prompts in DiffusionDB is paired with a high-quality image drawn from an open-access library, and a vast variety of visual concepts, objects, situations, and styles was explicitly included when the dataset was designed. The authors also offer evaluation metrics to assess the diversity and coverage of the prompt galleries, and they develop and evaluate cutting-edge text-to-image generative models to demonstrate DiffusionDB's utility. The paper's large-scale prompt gallery dataset is a valuable tool that enables more thorough training and evaluation of text-to-image generative models. [12]

In this study, Text2Video-Zero, a novel technique for zero-shot video creation using text-to-image diffusion models, is introduced. The authors present a method that employs pre-trained text-to-image models to create video sequences directly from written descriptions without the need for video-specific training. By conditioning the diffusion process on textual inputs, the model can generate a variety of intriguing and well-coordinated video frames. The authors demonstrate Text2Video-Zero's effectiveness and its ability to create convincing and artistically appealing videos on a variety of video datasets. Using pre-trained text-to-image models and only textual descriptions, the method offers a zero-shot approach to video creation. [13]

The paper presents a novel approach for text-to-image synthesis using generative adversarial networks (GANs). The authors propose a model that incorporates a text encoder, an image generator, and a discriminator network. The text encoder maps textual descriptions to an embedding space that the image generator uses to synthesise images. The discriminator network distinguishes between real and fake images and offers feedback to guide image generation. By evaluating their model on benchmark datasets, the authors show how effectively it can generate visually coherent and semantically meaningful images from text inputs, effectively closing the gap between written descriptions and the generation of realistic images. [14]

This work introduces Promptify, a method for text-to-image generation that combines interactive prompt exploration with large language models. The authors propose a novel framework that combines human input with large language models to generate high-quality images from textual prompts. With Promptify, users may iteratively tweak the prompt and get rapid visual feedback from the model. This interactive exploration allows users to precisely control the production of the desired image by altering the prompt phrasing. The authors demonstrate the success of Promptify via user studies and comparisons with alternative strategies, emphasising improved image quality and user satisfaction. [15]

3. METHODOLOGY OF PROPOSED SYSTEM

The proposed text-to-video synthesis system employs a systematic methodology that seamlessly integrates textual descriptions with visual content to produce coherent and realistic videos.

The videos produced fall into two classes:
• Generic Videos
• Enhanced Videos

The following flow diagram represents the steps involved in the proposed system:

Fig- 1. Flow Diagram

Generating a video from a given text prompt involves the following six-stage process:

1. Environment Setup:
   • Ensure CUDA availability in the notebook settings for GPU acceleration.
   • Install Miniconda, granting execution permissions and setting the installation path to /usr/local.
   • Replace the system's Python version with version 3.8 using update-alternatives.
   • Install necessary packages using apt-get and python3-pip.

2. Project Setup and Dependencies:
   • Clone the VideoCrafter project repository and navigate to the relevant directory.
   • Set the PYTHONPATH to include the project directory.
   • Install the required PyTorch and related packages with specified versions for compatibility.
   • Install additional libraries such as PyTorch Lightning, OmegaConf, and OpenCV using python3-pip.
   • Install packages like AV and MoviePy for multimedia processing.

3. Model Acquisition and Configuration:
   • Clone the VideoLORA model repository and move the models to the appropriate directory.
   • Define the available VideoLORA styles and their corresponding paths.

4. Text-to-Video Generation (Base Model):
   • Set the desired text prompt and output directory.
   • Specify the path to the base model checkpoint and its configuration.
   • Execute sample_text2video.py with the provided parameters for base text-to-video generation.
   • Display the resulting video using HTML to showcase the generated content.

5. Text-to-Video Generation with Style Transfer (VideoLORA):
   • Select a VideoLORA style from the available options based on the chosen LORA_PATH.
   • Set the prompt, output directory, and style parameters.
   • Execute sample_text2video.py with additional parameters for VideoLORA integration.
   • Display the style-transferred video using HTML to visualize the synthesized content.

6. Result Visualization:
   • Extract the latest video file generated in the output directory.
   • Convert the video into a data URL for display using base64 encoding.
   • Display the video animation in the notebook using HTML and the data URL (a minimal sketch of this step follows the list).
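
The result-visualization stage can be made concrete with a short notebook snippet. The sketch below is illustrative rather than the project's exact code: it assumes an IPython/Jupyter environment and a hypothetical results/ output directory, and simply embeds the newest MP4 as a base64 data URL.

```python
# Minimal sketch of stage 6: embed the most recently generated video in a notebook.
# Assumes an IPython/Jupyter environment; the output directory name is illustrative.
import base64
from pathlib import Path
from IPython.display import HTML, display

output_dir = Path("results")  # hypothetical output directory used by the generation step
latest_video = max(output_dir.glob("*.mp4"), key=lambda p: p.stat().st_mtime)

# Convert the video into a base64 data URL and render it with an HTML <video> tag.
video_bytes = latest_video.read_bytes()
data_url = "data:video/mp4;base64," + base64.b64encode(video_bytes).decode("ascii")
display(HTML(f'<video width="512" controls autoplay loop src="{data_url}"></video>'))
```

The same data-URL approach underlies the HTML display mentioned in stages 4 and 5.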

3.1 PyTorch:
PyTorch, a flexible and popular deep learning framework, is central to this system. Its dynamic computation graph and automatic differentiation make it possible to build the sophisticated neural networks used for text-to-video synthesis. PyTorch excels at managing a variety of model topologies and provides the flexibility needed to handle the complexities of producing videos from text input. Additionally, its GPU acceleration substantially increases computation speed, which is crucial for processing large amounts of data efficiently during training and inference. PyTorch's large ecosystem of pre-built modules and community support further streamlines the integration of cutting-edge deep learning algorithms into this system.
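
As a minimal illustration of the properties described above (dynamic graph construction, automatic differentiation, and optional GPU use), the toy snippet below stands in for the real text-to-video network; the module and tensors are placeholders, not the system's actual model.

```python
# Minimal PyTorch sketch: dynamic graph, autograd, and optional GPU acceleration.
# The toy module below is illustrative only; it is not the text-to-video model itself.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

toy_model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64)).to(device)
text_features = torch.randn(8, 64, device=device)   # stand-in for encoded prompt features
target_frames = torch.randn(8, 64, device=device)   # stand-in for visual targets

prediction = toy_model(text_features)                # the graph is built dynamically here
loss = nn.functional.mse_loss(prediction, target_frames)
loss.backward()                                      # automatic differentiation
print(f"loss on {device}: {loss.item():.4f}")
```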

3.2 PyTorch Lightning:
PyTorch Lightning boosts efficiency by abstracting away low-level training loop details and streamlining distributed training. The package automates data parallelism, checkpointing, and GPU allocation, enabling the system to scale smoothly across the available hardware. Its standardised structure improves code readability and maintainability, which promotes teamwork. Built-in support for features such as gradient accumulation and mixed-precision training optimises training effectiveness while using less memory. With PyTorch Lightning handling these concerns, the system can concentrate on model architecture and experimental design, expediting the investigation of cutting-edge methods for text-to-video synthesis.
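
A hedged sketch of how such a training loop might be wrapped is shown below; the tiny module and synthetic data are placeholders, and the Trainer arguments follow recent PyTorch Lightning releases, so names may differ in older versions.

```python
# Sketch of wrapping a training loop with PyTorch Lightning; illustrative only.
import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class ToyText2VideoModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))

    def training_step(self, batch, batch_idx):
        features, targets = batch
        loss = nn.functional.mse_loss(self.net(features), targets)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

dataset = TensorDataset(torch.randn(256, 64), torch.randn(256, 64))
trainer = pl.Trainer(
    max_epochs=1,
    accelerator="auto",          # picks a GPU when one is available
    precision="16-mixed",        # mixed-precision training (assumes a GPU)
    accumulate_grad_batches=4,   # gradient accumulation
)
trainer.fit(ToyText2VideoModule(), DataLoader(dataset, batch_size=32))
```

The same Trainer flags scale the run across multiple devices without changes to the module code, which is the main efficiency argument made above.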


3.3 OpenCV:
To manage video data, the system relies on OpenCV, the Open Source Computer Vision Library. Its extensive feature set enables effective image processing, frame extraction, and video file I/O. OpenCV's capabilities also include feature detection, object tracking, and video stabilisation, with tools to improve video quality and coherence. Its support for a wide variety of image formats guarantees compatibility with multiple multimedia sources, which is essential when working with heterogeneous data in the text-to-video pipeline. The system uses OpenCV for reliable video manipulation, which is essential for turning written descriptions into aesthetically appealing and cohesive video outputs.
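
A minimal OpenCV sketch of the kind of frame-level I/O described here is given below; the file names, resolution, and resizing step are illustrative assumptions rather than the system's actual pipeline.

```python
# Minimal OpenCV sketch: extract frames from a generated clip and re-encode them.
# File names and resolution are illustrative.
import cv2

cap = cv2.VideoCapture("results/sample.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 8.0
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(cv2.resize(frame, (512, 512)))  # simple per-frame processing
cap.release()

writer = cv2.VideoWriter("results/processed.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"), fps, (512, 512))
for frame in frames:
    writer.write(frame)
writer.release()
```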

3.4 AV:
AV is a multimedia library that gives the system the ability to manage sophisticated audiovisual data. It makes it possible to seamlessly combine audio and video elements, ensuring synchronisation and improving the overall usability of the output videos. AV excels at managing codecs, handling multimedia metadata, and parsing and decoding video files. Its support for several video formats and for streaming further increases the system's adaptability, accommodating various data inputs and output formats. The system incorporates AV to ensure that the final videos appropriately reflect the intended storylines and styles developed from the text prompts while maintaining the integrity of the multimedia content.
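
The snippet below sketches how frames could be encoded with the av package (PyAV); the codec, frame rate, and synthetic frame data are assumptions made purely for illustration.

```python
# Sketch of writing generated frames with PyAV (the "av" package); frame data is synthetic.
import av
import numpy as np

container = av.open("results/av_demo.mp4", mode="w")
stream = container.add_stream("libx264", rate=8)   # 8 fps, H.264 codec (assumed settings)
stream.width, stream.height = 512, 512
stream.pix_fmt = "yuv420p"

for i in range(16):                                 # 16 placeholder frames
    rgb = np.full((512, 512, 3), fill_value=i * 16, dtype=np.uint8)
    frame = av.VideoFrame.from_ndarray(rgb, format="rgb24")
    for packet in stream.encode(frame):
        container.mux(packet)

for packet in stream.encode():                      # flush the encoder
    container.mux(packet)
container.close()
```

Audio streams can be added to the same container in an analogous way, which is where the synchronisation mentioned above comes from.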

3.5 MoviePy:
The system uses MoviePy, a video editing library, as its creative toolbox for enhancing video outputs. Its simple API lets the system smoothly incorporate transitions, apply visual effects, and assemble clips. MoviePy's text and image overlay capabilities enable the inclusion of extra information or branding in the videos. Its ability to concatenate videos, cut segments, and alter video characteristics makes it easier to produce polished, professional-looking results. MoviePy's integration with several video file formats ensures that the output videos are compatible with a wide range of multimedia players and platforms. The system uses MoviePy to add an artistic layer to the generated videos, improving their visual appeal and narrative quality.
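
The following sketch shows the kind of post-processing MoviePy enables, using the MoviePy 1.x API; the clip paths and caption are illustrative, and TextClip additionally requires ImageMagick to be installed.

```python
# Sketch of post-processing with MoviePy (1.x API): concatenate two generated clips
# and overlay the prompt as a caption. File names are illustrative assumptions.
from moviepy.editor import (VideoFileClip, TextClip, CompositeVideoClip,
                            concatenate_videoclips)

base = VideoFileClip("results/base.mp4")
styled = VideoFileClip("results/styled.mp4")
combined = concatenate_videoclips([base, styled])

caption = (TextClip("An astronaut riding horse in outer space",
                    fontsize=36, color="white")
           .set_duration(combined.duration)
           .set_position(("center", "bottom")))

CompositeVideoClip([combined, caption]).write_videofile("results/final.mp4", fps=8)
```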

4. RESULTS AND DISCUSSION

The system's results shed important light on the efficacy and potential of the proposed text-to-video generation methodology. The produced videos show an impressive conversion of written information into lively visual tales. The underlying text-to-video model's effective transformation of prompts into cohesive video sequences demonstrates the system's ability to grasp textual nuances and communicate them visually. By including style transfer through VideoLORA, the created videos are further improved and infused with distinctive artistic styles that match the selected prompts. The outcomes highlight how well the system combines complex visual components with linguistic cues to produce videos that are resonant in both substance and style.

In the context of style transfer, the selection of VideoLORA styles has a large impact on the personality and atmosphere of the generated videos. The successful integration of artistic styles highlights the opportunity for individualised and flexible video content production. The system's reliance on effective libraries like PyTorch and PyTorch Lightning further simplifies the generation process and enables quick testing and improvement. The system's demonstrated capabilities, supported by reliable and repeatable findings, set the groundwork for subsequent developments in multimedia synthesis, stimulating further research and development at the nexus of computer vision and natural language processing.

The following figures illustrate image frames of videos generated using the various model classes:

Text Prompt: An astronaut riding horse in outer space

Fig- 2. An astronaut riding horse in outer space – Generalised Video

Fig- 3. An astronaut riding horse in outer space – COCOStyle

Fig- 4. An astronaut riding horse in outer space – MakotoShinkai

5. CONCLUSION

By utilising the synergy between natural language processing and computer vision, this system acts as a creative and dynamic investigation of text-to-video production. It successfully connects textual prompts with visually appealing video outputs by integrating pre-trained models, style transfer, and a comprehensive set of libraries including PyTorch, PyTorch Lightning, OpenCV, AV, MoviePy, and OmegaConf. The combination of linguistic cues with visual aesthetics reveals the potential for creative content production across a range of fields, including communication, education, and entertainment. The methodical implementation, founded on effective code execution and best practices, highlights the applicability and adaptability of the methodology. This system adds to the changing landscape of content creation by highlighting the capabilities of AI-driven multimedia synthesis, paving the way for interesting developments at the nexus of artificial intelligence and multimedia technologies.

6. LIMITATIONS AND FUTURE SCOPE

Despite the text-to-video synthesis system's achievements, several drawbacks of its current design must be acknowledged. The model's dependence on its training data is a noteworthy restriction; deviations from the training corpus may lead to errors. Additionally, the high computing demands of training and of producing videos can hamper real-time applications. Future versions could concentrate on improving model generalisation and increasing computational efficiency to address these limitations.

There are many opportunities for growth and improvement in the future. The system's realism and congruence with human perception might be strengthened by incorporating user feedback through human evaluations. Investigating methods for finer-grained control over video qualities, such as style and mood, could produce a wider variety of results. The system's flexibility across different settings might be improved by adding a greater variety of textual stimuli to the dataset. Additionally, improvements in transfer learning and multimodal pre-training may open up new possibilities for text-to-video synthesis. While promising, this study only touches the surface of a broad field, leaving plenty of potential for creativity and inquiry at the dynamic confluence of text and visual information.

REFERENCES

[1] Karessli, N., Akata, Z., Schiele, B., & Bulling, A. (2017). Zero-Shot Image Classification using Human Gaze as Auxiliary Information. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4525-4534. doi:10.1109/CVPR.2017.679

[2] Kumar, A., Eslami, S. M. A., Rezende, D., Garnelo, M., Viola, F., Lockhart, E., & Shanahan, M. (2019). Consistent generative query networks for future frame prediction in videos. arXiv preprint arXiv:1807.02033.

[3] Brooks, T., & Barron, J. T. (2019). Generating motion blur from unblurred photos using neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6840-6848. doi:10.1109/cvpr.2019.06840

[4] Kim, D., Joo, D., & Kim, J. (2020). TiVGAN: Text-to-image-to-video generative adversarial network. IEEE Access, 8, 153113-153122. doi:10.1109/access.2020.2986494

[5] Aldausari, N., Sowmya, A., Marcus, N., & Mohammadi, G. (2022). Review of video generative adversarial networks (GANs) models. ACM Computing Surveys, 55(2), Article 30. doi:10.1145/3487891

[6] Meadows, J., Zhou, Z., & Freitas, A. (2022). PhysNLU: A tool for evaluating natural language understanding in physics. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 4904-4912. doi:10.18653/lrec-2022-4904

[7] Yang, R., Srivastava, P., & Mandt, S. (2022). Video creation using diffusion probabilistic models. arXiv preprint arXiv:2203.09481.

[8] Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., & Fleet, D. J. (2022). Video generation using video diffusion models. arXiv preprint arXiv:2204.03409.

[9] Hong, W., Ding, M., Zheng, W., Liu, X., & Tang, J. (2022). CogVideo: Large-scale pretraining for transformer-based text-to-video generation. arXiv preprint arXiv:2205.15868.

[10] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., Parikh, D., Gupta, S., & Taigman, Y. (2022). Make-A-Video: Text-to-Video Generation without Text-Video Data. arXiv preprint arXiv:2209.14792.

[11] Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., & Salimans, T. (2022). Imagen Video: High-Definition Video Generation using Diffusion Models. arXiv preprint arXiv:2210.02303.

[12] Wang, Z. J., Montoyo, E., Munechika, D., Yang, H., Hoover, B., & Chau, D. H. (2022). DiffusionDB: A sizable prompt gallery dataset for text-to-image generative models. arXiv preprint arXiv:2210.11890.

[13] Khachatryan, L., Movsisyan, A., Tadevosyan, V., Henschel, R., Wang, Z., Navasardyan, S., & Shi, H. (2023). Text2Video-Zero: Zero-Shot Video Generation using Text-to-Image Diffusion Models. arXiv preprint arXiv:2303.13439.

[14] Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., & Lee, H. (2016). Text-to-image synthesis using generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4794-4803. doi:10.1109/cvpr.2016.296

[15] Brade, S., Wang, B., Sousa, M., Oore, S., & Grossman, T. (2023). Promptify: Interactive prompt exploration for text-to-image generation. arXiv preprint arXiv:2304.09337.
