The study introduces Consistent Generative Query Networks (CGQN), a model that can predict future frames of a video sequence without requiring consecutive input and output frames. To sample frames that are temporally consistent at all times, the model first builds a latent representation from an arbitrary set of frames. CGQN then consumes input frames and samples output frames entirely in parallel, enforcing consistency across the sampled frames. This is achieved by training on several correlated targets, sampling a global latent variable, and using a deterministic rendering network, in contrast to earlier video prediction models that generate frames sequentially. The study backs the methodology with strong experimental evidence on stochastic 3D reconstruction and jumpy video prediction. [2]

This study presents a technique that synthesises motion blur from a pair of sharp input images using a neural network architecture with a differentiable "line prediction" layer. The authors construct a synthetic dataset of motion-blurred images using frame interpolation techniques and then evaluate the model on a real test dataset. The approach is faster than frame interpolation and better suited to synthesising training data online for deep learning. [3]

TiVGAN, the Text-to-Image-to-Video Generative Adversarial Network, produces videos from text descriptions. The framework's incremental evolutionary generator first produces a single image and then gradually extends it into a video clip of the desired length. The generator is stabilised during training, while conditioning on the input text, through several techniques, including a two-stage training procedure, a progressive growth strategy, and a feature matching loss. The network is trained with a combination of adversarial, feature matching, and perceptual losses; the precise architecture of the network is provided in the supplementary material. [4]

This study reviews the most recent video Generative Adversarial Network (GAN) models in depth. The paper begins by recapping earlier reviews of GANs, identifying gaps in the research, and summarising the main advances made by GAN models and their variants. It then divides video GAN models into unconditional and conditional models and reviews each category, concluding with a discussion of probable future directions for video GAN research. Overall, the work is a valuable resource for anyone interested in the most recent developments in video GANs and their potential uses across a range of industries. [5]

Putting PhysNLU into practice involves creating a domain-specific ontology and preparing a corpus of physics questions and answers. PhysNLU offers multiple-choice physics questions that test a model's command of natural language, generating suggestions while evaluating the coherence of explanations with automatic metrics and human annotations. A platform for crowdsourcing expert annotations is built into the benchmark. PhysNLU is therefore a helpful tool for assessing how well NLP models understand physics and generate logical explanations; it strengthens NLU evaluation in the physics domain and increases the robustness of NLP systems in this particular subject. [6]

This paper presents a method for producing videos with diffusion probabilistic models. Based on the principles of diffusion processes, the authors propose a method that iteratively transforms a noise distribution until it resembles the target distribution. They provide a full study of existing video generation techniques and emphasise how poorly those techniques capture complex temporal dynamics. The proposed diffusion probabilistic model exploits the temporal links between frames to model video data effectively. The authors demonstrate the method on numerous video datasets and compare it with state-of-the-art methods; the results show that the diffusion model generates high-quality videos with better temporal coherence and realism. The paper thus advances video generation with a potent strategy based on diffusion probabilistic modelling. [7]

The paper introduces a video generation technique based on video diffusion models. The authors extend diffusion probabilistic models to sequential data such as video and discuss the technical details of their strategy, which entails iteratively applying a diffusion process to create each frame of a clip. They demonstrate the approach by training video diffusion models on various video datasets and presenting the generated clips, and they compare it with other cutting-edge video generation methods, showing that it performs better at capturing complex temporal dynamics and visual quality. The study advances the field of video generation by providing a fresh and effective approach built on video diffusion models. [8]

CogVideo, a large-scale pretraining technique for transformer-based text-to-video generation, is introduced in the paper. The authors propose an architecture built around a two-stage pretraining procedure that captures both textual and visual data: a language model is first pretrained on a sizeable text corpus and then refined on a huge video dataset. They also introduce a new pretraining objective, Cross-modal Generative Matching (CGM), to align the text and video representations. The authors evaluate CogVideo's performance on several text-to-video generation tasks against other methods. The field of text-to-video generation thus gains a strong and effective transformer-based pretraining technique. [9]
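The diffusion-based generators surveyed above ([7], [8]) all rest on the same reverse-diffusion idea: start from pure noise and repeatedly denoise it with a learned network. The following minimal PyTorch sketch shows a generic DDPM-style ancestral sampling loop to make that idea concrete; it is an illustrative reconstruction rather than code from the cited papers, and the model(x, t) noise predictor, the betas schedule, and the 5-D video tensor shape are assumptions.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, betas, shape, device="cpu"):
    """Generic DDPM ancestral sampling loop (illustrative sketch).

    Assumes `model(x_t, t)` predicts the noise added at step t and `betas`
    is a 1-D tensor holding the forward-process variance schedule of length T.
    For video models, `shape` would be (batch, frames, channels, height, width).
    """
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)  # predicted noise at this timestep
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise  # one reverse-diffusion step
    return x
```

Video-specific models differ mainly in how the denoising network is built (for example, factorised space-time attention) and in how frames share information during each denoising step.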
The paper describes a way to generate video from text without paired text-video data. The authors propose a framework called Make-A-Video that combines a text-to-image model with a video prediction model: the text-to-image model first builds a static image representation from the input text, and the video prediction model then gradually creates subsequent frames using motion information. By training these models in a self-supervised manner, the authors demonstrate that videos can be created from text descriptions without coupled text-video data. The proposed approach is tested on several datasets, confirming its effectiveness in creating engaging and diverse videos. The study contributes to text-to-video generation by outlining a promising technique that does away with the need for text-video training pairs. [10]

This study introduces Imagen Video, a technique for producing high-definition videos with diffusion models. The authors propose an extension of the diffusion probabilistic model designed specifically for high-quality video generation, creating a hierarchical diffusion approach that captures both spatial and temporal dependencies within the video frames. After being trained with a variety of noise levels, the model creates each frame by iteratively denoising. The authors demonstrate Imagen Video's effectiveness on various video datasets, emphasising its ability to create high-resolution videos with improved visual clarity and coherence through an efficient procedure. [11]

The article introduces DiffusionDB, a massive prompt gallery dataset designed for building and testing text-to-image generative models. The authors address the limited diversity and quality of existing prompt databases by gathering a large collection of varied and aesthetically appealing prompts. Each of the many text prompts in DiffusionDB is paired with a high-quality image drawn from an open-access library, and a vast variety of visual concepts, objects, situations, and styles was explicitly included when the dataset was designed. The authors also offer evaluation metrics for assessing the diversity and coverage of prompt galleries, and they train and evaluate cutting-edge text-to-image generative models to demonstrate DiffusionDB's utility. The paper's large-scale prompt gallery is a valuable resource that enables more thorough training and evaluation of text-to-image generative models. [12]

This study introduces Text2Video-Zero, a technique for creating zero-shot videos with text-to-image diffusion models. The authors present a method that employs pre-trained text-to-image models to create video sequences directly from written descriptions, without any video-specific training. By conditioning the diffusion process on textual inputs, the model can generate varied and well-coordinated video frames. The authors demonstrate Text2Video-Zero's effectiveness, and its ability to create convincing and visually appealing clips, on a variety of video datasets. Using only pretrained text-to-image models and textual descriptions, the method offers a zero-shot route to video creation. [13]

The paper presents an approach to text-to-image synthesis using generative adversarial networks (GANs). The authors propose a model that comprises a text encoder, an image generator, and a discriminator network: the text encoder maps textual descriptions into an embedding space from which the image generator synthesises images, while the discriminator distinguishes real from generated images and provides the feedback that guides image creation. Evaluations on benchmark datasets show how effectively the model generates visually coherent and semantically meaningful images from text inputs, effectively closing the gap between written descriptions and realistic image generation. [14]
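To make the generator-discriminator-text-encoder arrangement described in [14] concrete, the toy PyTorch sketch below conditions both networks on a text embedding. It is an illustrative MLP formulation only; the class names, layer sizes, and embedding dimensions are assumptions and do not reproduce the cited architecture.

```python
import torch
import torch.nn as nn

class TextConditionedGenerator(nn.Module):
    """Toy generator: maps a noise vector plus a text embedding to an image."""
    def __init__(self, noise_dim=100, text_dim=256, img_pixels=3 * 64 * 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 512),
            nn.ReLU(),
            nn.Linear(512, img_pixels),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, z, text_emb):
        return self.net(torch.cat([z, text_emb], dim=1))


class TextConditionedDiscriminator(nn.Module):
    """Toy discriminator: scores an image/text pair as real or generated."""
    def __init__(self, text_dim=256, img_pixels=3 * 64 * 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_pixels + text_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 1),  # real/fake logit
        )

    def forward(self, img, text_emb):
        return self.net(torch.cat([img, text_emb], dim=1))


# One adversarial forward pass: the generator tries to make its
# text-conditioned samples indistinguishable from real images.
G, D = TextConditionedGenerator(), TextConditionedDiscriminator()
z = torch.randn(4, 100)         # batch of noise vectors
text_emb = torch.randn(4, 256)  # stand-in for encoded captions
fake_logits = D(G(z, text_emb), text_emb)
```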
This work introduces Promptify, a text-to-image generation approach that combines interactive prompt exploration with large language models. The authors propose a framework in which human input and a large language model together produce high-quality images from textual prompts. With Promptify, users can iteratively tweak the prompt and receive rapid visual feedback from the model; this interactive exploration lets users precisely steer the generated image by altering the prompt phrasing. The authors demonstrate Promptify's success through user studies and comparisons with alternative strategies, emphasising improved image quality and user satisfaction. [15]

3. METHODOLOGY OF PROPOSED SYSTEM

The proposed text-to-video synthesis system employs a systematic methodology that integrates textual descriptions with visual content to produce coherent and realistic videos.

The videos produced fall into two classes:
Generic Videos
Enhanced Videos

The following flow diagram represents the steps involved in the proposed system; a condensed code-level sketch of the same pipeline is given below it.
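As a complement to the flow diagram, the snippet below sketches how the two video classes could be produced in code. It is a hedged illustration rather than the system's actual implementation: the Hugging Face diffusers text-to-video pipeline and the damo-vilab/text-to-video-ms-1.7b checkpoint stand in for the system's own backbone, and the commented-out VideoLORA lines only mark where a style adapter would be applied.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Stand-in text-to-video backbone; the proposed system's own model is assumed
# to expose a similar prompt -> frames interface.
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
).to("cuda")

prompt = "An astronaut riding horse in outer space"

# Generic video: frames sampled directly from the base model.
result = pipe(prompt, num_inference_steps=25)
frames = result.frames[0]  # recent diffusers versions return a batch of frame lists
export_to_video(frames, "generic.mp4")

# Enhanced video: conceptually, a VideoLORA style adapter is merged into the
# backbone before sampling; the lines below are placeholders for that step.
# pipe.load_lora_weights("path/to/coco_style_lora")  # hypothetical checkpoint path
# styled = pipe(prompt, num_inference_steps=25).frames[0]
# export_to_video(styled, "enhanced_coco_style.mp4")
```

Post-processing with OpenCV, AV, and MoviePy, described next, then operates on the exported frames.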
These frameworks enable the investigation of cutting-edge methods for text-to-video synthesis.

3.3 OpenCV:
To manage video data, the system relies on OpenCV, the Open Source Computer Vision Library. Its extensive feature set enables efficient image processing, frame extraction, and video file I/O. OpenCV's capabilities also include feature detection, object tracking, and video stabilisation, which provide tools to improve video quality and coherence. In addition, its support for a variety of image formats guarantees compatibility with multiple multimedia sources, which is essential when working with heterogeneous data in the text-to-video pipeline. The system uses OpenCV for reliable video manipulation, which is essential for turning written descriptions into aesthetically appealing and cohesive video outputs.
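A minimal sketch of the frame-level I/O that OpenCV handles in this role is shown below; the file names, the Gaussian-blur cleanup step, and the fixed 8 fps output rate are illustrative assumptions rather than the system's exact settings.

```python
import cv2

# Decode a generated clip into individual frames for post-processing.
cap = cv2.VideoCapture("generic.mp4")
frames = []
while True:
    ok, frame = cap.read()  # frame is a BGR numpy array
    if not ok:
        break
    frames.append(frame)
cap.release()

# Example per-frame processing step: light denoising to improve coherence.
processed = [cv2.GaussianBlur(f, (3, 3), 0) for f in frames]

# Re-encode the processed frames into a video file.
height, width = processed[0].shape[:2]
writer = cv2.VideoWriter("processed.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"), 8, (width, height))
for f in processed:
    writer.write(f)
writer.release()
```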
3.4 AV:
AV, a multimedia library, gives the system the ability to manage sophisticated audiovisual data. The library makes it possible to combine audio and video elements seamlessly, ensuring synchronisation and improving the overall usability of the output videos. AV excels at managing codecs, handling multimedia metadata, and parsing and decoding video files. Its support for several video formats and for streaming further increases the system's adaptability, accommodating various data inputs and output formats. The system incorporates AV to ensure that the final videos appropriately reflect the intended storylines and styles developed from the text prompts while also maintaining the integrity of the multimedia content.
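The sketch below illustrates the kind of decoding and re-encoding delegated to AV, assuming it refers to the PyAV package (import av); the codec, pixel format, frame rate, and file names are illustrative assumptions.

```python
import av

# Decode the frames of a generated clip.
in_container = av.open("processed.mp4")
rgb_frames = [f.to_ndarray(format="rgb24") for f in in_container.decode(video=0)]
in_container.close()

# Re-encode with an explicit codec and pixel format so the output
# plays back consistently across players.
out = av.open("final.mp4", mode="w")
stream = out.add_stream("libx264", rate=8)  # 8 fps, H.264
stream.width = rgb_frames[0].shape[1]
stream.height = rgb_frames[0].shape[0]
stream.pix_fmt = "yuv420p"

for img in rgb_frames:
    frame = av.VideoFrame.from_ndarray(img, format="rgb24")
    for packet in stream.encode(frame):
        out.mux(packet)
for packet in stream.encode():  # flush the encoder
    out.mux(packet)
out.close()
```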
3.5 MoviePy:
The system uses MoviePy, a video editing library, as its creative toolbox for enhancing video outputs. Thanks to its simple API, the system can smoothly incorporate transitions, apply visual effects, and assemble clips. MoviePy's text and image overlay capabilities enable extra information or branding to be included in the videos, and its ability to concatenate videos, cut segments, and alter video characteristics makes it easier to create polished, professional-level results. MoviePy's integration with several video file formats ensures that the output videos are compatible with a wide range of multimedia players and platforms. The system uses MoviePy to add an artistic layer to the generated videos, improving their visual appeal and narrative quality.
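The following sketch shows the kind of assembly and overlay work handled with MoviePy (1.x API, i.e. the moviepy.editor interface); the clip names, caption text, and output settings are illustrative, and TextClip additionally assumes an ImageMagick installation.

```python
from moviepy.editor import (CompositeVideoClip, TextClip, VideoFileClip,
                            concatenate_videoclips)

# Join the generic and the style-enhanced clips into one sequence.
generic = VideoFileClip("generic.mp4")
enhanced = VideoFileClip("enhanced_coco_style.mp4")
sequence = concatenate_videoclips([generic, enhanced])

# Overlay the originating text prompt as a caption.
caption = (TextClip("An astronaut riding horse in outer space",
                    fontsize=24, color="white")
           .set_duration(sequence.duration)
           .set_position(("center", "bottom")))

final = CompositeVideoClip([sequence, caption])
final.write_videofile("showcase.mp4", fps=8)
```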
The system's ability to understand textual prompts and visually communicate them is demonstrated by the underlying text-to-video model's effective transformation of prompts into cohesive video sequences. By including style transfer through VideoLORA, the created videos are further improved and infused with distinctive artistic styles that match the selected prompts. The outcomes highlight how well the system combines complex visual components with linguistic cues to produce videos that resonate in both substance and style.

In the context of style transfer, the selection of the VideoLORA style has a big impact on the personality and atmosphere of the generated videos. The successful integration of artistic styles highlights the opportunity for individualised and flexible video content production. The system's reliance on efficient libraries such as PyTorch and PyTorch Lightning further streamlines the generation process and enables quick testing and improvement. The system's demonstrated capabilities, supported by reliable and repeatable findings, lay the groundwork for subsequent developments in multimedia synthesis, stimulating further research and development at the intersection of computer vision and natural language processing.

The following diagrams illustrate image frames of videos generated using the different model classes:

Text Prompt: An astronaut riding horse in outer space

Fig- 2. An astronaut riding horse in outer space - Generalised Video

Fig- 3. An astronaut riding horse in outer space - COCOStyle
[14] Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., & Lee, H. (2016). Text-to-image synthesis using generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4794-4803. doi:10.1109/cvpr.2016.296

[15] Brade, S., Wang, B., Sousa, M., Oore, S., & Grossman, T. (2023). Promptify: Interactive prompt exploration for text-to-image generation. arXiv preprint arXiv:2304.09337.