
Advances in AI-Generated Images and Videos


Hessen Bougueffa1, Mamadou Keita1, Wassim Hamidouche2, Abdelmalik Taleb-Ahmed1, Helena Liz-López3, Alejandro Martín3, David Camacho3, Abdenour Hadid4 *

1 Laboratory of IEMN, CNRS, Centrale Lille, UMR 8520, Univ. Polytechnique Hauts-de-France (France)
2 Univ. Rennes, INSA Rennes, CNRS, IETR - UMR 6164, Rennes (France)
3 Computer Systems Department, Universidad Politécnica de Madrid (Spain)
4 Sorbonne Center for Artificial Intelligence, Sorbonne University Abu Dhabi (United Arab Emirates)

* Corresponding author: bougueffaeutamenehessen@gmail.com (H. Bougueffa), Mamadou.Keita@uphf.fr (M. Keita), whamidouche@gmail.com (W. Hamidouche), abdelmalik.taleb-ahmed@uphf.fr (A. Taleb-Ahmed), helena.liz@upm.es (H. Liz-López), alejandro.martin@upm.es (A. Martín), david.camacho@upm.es (D. Camacho), abdenour.hadid@ieee.org (A. Hadid).

Received 13 November 2024 | Accepted 20 November 2024 | Early Access 28 November 2024

Abstract

In recent years generative AI models and tools have experienced a significant increase, especially techniques to generate synthetic multimedia content, such as images or videos. These methodologies present a wide range of possibilities; however, they can also present several risks that should be taken into account. In this survey we describe in detail different techniques for generating synthetic multimedia content, and we also analyse the most recent techniques for their detection. In order to achieve these objectives, a key aspect is the availability of datasets, so we have also described the main datasets available in the state of the art. Finally, from our analysis we have extracted the main trends for the future, such as transparency and interpretability, the generation of multimodal multimedia content, the robustness of models and the increased use of diffusion models. We find a roadmap of deep challenges, including temporal consistency, computation requirements, generalizability, ethical aspects, and constant adaptation.

Keywords

AI-Generated Content, Image Generation, Multimodal, Video Generation.

DOI: 10.9781/ijimai.2024.11.003

I. Introduction

The recent progress in Artificial intelligence (AI) has led to a revolution in the creation of synthetic images and videos, mainly due to the remarkable capabilities of advanced generative models, diffusion models, or Generative adversarial networks (GANs), among others. There are now a large number of applications and tools available to users, such as DALL-E [1], GLIDE [2], Midjourney [3], Imagen [4], VideoPoet [5], Sora [6], or Genie [7]. These tools are designed to produce realistic and believable digital content easily. This development has had a profound impact, with various applications across different areas.

These techniques are capable of generating multimedia content on any topic or object. Therefore, there are countless opportunities, especially in application domains that can benefit greatly from these techniques and tools: entertainment and media, allowing the generation of characters, scenarios or elements that would be very difficult to create by traditional means [8]–[10]; creative industries, allowing artists to streamline their work and improve its quality, for example by creating sketches to work on further, or creating elements to add to their work [11], [12]; education, creating engaging educational content, including simulations and visual aids to help illustrate and clarify complex ideas, and adapting to different learning styles [13], [14]; and security and forensics, helping to create robust models capable of detecting false or generated information more easily, for example by assisting in data augmentation [15], [16]. As these examples show, the range of applications is very broad, and as the capabilities of these techniques improve, they can be applied ever more easily to different problems in society.

This collection of tools and methodologies not only presents advantages, but also a number of weaknesses and potential risks that need to be carefully analysed. The ease with which highly realistic synthetic media can be produced raises concerns about its possible inappropriate use. Deepfakes and other kinds of manipulated content can be used to spread misinformation, create disinformation, and manipulate public opinion, undermining trust in digital media [17], [18]. This dual potential for both positive and negative impact highlights a crucial problem: while leveraging the benefits of generative models, there is an urgent need to develop effective detection methods to distinguish between real and AI-generated content. As generative models become more sophisticated, the task of detecting synthetic media becomes increasingly complex, necessitating the continuous evolution of detection techniques.

Please cite this article as:
H. Bougueffa, M. Keita, W. Hamidouche, A. Taleb-Ahmed, H. Liz-López, A. Martín, D. Camacho, A. Hadid. Advances in AI-Generated Images and Videos, International Journal of Interactive Multimedia and Artificial Intelligence, vol. 9, no. 1, pp. 173-208, 2024, http://dx.doi.org/10.9781/ijimai.2024.11.003


Despite the significant advancements in generative models, several gaps and challenges persist in both their deployment and the methods used to detect synthetic media. One major challenge lies in the resource-intensive nature of training and deploying these models. High computational requirements limit accessibility, particularly for smaller organizations and researchers lacking the necessary infrastructure to fully utilise these technologies. This creates a barrier to wider adoption and raises concerns about the scalability and sustainability of generative models as they continue to evolve. Furthermore, even advanced models such as GLIDE [2] and DALL·E 2 [1] encounter challenges when processing complex prompts. These challenges can limit their ability to generate high-quality outputs under specific conditions. Similarly, Imagen [19] enhances computational efficiency but still grapples with resource demands and complex prompts. These limitations underscore a need for improved flexibility and robustness in current generative technologies.

On the video generation front, text-to-video models face significant challenges in maintaining high fidelity and continuity of motion over extended sequences. Many existing methods simply extend text-to-image models, which do not fully address the unique complexities inherent in video generation. This highlights the need for more specialized approaches that can effectively handle the temporal dynamics and continuity required for high-quality video content.

Detecting synthetic media presents significant challenges. Current detection models struggle to keep pace with the rapid advancements in generative technologies, making it difficult to reliably differentiate between real and AI-generated images and videos. These models tend to specialize in the types of synthetic content they were trained on, leading to poor performance when faced with new data from different or updated models. Additionally, detection algorithms must be resilient against various transformations and adversarial attacks [20], [21], such as image compression and blurring, which can significantly diminish their effectiveness. Techniques for identifying deepfakes [22] and other forms of image and video forgeries [23] also encounter obstacles due to the constantly evolving nature of these manipulations and the need for high-quality datasets and standardized benchmarks.

To address these challenges and advance the field, this survey:
• Presents an updated picture of synthetic image generation and detection techniques.
• Presents an overview of video generation and detection techniques.
• Provides a list of the main video and image datasets used by researchers.
• Describes trends, challenges and research directions that can be explored in AI generation, for both video and image, and supports them with the conclusions of the analysis.

By providing a thorough examination of both the generative and detection aspects of synthetic media, this survey aims to foster a deeper understanding of the current challenges and opportunities in the field, promoting the development of technologies that can maximize the benefits of AI-generated content while minimizing its risks.

This survey is structured to comprehensively address both the generative capabilities and detection techniques of AI-generated images and videos, see Fig. 1. Section II reviews related works and surveys, providing a foundation for understanding the current state of research in this domain. Section III dives into image generation and detection, detailing various advanced generative models and the methods used to detect synthetic images. Section IV focuses on video generation and detection, exploring the advancements in video generation and the techniques to identify AI-generated videos. Section V discusses the datasets used for generative and detection algorithms, highlighting the importance of diverse and high-quality datasets. Section VI identifies the ongoing challenges in both generating and detecting synthetic media. Finally, Section VII concludes the survey by summarizing the key findings and suggesting future directions for research and development in this field.

II. Related Work and Related Surveys

The field of AI-generated images and videos has been extensively studied, with several surveys reviewing the advancements and challenges in this area. This section provides an overview of key surveys and positions our work in relation to them, highlighting the unique aspects of our approach, summarised in Table I.

• Liu et al. [24] conducted an extensive review on human image generation, categorizing existing techniques into three main paradigms: data-driven, knowledge-guided, and hybrid. The survey covers the most representative models and approaches within each paradigm, highlighting their specific advantages and limitations. Additionally, it explores a range of applications, datasets, and evaluation metrics relevant to human image generation. The paper also addresses the challenges and potential future directions in the field, offering valuable insights for researchers interested in this rapidly evolving domain.

• Chen et al. [28] concentrated on controllable text-to-image generation models. They investigated various methods that precisely control the produced content, such as personalized and multi-condition generation techniques. The authors explore the practical applications of these models in content creation and design while also recognizing current constraints and suggesting future directions to enhance the adaptability and accuracy of these generative models.

• Joshi et al. [29] provided an extensive analysis of the use of synthetic data in human analysis, focusing on the advantages and challenges in biometric recognition, action recognition, and person re-identification. The survey delves into various techniques for generating synthetic data, including deep generative models and 3D rendering tools, emphasizing their potential to tackle issues related to data scarcity, privacy concerns, and demographic biases in training datasets. Additionally, the authors explore how synthetic data can augment real datasets to enhance model performance, support scalability analysis, and simulate complex scenarios that are challenging to capture with real data. They also address concerns about synthetic datasets, such as identity leakage and lack of diversity.

• Figueira et al. [25] focused on the generation of synthetic data with Generative Adversarial Networks (GANs). The authors emphasize the significance of synthetic data, particularly in cases where data is limited or contains sensitive information. They highlight how GANs can proficiently create high-quality synthetic samples that imitate real data distributions. This study presents a detailed summary of current methods and challenges in synthetic data generation, emphasizing the utilization of GANs for diverse data types, including tabular data, and exploring various GAN architectures that cater to these requirements.

• Nguyen et al. [26] offered a comprehensive review of deepfake generation and detection methods using deep learning techniques. They explored different types of deepfakes, such as face-swaps, lip-syncs, and puppet-master variations, while highlighting the progress and challenges in identifying these manipulations. The survey covers traditional and deep learning-based approaches for detecting deepfakes, including methods based on manual feature creation and those utilizing deep neural networks. Their work emphasizes the importance of developing robust detection algorithms to counter the increasing complexity of deepfake creation techniques. This study holds particular relevance in developing new multimodal approaches for deepfake detection, which are in alignment with investigating cross-modality fusion strategies.


[Fig. 1 diagram: survey of AI-generated images and video — comparison against other surveys; image (generation, detection); video (generation, detection); datasets (image, video); challenges and future trends; conclusions.]

Fig. 1. Schematic representation of the structure followed.

TABLE I. Comparison of Previous Literature Reviews

Authors | Year | Main Contribution | Limitations
Liu et al. [24] | 2022 | It provided an extensive review on the generation of human images. | It only deals with the generation of human images, without covering other possible scenarios.
Zhang et al. [22] | 2022 | It provides a detailed analysis of video and image sample manipulation and detection techniques. | Focus on the manipulation of video and image samples.
Figueira et al. [25] | 2022 | It provides a very detailed analysis of the use of GANs within data generation, focusing on training problems and evaluation techniques. | It does not focus on image and video generation.
Nguyen et al. [26] | 2022 | It analyses both the techniques of generation, or manipulation, and the detection of images and videos. | It is mainly focused on the manipulation of multimedia data, not so much on the generation of synthetic samples.
Tyagi et al. [23] | 2023 | Performs a detailed analysis of manipulation and detection techniques for video and audio samples. | The focus is not on synthetic sample generation and detection techniques, but on manipulation techniques.
Bauer et al. [27] | 2024 | It performs one of the most comprehensive data generation analyses available. | It is not focused on the generation of image and video samples.
Chen et al. [28] | 2024 | It covers one of the newest approaches to image generation, diffusion models for the text-to-image task. | This is a very limited survey, as it covers only one of the imaging approaches, without analysing other techniques or modalities.
Joshi et al. [29] | 2024 | Explores techniques including improving model performance, increasing data diversity and scalability, and mitigating privacy issues. | It only focuses on generating samples that represent humans, leaving a large part of the field unstudied.

• Bauer et al. [27] examined Synthetic Data Generation (SDG) models, analyzing 417 models developed over the past decade. The survey classifies these models into 20 distinct types and 42 subtypes, providing a comprehensive overview of their functions and applications. The authors identified significant model performance and complexity trends, highlighting the prevalence of neural network-based approaches in most domains, except privacy-preserving data generation. The survey also discusses challenges, such as the absence of standardized evaluation metrics and datasets, indicating the need for enhanced comparative frameworks in future research.

• Zhang et al. [22] analysed the generation and detection of deepfakes, shedding light on both the progress made and the challenges encountered in this area. They outline two main techniques for creating deepfakes, face swapping and facial reenactment, and discuss the impact of GANs and other deep learning methods. Their work also explores various detection strategies, ranging from biometric and model features to machine learning-based methods. They emphasize the persistent challenges arising from evolving deepfake technologies, the need for high-quality datasets, and the absence of a standardized benchmark for detection methods. This survey is essential for gaining insights into the current state of generating and detecting deepfakes, which present significant challenges to privacy, security, and societal trust.

• Tyagi et al. [23] conducted a comprehensive analysis of image and video forgery detection techniques, highlighting the various manipulation methods, such as morphing, splicing, and retouching, and the challenges associated with detecting these alterations in digital media. The survey also reviewed different datasets used for training and evaluating forgery detection algorithms, emphasizing the need for robust, generalized methods capable of detecting multiple types of manipulations across diverse visual datasets. This work provides a detailed examination of both traditional and deep learning-based approaches, illustrating the advancements and limitations in the field of digital media forensics.


[Fig. 2 diagram: a text prompt ("Women Faces ...") and example outputs generated by Midjourney, DALL-E 3, Stable Diffusion, and Imagen.]

Fig. 2. Overview of the main approaches to image generation with AI.

[Fig. 3 taxonomy of AI-generated image detection: spatial artifacts — local spatial artifacts (Zhong et al. [78], Mathys et al. [86]) and global spatial artifacts (Shiohara et al. [19], Coccomini et al. [84]); frequency-domain artifacts — high-frequency artifacts (Synthbuster [76], Lorenz et al. [82]) and low-frequency artifacts (Poredi et al. [75]); intrinsic dimensionality — dimensionality-based (Lorenz et al. [82]) and statistical deviation (Ma et al. [80], Ojha et al. [85]); semantic inconsistencies — vision-language models (Keita et al. [88], Tan et al. [87]) and generalized semantic anomalies (Sinitsa et al. [74]).]

Fig. 3. Overview of AI-generated Image Detection.

As we can see, this survey has a number of advantages over other published reviews of the field. Firstly, it is the first work to focus exclusively on synthetic sample generation techniques, and it also provides a list of datasets published in recent years. It also analyses the approaches with which researchers are tackling the problem of detecting these synthetic samples.

III. AI Image Generation and Detection

In this section, we will focus on the generation of images with AI techniques, as well as on the main approaches for their detection. As mentioned above, AI, and more specifically Deep Learning (DL), has shown significant progress in the fields of image generation and detection. Advanced models have greatly improved the ability to generate synthetic images, focusing on enhancing aspects such as image quality and realism. Recent developments have led to improved training stability and higher-quality generated images, addressing common challenges and allowing for the creation of diverse and realistic outputs. Innovations in model architectures have also provided greater control over the image generation process, resulting in even more varied and convincing synthetic images. Fig. 2 illustrates a subset of AI-generated image and video techniques, specifically focusing on generative models that rely on text or prompts to create the samples. While this figure highlights key models used in text-to-image or text-to-video synthesis, other generative approaches are also discussed in the subsequent sections.

Models for synthetic image detection have also made substantial progress. These detection models have become more advanced, using deep learning techniques to identify subtle artifacts and inconsistencies in generated images. As a result, they are crucial in differentiating between real and synthetic images, ensuring the integrity of visual content. The ongoing evolution of these models indicates the dynamic nature of the field, with continuous research efforts focused on improving their precision and resilience [30], [31].

A. Image Generation

Within AI image generation, we will analyse two different approaches, see Fig. 3. The first approach, Text-to-image synthesis, focuses on generating image samples from text descriptions, while the second approach, Image-to-image translation, focuses on modifying an original image while preserving some visual properties in the final sample. A concise summary of the main image generation techniques is presented in Table II.


TABLE II. Comprehensive Overview of a Few Synthetic Image Generation Techniques

Models | Year | Technique | Target Outcome | Data Used | Open Source
NVAE [66] | 2020 | Hierarchical VAE | High-fidelity images | CelebA, FFHQ | No
CogView [41] | 2021 | Transformer-based | Text-to-image synthesis | Diverse text and images | Yes
StyleGAN3 [59] | 2021 | GAN-based | High-quality images | FFHQ, CelebA | Yes
BigGAN [73] | 2021 | GAN-based | Large-scale image synthesis | ImageNet | Yes
GLIDE [2] | 2021 | Diffusion-based | Generate images from text prompts | DALL-E's dataset | Yes
DALL-E 2 [1] | 2022 | Transformer-based | Text-to-image synthesis | Custom, diverse content | Yes
DiVAE [38] | 2022 | VQ-VAE with diffusion | High-quality reconstruction | ImageNet | No
VQ-VAE-2 [65] | 2022 | VAE-based | High-resolution images | Large-scale datasets | Yes
EfficientGAN [61] | 2022 | GAN-based | Efficiency and quality | Custom datasets | Partial
Latent Diffusion [43] | 2023 | Diffusion-based | Photorealistic images | Various | Yes
DALL-E 3 [51] | 2023 | Enhanced Transformer | Improved prompt following | Custom image captioner dataset | No
Imagen [4] | 2023 | Transformer-based | High-fidelity image synthesis | Open Images, ImageNet | No
Imagen2 [50] | 2023 | Style-conditioned diffusion | Lifelike images with context | Diverse dataset | No
Muse [40] | 2023 | Transformer T5-XXL | High-fidelity zero-shot editing | CC3M, COCO | No
SDXL [48] | 2023 | Stable Diffusion | High-resolution image synthesis | Custom dataset | Yes
StyleGAN-T [32] | 2023 | GAN-based | High-quality image synthesis | Comprehensive dataset with various text-image pairs | Yes
GALIP [35] | 2023 | GAN-based, utilizing CLIP | Efficient quality image creation from text | Diverse datasets | Yes
GigaGAN [33] | 2023 | Advanced GAN | High-resolution, detailed image generation from text | Extensive datasets with diverse image-text pairs | Yes
UFOGen [37] | 2024 | GAN and diffusion | High-quality fast generation | - | No
RAPHAEL [49] | 2024 | Diffusion with MoEs | Artistic images from text | Subset of LAION-5B | Yes
Ahmed et al. [36] | 2024 | GAN with spatial co-attention | Enhanced image generation | CUB, Oxford-102, COCO | No

1. Text-to-Image Synthesis

In this section, we will look at different approaches to creating synthetic images from text. As this is a growing field, we can observe a variety of different techniques, such as GANs, transformers or diffusion models.

Generative Adversarial Networks: Some authors continue to focus on GANs which, although not particularly novel, achieve competitive results in the field. For example, Sauer et al. [32] have improved the robust StyleGAN architecture to develop StyleGAN-T. This model tackles the challenge of producing visually diverse and attractive images from textual descriptions at scale, effectively speeding up the process while maintaining image fidelity. StyleGAN-T is trained on a comprehensive dataset containing various text-image pairs, ensuring diverse visual outputs. However, one limitation is the potential for reduced accuracy in rendering complex scenes due to the inherent challenges of text ambiguity and the current limitations of GANs in understanding nuanced textual descriptions. Kang et al. [33] proposed GigaGAN, an architecture that includes an improved generator and discriminator that efficiently handle large-scale data, allowing for the creation of diverse and visually compelling images. However, like other large-scale GANs, GigaGAN requires significant computational resources for training and has the potential to overfit precise textual descriptions if the training data lacks diversity. Despite these limitations, GigaGAN's image synthesis capability is a powerful tool in AI-driven creative image generation, expanding the boundaries of machine understanding and visualization of textual content. The model TextControlGAN [34] introduces an innovative method to improve text-to-image synthesis by modifying the Generative Adversarial Network (GAN) architecture. This modification aims to enhance control and precision in generating images from textual descriptions by integrating specific control mechanisms within the GAN framework. This capability is essential for applications that require high fidelity between textual inputs and visual outputs, such as in digital media creation and automated content generation.

Other authors have explored combining GANs with other types of techniques, such as the CLIP model. For example, Ming Tao et al. [35] applied the pre-trained CLIP model to Generative Adversarial Networks (GANs) to transform the process of text-to-image synthesis. This innovative approach enhances the efficiency and quality of the images created from textual descriptions. By integrating CLIP into both the discriminator and generator, the model achieves strong scene understanding and domain generalization using fewer parameters and less training data. By leveraging diverse and extensive datasets, this method enables the generation of a broad range of intricate and visually appealing images. This approach accelerates the synthesis process and ensures a smoother and more controllable latent space, thereby significantly reducing the computational resources typically required for high-quality image synthesis.

Ahmed et al. [36] proposed a novel approach that involves simultaneously generating images and their corresponding foreground-background segmentation masks. This is achieved by using a new Generative Adversarial Network (GAN) architecture named COS-GAN, which incorporates a spatial co-attention mechanism to improve the quality of both the images and segmentation masks. The innovative aspect of COS-GAN lies in its ability to handle multiple image outputs and their segmentations from textual descriptions, thereby enhancing applications such as object localization and image editing. It was extensively tested on diverse datasets, including CUB, Oxford-102, and COCO. However, it faces challenges, such as the high computational demand required for training and potential biases embedded within the large-scale datasets used. These limitations could impact its generalizability and ethical deployment.
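To make the adversarial setup shared by these text-conditioned GANs a little more concrete, the following is a minimal PyTorch sketch of one training step for a toy text-conditional GAN. The layer sizes, the binary cross-entropy objective and the random stand-in tensors are illustrative assumptions only; the sketch does not reproduce StyleGAN-T, GigaGAN, TextControlGAN, GALIP or COS-GAN, all of which use far more elaborate architectures and losses.

```python
# Minimal text-conditional GAN update (illustrative sketch, not a cited system).
# Assumes pre-computed text embeddings (e.g., from a frozen text encoder).
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=64, txt_dim=128, img_dim=3 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + txt_dim, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Tanh())

    def forward(self, z, txt):
        return self.net(torch.cat([z, txt], dim=1))

class Discriminator(nn.Module):
    def __init__(self, txt_dim=128, img_dim=3 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1))

    def forward(self, img, txt):
        return self.net(torch.cat([img, txt], dim=1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_imgs, txt_emb):
    b = real_imgs.size(0)
    fake = G(torch.randn(b, 64), txt_emb)

    # Discriminator: real (image, text) pairs vs. generated ones.
    d_loss = bce(D(real_imgs, txt_emb), torch.ones(b, 1)) + \
             bce(D(fake.detach(), txt_emb), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool the discriminator for the same text condition.
    g_loss = bce(D(fake, txt_emb), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Example with random stand-in data: a batch of 8 flattened 32x32 RGB images.
d_l, g_l = train_step(torch.rand(8, 3 * 32 * 32) * 2 - 1, torch.randn(8, 128))
```

The point the published models share with this toy version is that the discriminator scores (image, text) pairs rather than images alone, so the generator is pushed towards outputs that actually match the conditioning text.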


By contrast, Xu et al. [37] chose to combine GANs with diffusion models. They proposed UFOGen, which offers a novel approach to generating high-quality images from text quickly. Combining elements of Generative Adversarial Networks (GANs) and diffusion models, it efficiently creates images in a single step, eliminating the need for the slower, multi-step processes used by standard diffusion models. UFOGen's training process is greatly improved by utilizing pre-trained diffusion models, which enhances efficiency and reduces training times. However, similar to other generative models, UFOGen also faces limitations. It depends on large-scale datasets that may contain biased or inappropriate content, potentially leading to biased generated images, which raises ethical concerns and affects the fairness and diversity of the output.

Autoencoder models: Another approach we have seen in the generation of images from text is autoencoder models. For example, Saharia et al. [4] introduced Imagen, a text-to-image model using classifier-free guidance (CFG) and a pre-trained T5-XXL encoder to improve computational efficiency. The model's key innovation is using large language models to enhance image quality and text-image alignment. Imagen generates images starting at 64×64 resolution, then upscales to 256×256 and 1024×1024 using super-resolution models. Despite achieving a strong FID score of 7.27 on COCO, the model faces challenges with dataset biases, high computational demands, and difficulties in generating realistic human images. On the other hand, Shi et al. [38] developed DiVAE, which combines a VQ-VAE architecture with a denoising diffusion decoder to create highly realistic images, excelling in image reconstruction and text-to-image synthesis tasks. Using a CNN encoder, the model first compresses images into latent embeddings and then reconstructs them into high-quality images through a diffusion-based decoder. Trained on the ImageNet dataset, DiVAE delivers superior performance in terms of FID scores compared to models like VQGAN. However, the diffusion process is computationally intensive, requiring many steps, and the model is restricted by the fixed image size determined by the training data.

Contrastive learning: It has also been shown that this type of learning is a good technique for tackling this type of task using AI models. The CLIP model [39], created by OpenAI, has attracted the attention of a large number of researchers. This model is able to relate images and text by using contrastive learning, training on large multimodal datasets to align visual and linguistic representations in a shared space, allowing tasks such as image generation, search and classification to be performed without the need for specific supervised training. As a result, it is one of the most widely used approaches for researchers to generate synthetic images from text.

Transformer: We have also analysed different research that has used transformers for the generation of synthetic images. Muse [40] is a Transformer designed for text-to-image generation. It utilizes a pre-trained T5-XXL language model to predict masked image tokens. Trained on 460 million text-image pairs from the CC3M and COCO datasets, this model excels in generating high-fidelity images and supports zero-shot editing, such as inpainting and outpainting. Muse's efficiency exceeds that of diffusion and autoregressive models due to its discrete token space and parallel decoding. However, it faces challenges in rendering long phrases, handling high object cardinality, and managing multiple cardinalities in prompts. Ming Ding et al. [41] have introduced CogView. This model harnesses a 4-billion-parameter Transformer architecture in combination with a VQ-VAE tokenizer. CogView operates by encoding text into discrete tokens, which the Transformer processes to forecast corresponding visual tokens. These visual tokens are then transformed into high-quality images using the VQ-VAE decoder. CogView underwent training on extensive datasets, incorporating image-text pairs from diverse sources. Despite its remarkable capabilities, CogView does have limitations. The model demands substantial computational resources for training owing to its expansive parameter size. Similar to numerous text-to-image models, it encounters challenges with intricate or ambiguous text prompts, leading to less precise image generation. Additionally, dependence on extensive datasets can introduce biases within the training data, impacting the variety and impartiality of the generated images. CogView2 [42] used a sophisticated Transformer architecture to quickly generate high-quality images from text. The model begins by producing low-resolution images and then progressively refines them using super-resolution modules, ensuring detailed and consistent results. With a foundation built on a 6-billion-parameter Transformer, the model is trained on diverse datasets of text-image pairs, allowing it to handle tasks such as text-to-image generation, image infilling, and captioning in multiple languages. Nevertheless, CogView2 requires substantial computational resources and careful tuning to balance local and global coherence in the generated images.

Diffusion models: This is one of the topics that has attracted the most researchers. Latent Diffusion Models (LDMs) [43] are a major step forward in high-resolution image synthesis, see Fig. 4. They achieve this by using diffusion models within the latent space of pre-trained autoencoders. This reduces the computational requirements typically associated with diffusion models operating in pixel space while maintaining high visual fidelity. Incorporating cross-attention layers within the UNet backbone is a significant advancement in LDMs. It enables the generation of high-quality outputs based on various input conditions, such as text prompts and bounding boxes. This architecture supports high-resolution synthesis using a convolutional approach. The model is trained to predict a less noisy version of the latent variable by focusing on essential semantic features rather than on high-frequency details that are often imperceptible.

[Fig. 4 diagram: the latent diffusion architecture — an encoder E and decoder D in pixel space, a diffusion process and denoising U-Net operating in latent space, and conditioning inputs (semantic map, text, representations, images) injected via cross-attention.]

Fig. 4. Latent Diffusion Models architecture from Rombach et al. [43].

Anton et al. [44] present a new method, Kandinsky, for synthesizing images from text by combining image-prior models with latent diffusion techniques. The model utilizes CLIP to map text embeddings to image embeddings and incorporates a modified MoVQ implementation as the image autoencoder. After training on the COCO-30K dataset, Kandinsky achieves high-quality image generation with a competitive FID score. Despite the need for further improvements in the semantic coherence between text and generated images, Kandinsky's versatility in supporting text-to-image generation, image fusion, and inpainting represents a significant advancement in AI-driven image synthesis. EmoGen [45] marks a significant leap forward in text-to-image models. It centers on producing images that capture distinct emotions, solving the difficulty of linking abstract emotions with visual representations. This model excels at creating images that are semantically clear and resonate emotionally. It accomplishes this by aligning the emotion-specific space with the powerful semantic capabilities of the CLIP model. This alignment is established through a mapping network that interprets abstract emotions into concrete semantics, guaranteeing that the generated images faithfully reflect the intended emotional tones. The model has undergone training and validation using EmoSet, a comprehensive visual emotion dataset with detailed attribute annotations, aiding in optimizing the model for diverse and emotionally accurate image generation. Despite its advancements, EmoGen faces challenges akin to other generative models, including reliance on potentially biased large datasets and the substantial computational resources needed for training and inference, limiting its accessibility and applicability across different research groups and practical uses.
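The contrastive objective behind CLIP, which several of the models above (GALIP, GLIDE, Kandinsky, EmoGen) build on, can be stated in a few lines: matched image-caption pairs are pulled together and mismatched pairs pushed apart through a symmetric cross-entropy over cosine similarities in a shared embedding space. The sketch below illustrates only that objective; the encoders are replaced by random tensors, and the temperature value is an illustrative assumption rather than OpenAI's training configuration.

```python
# CLIP-style symmetric contrastive loss (illustrative sketch with stand-in encoders).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Normalise embeddings so the dot product is a cosine similarity.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(img_emb.size(0))           # i-th image matches i-th caption
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Stand-ins for the outputs of an image encoder and a text encoder
# projected into the same 256-dimensional shared space.
image_features = torch.randn(16, 256)
text_features = torch.randn(16, 256)
print(clip_contrastive_loss(image_features, text_features))
```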


Latent Diffusion Models (LDMs) also have their limitations. One significant challenge is the use of large-scale, often uncurated datasets, which can introduce biases and ethical concerns. While LDMs are more computationally efficient than traditional pixel-based diffusion models, they still require substantial computational resources for training and inference, which may be prohibitive for smaller research groups. LDMs also struggle with generating realistic images of people, leading to lower preference rates in evaluations. Additionally, these models can reflect societal biases, highlighting the importance of robust bias mitigation strategies and the need for more ethically curated datasets in future research. Hang Li et al. [46] present an innovative approach focusing on the ethical implications of AI-generated content and introduce a self-supervised method for identifying interpretable latent directions within diffusion models. The objective is to mitigate the generation of inappropriate or biased images, thus enhancing control over the generated images and ensuring they align with ethical standards while avoiding perpetuating harmful stereotypes. The model has been trained on diverse datasets, allowing it to handle a broad scope of concepts sensitively and responsibly. However, the extensive reliance on datasets may introduce potential biases, while the high computational demand for processing these datasets presents challenges for accessibility and scalability.

Some researchers have chosen to combine the CLIP model with diffusion models. For example, Nichol et al. [2] introduced GLIDE, a text-to-image diffusion model that replaces class labels with text prompts. It uses classifier guidance, with a CLIP model in noisy image space, and classifier-free guidance [47], which integrates text features directly into the diffusion process. GLIDE's 3.5B parameter model encodes text through a transformer to generate high-quality images. While effective in photorealism and caption alignment, GLIDE struggles with complex prompts and requires substantial computational power. Ramesh et al. [1] introduced DALL·E 2, a model leveraging CLIP and diffusion techniques for generating realistic images from text descriptions. DALL·E 2 operates in two stages: a prior model creates a CLIP image embedding from text, followed by a diffusion-based decoder that generates the final image. This architecture ensures both diversity and realism in the output. The model's use of CLIP embeddings captures semantic and stylistic nuances, enabling high-quality image generation and manipulation. Although trained on a vast dataset, DALL·E 2 faces challenges with complex prompts and fine-grained attribute accuracy, highlighting areas for further improvement.

Furthermore, Podell et al. [48] developed SDXL, which is a major step forward in high-resolution image synthesis, expanding on the foundational work of Stable Diffusion models. It utilizes a significantly larger UNet backbone, about three times larger than its predecessors, with more attention blocks and a larger cross-attention context. This enhanced architecture enables SDXL to tackle complex text-to-image synthesis tasks effectively. Additionally, SDXL incorporates multiple innovative conditioning schemes and is trained on various aspect ratios, enhancing its versatility in producing images of different resolutions and aspect ratios. Firstly, it generates initial 128×128 latents. Then, a specialized high-resolution refinement model is applied to improve these latents to higher resolutions. The SDXL training involved utilising an improved autoencoder from previous Stable Diffusion versions. It exceeded its predecessors in all assessed reconstruction metrics, ensuring improved local and high-frequency details in the generated images. The final training stage included multi-aspect training with different aspect ratios, further boosting the model's capabilities. Despite its progress, SDXL has some limitations. The model's reliance on large-scale datasets can lead to biases and ethical concerns due to potentially inappropriate content such as pornographic images, racist language, and harmful social stereotypes. SDXL also struggles to create realistic images of people, often resulting in lower preference rates. Furthermore, the model perpetuates existing social biases, favouring lighter skin tones. Xue et al. [49] present RAPHAEL, an innovative method for generating images from text. It aims to create highly artistic images that closely match complex textual prompts. The model stands out for its mixture-of-experts (MoEs) layers, incorporating both space-MoE and time-MoE layers, allowing for billions of unique diffusion paths. This distinct approach enables each path to function as a "painter," translating individual parts of the text into corresponding image segments with high fidelity. RAPHAEL has outperformed other state-of-the-art models like Stable Diffusion and DALL-E 2. It excels in generating images across diverse styles, such as Japanese comics and cyberpunk, and has achieved impressively low zero-shot FID scores on the COCO dataset. Training on a combination of a subset of LAION-5B and some internal datasets has ensured a broad and diverse range of training images and text for RAPHAEL.

Several tools based on diffusion models have also emerged, such as the following:

• Imagen2 [50]: this model can generate realistic images by improving the way it pairs images with captions in its training data. The model is adept at understanding context and can edit images, including inpainting and outpainting. It also offers style conditioning, allowing for the use of reference images to guide style adherence, providing greater flexibility and control. However, it struggles with complex object placement and specific detail generation, and there is a possibility of biased content, so safety measures are essential. Trained on a large and diverse dataset, Imagen2 achieves high-quality, contextually aligned image generation.

• DALL-E 3 [51]: has made significant strides in text-to-image generation through the use of improved image captions to enhance prompt following. By developing a custom image captioner to generate detailed, synthetic captions, the model has greatly improved its ability to follow prompts, coherence, and the overall aesthetics of the generated images. However, DALL-E 3 still grapples with issues such as spatial awareness, object placement, unreliable text rendering, and the tendency to hallucinate specific details like plant species or bird types. The model's training consists of a mix of 95% synthetic captions and 5% ground truth captions, which helps regulate inputs and prevent overfitting. This thorough training process allows DALL-E 3 to produce high-quality images with improved prompt following and coherence.

As we have seen in this section, we have analysed the different approaches that are currently being researched within the domain of text-to-image synthesis. The most commonly used techniques have been GANs, Transformers, Diffusion Models and the CLIP model. This shows that there is a large number of synthetic image generation techniques, which will allow the creation of large datasets built with many different methods. This, in turn, will allow the creation of detection models that are able to generalise better to real situations.
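Classifier-free guidance, mentioned above for Imagen and GLIDE and used at sampling time by most of the diffusion models in this section, combines a conditional and an unconditional noise prediction at every denoising step. The sketch below assumes a hypothetical eps_model(x_t, t, cond) denoiser; only the guidance rule itself is shown, under those naming assumptions.

```python
# Classifier-free guidance at one sampling step (sketch).
# `eps_model` is a hypothetical denoiser: eps_model(x_t, t, cond) -> predicted noise.
import torch

def guided_noise(eps_model, x_t, t, text_cond, null_cond, guidance_scale=7.5):
    eps_cond = eps_model(x_t, t, text_cond)    # prediction with the text prompt
    eps_uncond = eps_model(x_t, t, null_cond)  # prediction with an empty prompt
    # Move the prediction further in the direction favoured by the condition.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy check with a fake denoiser that just mixes its inputs.
fake_eps = lambda x, t, c: 0.9 * x + 0.1 * c
x = torch.randn(1, 4, 64, 64)                  # latent at step t
cond, null = torch.randn_like(x), torch.zeros_like(x)
print(guided_noise(fake_eps, x, t=500, text_cond=cond, null_cond=null).shape)
```

Larger guidance scales trade sample diversity for prompt adherence, which is why public implementations of Stable Diffusion typically expose this value as a user-facing guidance_scale parameter.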


2. Image-to-Image Translation

Recent advances in image-to-image translation have introduced several cutting-edge models that enhance the quality, efficiency, and versatility of the generated images.

Computer vision is one of the most important fields where GANs are applied, and realistic image generation is the most widely used application of these techniques. For example, Augmented CycleGAN [52] builds on the traditional CycleGAN architecture to handle more complex image-to-image translation tasks, improving domain adaptation and style transfer, and reducing artifacts. DualGAN++ [53] introduces advanced regularization techniques and optimized training strategies, resulting in higher fidelity and fewer distortions in synthetic images. CUT++ [54] refines the original CUT model with contrastive learning techniques and enhanced loss functions for generating higher-quality synthetic images, especially in scenarios with limited data availability. SPADE++ [55] incorporates new strategies for better handling spatial inconsistencies and enhancing the realism of high-resolution synthetic images, particularly effective for images with complex structures. SSIT-GAN [56] leverages self-supervised learning techniques to generate high-quality synthetic images with self-supervised loss functions, useful for applications with limited annotated data. UMGAN [57] proposes a unified approach for multimodal image-to-image translation, enabling the generation of diverse synthetic images from multiple input modalities across various applications. Zero-shot GANs [58] aim to generate images without extensive labelled data, enhancing the zero-shot learning capabilities of GANs. This approach allows for the creation of diverse and high-quality images even with minimal training data.

Recent advancements in GAN-based synthetic image generation have focused on enhancing image quality, efficiency, and usability across different domains. StyleGAN3 tackles the issue of "texture sticking" in generated images by introducing architectural revisions to eliminate aliasing, ensuring that image details move naturally with depicted objects. The new design interprets all signals continuously, achieving full equivariance to translation and rotation at subpixel scales. This results in images that maintain the high quality of StyleGAN2 but with improved internal representations, making StyleGAN3 more suitable for video and animation generation. The model was trained using high-quality datasets such as FFHQ, METFACES, AFHQ, and a newly collected BEACHES dataset. However, the architecture assumes specific characteristics of the training data, which can lead to challenges when these assumptions are not met, such as with aliased or low-quality images. Additionally, further improvements might be possible by making the discriminator equivariant and finding ways to reintroduce noise inputs without compromising equivariance [59], [60]. EfficientGAN [61] focuses on optimizing computational efficiency while maintaining high-quality image generation. This model aims to reduce the resource requirements for training GANs without compromising the visual quality of the generated images. It introduces novel architectural modifications and training strategies that balance performance and efficiency.

Other authors have explored how to combine GANs with other types of techniques, such as Latent Diffusion Models, which combine GANs with diffusion models to achieve high-resolution image synthesis. The integration of latent diffusion models helps in generating detailed and high-quality images while maintaining the robustness of GANs [62]. In contrast, Torbunov et al. [63] chose to combine them with Transformers. They introduced UVCGAN, an advanced model designed for image-to-image translation, focusing on synthetic image generation. This model improves upon the traditional CycleGAN framework by integrating a Vision Transformer (ViT) into the generator, enhancing its ability to learn non-local patterns. UVCGAN is highly effective for unpaired image-to-image translation tasks, making it a valuable tool for applications in fields such as art, design, and scientific simulations. The ViT enables more complex and nuanced image transformations, pushing the boundaries of synthetic image generation possibilities.

Recently, significant developments have been made in Variational Autoencoders (VAEs) for synthetic image generation. These advancements have resulted in the creation of innovative models that enhance the quality, efficiency, and versatility of the images generated. For instance, Conditional VAEs [64] have improved inpainting results and training efficiency by utilizing pre-trained weights and datasets such as CIFAR-10, ImageNet, and FFHQ. VQ-VAE-2 employs hierarchical latent representations to capture high-resolution details, leading to a notable improvement in image fidelity and diversity [65]. NVAE [66], with its hierarchical architecture and advanced regularization techniques, has enabled high-resolution, realistic image generation. Another example is StyleVAE [67], which integrates VAEs with style transfer techniques to produce visually appealing images with stylistic consistency. Additionally, FHVAE has enhanced the disentanglement of latent factors, allowing for better control over image attributes [68]. EndoVAE [69], developed by Diamantis et al., introduces a fresh approach for producing synthetic endoscopic images using a Variational Autoencoder (VAE). This novel technique addresses the drawbacks of traditional GAN-based models, particularly in the domain of medical imaging, where maintaining data privacy and diversity is crucial. EndoVAE is specifically designed to generate a diverse set of high-quality synthetic images, which can be used in lieu of real endoscopic images. This aids in the training of machine learning models for medical diagnosis. The outcomes illustrate that EndoVAE adeptly creates realistic endoscopic images, positioning it as a promising tool for advancing medical image analysis and circumventing the challenges stemming from limited data availability.

Furthermore, Dos Santos et al. [70] have introduced a Synthetic Data Generation System (SDGS) that utilizes Variational Autoencoders (VAEs) to produce synthetic images. Their system aims to automate the creation of synthetic datasets by using the Linked Data (LD) paradigm to collect and merge data from multiple repositories. The SDGS framework incorporates advanced feature engineering methods to enhance the quality of the dataset before training the VAE model. This results in synthetic images that closely mimic real-world data, making them extremely useful for training machine learning models, especially in scenarios where actual data is scarce. The system's efficacy has been confirmed through various case studies, demonstrating that the generated synthetic data achieves high accuracy and closely resembles the original datasets in crucial characteristics. Seunghwan et al. [71] have introduced a new method for creating synthetic data using Variational Autoencoders (VAEs). Their approach overcomes the limitations of the typical Gaussian assumption in VAEs by incorporating an infinite mixture of asymmetric Laplace distributions in the decoder. This advancement provides more flexibility in capturing the underlying data distribution, which is crucial for generating high-quality synthetic data. Their model, known as "DistVAE," has demonstrated exceptional performance in generating synthetic datasets that maintain statistical similarity to the original data while also ensuring privacy preservation. The effectiveness of the approach was confirmed through experiments on various real-world tabular datasets, indicating that DistVAE can generate accurate synthetic data while allowing for adjustable privacy levels through a tunable parameter. This makes it particularly valuable in situations where data privacy is a concern.

Finally, we can see how the use of diffusion models in image-to-image translation is also beginning to be explored. For example, Parmar et al. [72] proposed pix2pix-zero, a method for image-to-image translation without relying on text prompts or additional training. This approach utilizes cross-attention guidance to maintain image structure and automatically discovers editing directions in the text embedding space. The architecture leverages pre-trained Stable Diffusion models for tasks like object type changes and style transformations. The model's performance is assessed using real and synthetic images from the LAION-5B dataset. However, some limitations include the low resolution of the cross-attention map for fine details and challenges with atypical poses and fine-grained edits.
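The VAE-based generators discussed above all optimise the same underlying objective, the evidence lower bound (ELBO): a reconstruction term plus a KL regulariser on the latent code. The sketch below is a deliberately small fully connected VAE in PyTorch with illustrative sizes and stand-in data; it is not NVAE, VQ-VAE-2, EndoVAE or DistVAE, which build hierarchical or discrete latents on top of this basic recipe.

```python
# Minimal VAE and its ELBO loss (illustrative sketch, not a cited model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, 128)
        self.mu, self.logvar = nn.Linear(128, z_dim), nn.Linear(128, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(),
                                 nn.Linear(128, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterisation trick
        return self.dec(z), mu, logvar

def elbo_loss(x, x_hat, mu, logvar):
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")     # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q(z|x) || N(0, I))
    return recon + kl

vae = TinyVAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
x = torch.rand(32, 784)            # stand-in batch of flattened 28x28 images in [0, 1]
x_hat, mu, logvar = vae(x)
loss = elbo_loss(x, x_hat, mu, logvar)
opt.zero_grad(); loss.backward(); opt.step()
# New images are sampled by decoding z ~ N(0, I):
samples = vae.dec(torch.randn(8, 16))
```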


TABLE III. Overview of Techniques for Detecting AI-Generated Images

Authors | Year | Technique | Target Outcome | Data Used | Open Source
Shiohara et al. [19] | 2022 | Self-blended images | Detect fake or synthetic images | Self-blended image data | Yes
Wang et al. [79] | 2023 | Diffusion reconstruction error | Detect diffusion model-generated images | DiffusionForensics dataset | Yes
Ma et al. [80] | 2023 | Deterministic reverse and denoising computation errors | Detect images from diffusion models | CIFAR-10, TinyImageNet, CelebA | Yes
Zhong et al. [78] | 2023 | Texture patch analysis | Identify AI-generated images | Datasets from 17 generative models | Yes
Lorenz et al. [82] | 2023 | Intrinsic dimensionality-based | Detect artificial images from deep diffusion models | CiFake, ArtiFact, DiffusionDB, LAION-5B, SAC | Yes
Alzantot et al. [77] | 2023 | Wavelet-packet representation analysis | Differentiate real and synthetic images | FFHQ, CelebA, LSUN, FaceForensics++ | Yes
Poredi et al. [75] | 2023 | Frequency analysis | Identify AI-generated images on social media | Stanford image dataset | Yes
Bammey et al. [76] | 2023 | Frequency artifacts analysis | Detect images generated by diffusion models | RAISE and Dresden datasets | Yes
Guarnera et al. [83] | 2023 | Hierarchical classification | Identify deepfake images | CelebA, FFHQ, ImageNet | Yes
Ojha et al. [85] | 2023 | Universal fake image detector | Enhance detection of synthetic or fake images | Images generated by various models | Yes
Mathys et al. [86] | 2024 | CNN-based pixel-level analysis | Identify synthetic images | Diverse dataset with real and synthetic images | No
Coccomini et al. [84] | 2024 | Visual and textual feature classification | Detect synthetic images from diffusion models | MSCOCO and Wikimedia datasets | Yes
Tan et al. [87] | 2024 | Category Common Prompt in CLIP | Enhance detection of deepfakes | Images generated by various models | Yes
Sinitsa et al. [74] | 2024 | Fingerprint-based | Detect synthetic images with low-budget models | Various models' datasets | Yes
Keita et al. [88] | 2024 | Vision-language model with dual LoRA mechanism | Detect synthetic images using a vision-language model | Various datasets | Yes

translation without relying on text prompts or additional training. This approach utilizes cross-attention guidance to maintain image structure and automatically discovers editing directions in the text embedding space. The architecture leverages pre-trained Stable Diffusion models for tasks like object type changes and style transformations. The model's performance is assessed using real and synthetic images from the LAION-5B dataset. However, some limitations include the low resolution of the cross-attention map for fine details and challenges with atypical poses and fine-grained edits.

In this section we have analysed the latest work in the field of Image-to-Image translation, focusing on image alterations while maintaining some visual features. Within this domain we have looked at three main approaches: GANs, AutoEncoders and diffusion models. We can observe that, although this domain has been widely explored, it still presents a wide range of possibilities.

B. Detection of AI-Generated Images

The development of generative models requires the creation of detection models to differentiate between AI-generated and real images. Detection methods can be split into two main types: those focused solely on improving detection performance and those that enhance detectors with additional features such as generalizability, robustness, and interpretability while maintaining accurate and effective detection capabilities. An overview of techniques for detecting AI-generated images is provided in Table III, summarizing various methods and their key features, including the application areas and datasets used. For example, the Deep Image Fingerprint (DIF) [74] method is specifically designed to detect low-budget synthetic images. It can identify images generated by both Generative Adversarial Networks (GANs) and Latent Text-to-Image Models (LTIMs). The method utilizes datasets from various models, including CycleGAN, ProGAN, BigGAN, StyleGAN, Stable Diffusion, DALL·E-2, and GLIDE, and achieves high detection accuracy with minimal training samples. While it excels in detecting synthetic images, it may encounter some challenges with models like GLIDE and DALL·E-2 due to their weaker, less distinct fingerprints.

Some authors still opt for more traditional techniques, such as the Fourier Transform, for the detection of artefacts left in the image samples. For example, the AUSOME (AUthenticating SOcial MEdia) [75] method is focused on identifying AI-generated images on social media. It achieves this by utilizing frequency analysis techniques, such as the Discrete Fourier Transform (DFT) and Discrete Cosine Transform (DCT), to compare the spectral features of AI-generated images, like those produced by DALL-E 2, with legitimate images from the Stanford image dataset. AUSOME can distinguish between AI-generated and real images by examining differences in frequency responses. Although it demonstrates high accuracy, it may encounter difficulties when dealing with images where semantic content is essential for determining authenticity. Nevertheless, this method presents a promising approach for verifying social media images, particularly in light of the increasing prevalence of AI-generated content. Synthbuster [76] is a technique developed to identify images created by diffusion models by analyzing frequency artifacts in the Fourier transform of residual images. This method is effective at spotting synthetic images, even when they are slightly compressed in JPEG format, and it works well with unknown models. It analyzes real images from the RAISE and Dresden datasets and synthetic images from various models such as Stable Diffusion, Midjourney, Adobe Firefly, DALL·E 2, and DALL·E 3. While Synthbuster is generally effective, it may encounter challenges when dealing with different compression levels and diverse image categories.
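
As a rough illustration of this family of frequency-domain detectors, the sketch below (not the AUSOME or Synthbuster code) computes a high-pass residual of an image and inspects its Fourier spectrum; a generator that imprints periodic artifacts concentrates energy at characteristic frequencies. The filter, the `artifact_peaks` locations and the decision threshold are illustrative assumptions, not values from the surveyed papers.

```python
import numpy as np

def spectral_features(image: np.ndarray) -> np.ndarray:
    """Log-magnitude Fourier spectrum of a high-pass residual (grayscale image in [0, 1])."""
    # A simple cross-shaped high-pass filter suppresses scene content and keeps
    # the periodic traces that many generators leave behind.
    residual = image - 0.25 * (
        np.roll(image, 1, axis=0) + np.roll(image, -1, axis=0)
        + np.roll(image, 1, axis=1) + np.roll(image, -1, axis=1)
    )
    return np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(residual))))

def looks_synthetic(image: np.ndarray, artifact_peaks: np.ndarray, threshold: float) -> bool:
    """Flag the image when the spectrum has unusually strong energy at the
    frequency bins where a given generator is known to leave artifacts."""
    spectrum = spectral_features(image)
    score = spectrum[artifact_peaks[:, 0], artifact_peaks[:, 1]].mean() / spectrum.mean()
    return bool(score > threshold)
```

In practice such hand-crafted scores are usually complemented by a learned classifier over the full spectrum, which is closer to what the surveyed methods do.
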


Other authors focus on taking advantage of textures, in order to exploit all available information. For instance, Alzantot et al. [77] proposed multi-scale wavelet-packet representations. Their deepfake image analysis and detection technique aims to differentiate real from synthetic images by analyzing their spatial and frequency information. This method has undergone evaluation using various datasets, including FFHQ, CelebA, LSUN, and FaceForensics++. It has shown strong capabilities in identifying GAN-generated images, such as those created by StyleGAN. However, it may face challenges when analyzing complex images where semantic information is crucial, and its effectiveness may be limited to the detection of image-based synthetic media. PatchCraft [78] introduces a fresh approach to identifying synthetic AI-generated images. Instead of relying solely on global semantic information, this method focuses on analyzing texture patches within the images for more effective detection. To enhance detection, the method employs a preprocessing step called Smash&Reconstruction, which removes global semantic details and amplifies texture patches, thereby utilizing the contrast between rich and poor texture regions to boost performance. Tested on datasets from 17 common generative models, including ProGAN, StyleGAN, BigGAN, CycleGAN, ADM, Glide, and Stable Diffusion, the method has shown superior adaptability and resilience against previously unseen models and image distortions. Nevertheless, it may encounter challenges when dealing with images in which semantic information is critical for accurate detection.
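
To make the idea of wavelet-packet features concrete, a minimal sketch using PyWavelets is shown below; it summarizes each packet at a given decomposition level by the mean log-energy of its coefficients, producing a compact spatial-frequency descriptor that can feed any standard classifier. This is an illustration of the general technique, not the authors' implementation; the wavelet, level and pooling are assumptions.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_packet_features(image: np.ndarray, wavelet: str = "db2", level: int = 3) -> np.ndarray:
    """Mean log-energy of every wavelet-packet sub-band of a grayscale image.

    Real and generated images tend to distribute energy differently across the
    finer sub-bands, which is what wavelet-packet detectors exploit.
    """
    wp = pywt.WaveletPacket2D(data=image, wavelet=wavelet, maxlevel=level)
    bands = wp.get_level(level)  # all sub-bands at the chosen depth
    return np.array([np.log1p(np.abs(node.data)).mean() for node in bands])
```

The resulting feature vector (one value per sub-band) can then be fed to, for example, a logistic-regression or gradient-boosting classifier trained on real versus generated images.
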
An analysis of the error introduced in generated images has also been a productive research line. For example, the DIRE (DIffusion REconstruction Error) [79] method is utilized to identify images created through diffusion processes by comparing the reconstruction error between an original image and its reconstructed version using a pre-trained diffusion model. This technique is based on the idea that diffusion-generated images can be accurately reconstructed using diffusion models, unlike genuine images. DIRE has been evaluated using the DiffusionForensics dataset, encompassing images from various diffusion models, including ADM, DDPM, and iDDPM. It has demonstrated notable accuracy in detecting images and is resilient to unseen diffusion models and alterations. Nonetheless, it may encounter difficulties with the intricate features of real images. Shiohara et al. [19] have introduced an innovative approach for detecting fake or synthetic images, specifically deepfakes. They utilize self-blended images (SBIs) as synthetic training data to enhance the robustness of detection models. This allows the models to effectively identify various types of deepfake manipulations by scrutinizing inconsistencies and artifacts in the images. Consequently, this method provides a robust tool for preserving the authenticity of digital media in the face of increasingly advanced generative techniques. The SeDID [80] method utilizes deterministic reverse and denoising computation errors found in diffusion models. This approach includes two branches: the statistical-based SeDIDStat and the neural network-based SeDIDNNs. SeDID was evaluated on various datasets like CIFAR-10, TinyImageNet, and CelebA and demonstrated superior detection accuracy and robustness against unseen diffusion models and perturbations. However, the method may encounter challenges when dealing with the complex features of real images. Nevertheless, SeDID underscores the importance of selecting the optimal timestep to enhance detection performance.
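
The reconstruction-error idea is easy to express in a few lines. The sketch below assumes two placeholder callables that wrap a pre-trained diffusion model, `invert` (image to noise latent, e.g. via DDIM inversion) and `reconstruct` (the deterministic reverse process back to an image); neither is a real library API. The resulting residual map plays the role of the DIRE input to a binary classifier.

```python
import torch

@torch.no_grad()
def reconstruction_error_map(x: torch.Tensor, invert, reconstruct) -> torch.Tensor:
    """Absolute difference between an image and its round-trip through a
    pre-trained diffusion model. Diffusion-generated images reconstruct almost
    perfectly, so their error map is small; real images leave larger residuals."""
    x_rec = reconstruct(invert(x))
    return (x - x_rec).abs()

def detect(x: torch.Tensor, invert, reconstruct, classifier) -> torch.Tensor:
    """`classifier` is any binary image classifier (e.g. a ResNet) trained on
    error maps of real versus diffusion-generated images."""
    return classifier(reconstruction_error_map(x, invert, reconstruct))
```
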
As expected, another approach widely used by state-of-the-art researchers is Convolutional Neural Networks (CNNs), which have demonstrated excellent performance on numerous similar classification problems [81], making them one of the most explored techniques. Some authors continue to rely on classical architectures such as ResNet, which still performs competitively on many classification problems. Among them, the multi-local Intrinsic Dimensionality (multiLID) [82] method is developed to identify artificial images produced by deep diffusion models. This method utilizes the local intrinsic dimensionality of feature maps extracted by an untrained ResNet18, making it efficient and not reliant on pre-trained models. It has been evaluated on various datasets like CiFake, ArtiFact, DiffusionDB, LAION-5B, and SAC, demonstrating high accuracy in detecting artificial images from models including Glide, DDPM, Latent Diffusion, Palette, and Stable Diffusion. However, multiLID may have limitations in its ability to perform well on unfamiliar data from different datasets or models within the same domain. Guarnera et al. [83] developed a hierarchical multi-level approach for the detection and identification of deepfake images produced by GANs and Diffusion Models (DMs). This method utilizes ResNet-34 models at three levels of classification: distinguishing genuine images from AI-generated ones, discerning between GANs and DMs, and identifying specific AI architectures. Their dataset comprises authentic images from CelebA, FFHQ, and ImageNet, as well as synthetic images from nine GAN models (e.g., AttGAN, CycleGAN, ProGAN, StyleGAN, StyleGAN2) and four diffusion models (e.g., DALL-E 2, GLIDE, Latent Diffusion), totalling 42,500 synthetic and 40,500 real images. With an accuracy of over 97%, the method demonstrates strong performance, but it may encounter challenges related to real-world robustness, such as JPEG compression and complex image features.
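
For reference, the local intrinsic dimensionality that multiLID builds on is typically estimated with the maximum-likelihood estimator over the k nearest neighbours of a feature vector; the sketch below shows that estimator in isolation (multiLID itself aggregates many such estimates over the channels of an untrained ResNet18 and feeds them to a classifier).

```python
import numpy as np

def lid_mle(query: np.ndarray, references: np.ndarray, k: int = 20) -> float:
    """Maximum-likelihood estimate of local intrinsic dimensionality.

    query:      (D,) feature vector whose neighbourhood is analysed
    references: (N, D) reference feature vectors (the query itself excluded)
    """
    dists = np.sort(np.linalg.norm(references - query, axis=1))[:k]
    dists = np.maximum(dists, 1e-12)  # avoid log(0)
    return float(-1.0 / np.mean(np.log(dists / dists[-1])))
```
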
However, other authors have opted for different architectures rather than CNNs. Coccomini et al. [84] investigate the detection of synthetic images generated by diffusion models, such as those created with Stable Diffusion and GLIDE. Their approach involves using classifiers like multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs) to distinguish synthetic images from real ones. The model is trained on datasets like MSCOCO and Wikimedia, focusing on leveraging visual and textual features for effective detection. A notable limitation of the study is the challenge of cross-method generalization, where models trained on one type of synthetic image struggle to detect images generated by different methods. This work underscores the complexities of detecting AI-generated images, particularly as diffusion models become more sophisticated. Ojha et al. [85] have introduced a method to enhance the detection of synthetic or fake images generated by various models, including GANs and diffusion models. Their approach aims to create a universal fake image detector that performs well across different generative models. This is achieved through a combination of convolutional neural networks (CNNs) and advanced training techniques to identify subtle anomalies commonly found in AI-generated images. The model is trained on diverse datasets, incorporating images generated by various models to improve its reliability. However, the study highlights a challenge in maintaining high detection accuracy when faced with new generative models not included in the training set, indicating the need for further improvements to achieve universal detection capabilities. Mathys et al. [86] present a method for identifying synthetic images produced by AI models. The focus is on spotting subtle artifacts and inconsistencies that are indicative of AI-generated content. Their proposed architecture utilizes a convolutional neural network to scrutinize pixel-level details and capture the distinct markers left by generative models. Training the model on a diverse dataset containing both real and synthetic images from various sources makes it adept at generalizing across different types of AI-generated content. This method significantly boosts the accuracy of detecting fake images, effectively tackling the challenges brought about by the increasingly lifelike outputs of modern generative models. This research holds particular significance in upholding the authenticity and integrity of digital content in an age where synthetic media is increasingly prevalent.


Fig. 5. Overview of the main approaches to video generation with AI (examples produced by Gen2, SVD and Imagen from a text prompt or a prompt plus an image).

Lastly, we will analyse some research that has chosen other novel approaches, such as the use of models like CLIP or vision-language models. Tan et al. [87] introduce C2P-CLIP, a novel approach designed to enhance the detection of AI-generated images, specifically deepfakes, by injecting a Category Common Prompt (C2P) into the CLIP model. CLIP (Contrastive Language-Image Pre-training) is a powerful model trained on various image-text pairs, which allows it to understand and match images and text descriptions effectively. However, its application to deepfake detection has been limited by its generalization capability across different types of manipulations. The C2P-CLIP method addresses this limitation by incorporating a category-specific prompt that captures standard features across related deepfakes, improving the model's ability to generalize beyond the specific types of manipulations seen during training. This technique leverages the extensive pre-training of CLIP while fine-tuning its capacity to identify subtle inconsistencies and artifacts introduced by deepfake generation techniques. Through comprehensive experiments, the authors demonstrate that C2P-CLIP significantly outperforms existing methods on several benchmark datasets, showing superior performance in detecting a wide range of AI-generated manipulations. Keita et al. [88] present Bi-LORA, a vision-language approach designed to detect synthetic images. Bi-LORA effectively captures the unique features and artefacts of AI-generated images by leveraging a dual Low-Rank Adaptation (LORA) mechanism within a vision-language model. The method integrates visual and textual information, enhancing its ability to differentiate between real and synthetic content more accurately. Through extensive experiments, Bi-LORA demonstrates significant improvements in detection performance over traditional methods, highlighting its potential as a robust tool for identifying AI-generated images across various datasets.
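
As a simple illustration of how CLIP-style features are commonly reused for this task (a sketch in the spirit of these works, not the C2P-CLIP or Bi-LORA code), one can freeze a pre-trained CLIP image encoder and train only a small linear head on real-versus-generated labels; the checkpoint name and the head are assumptions.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip.requires_grad_(False)                       # keep the backbone frozen
head = nn.Linear(clip.config.projection_dim, 2)  # real vs. AI-generated

def logits_for(images):
    """images: a list of PIL images; only the linear head carries trainable weights."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)            # (N, projection_dim)
    return head(feats / feats.norm(dim=-1, keepdim=True))    # normalised CLIP features
```

Prompt- and adapter-based methods go further by also adapting the text side of the model, but a frozen-backbone probe of this kind is the usual starting point.
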
Lastly, we have analysed the most recent research into the detection of synthetic images. This field is highly dependent on the previous one, as quality datasets will be needed, i.e. with intra-class variability, enough quality and resolution, and representativeness, allowing the creation of models that can be used in real situations. In this domain we have seen that the main approaches explored by researchers are CNNs and vision-language models, although other more traditional approaches are still used.

IV. Video Generation and Detection

In recent years, the field of video generation has attracted significant attention due to advancements in artificial intelligence, machine learning, and the emergence of diffusion models (see Fig. 5); this has forced researchers to develop new techniques to detect these synthetic samples. This section provides an overview of the current state of video generation methods, which are increasingly being used to create high-quality, realistic videos across different applications. Additionally, it explores the challenges and methods associated with detecting AI-generated videos, an area of growing importance as these technologies become more sophisticated. The aim of this section is to provide a comprehensive understanding of the methods and techniques involved in future video content creation and analysis.

A. Video Generation

In video content creation, generative models are beginning to revolutionize production and consumption by automating the generation of realistic and high-quality videos. Recently, a surge of generative video models capable of various video creation tasks has emerged. In this section we are going to analyse five different approaches: Text-to-Video, deep learning techniques that generate synthetic video samples from text descriptions; Image-to-Video, techniques that transform static images into dynamic video; Video-to-Video, a set of techniques focused on the generation of realistic video sequences by transforming or translating visual information from one video domain to another; Text-Image-to-Video, which generates synthetic video samples from a real image and a text description; and Multimodal video generation, which focuses not only on the visual part of the video but also on the audio part, from different inputs such as text, image, video or audio. Deep learning-based generative models such as GANs, Variational Autoencoders (VAEs), autoregressive, and diffusion-based models have remarkably succeeded in generating realistic and diverse content. By training on large datasets, these models learn the underlying data distribution, enabling them to generate samples that closely resemble the original data. Fig. 6 illustrates the various categories of video generation.

Fig. 6. Categories of video generation methods: Text-to-Video (T2V), Image-to-Video (I2V), Text-Image-to-Video (TI2V) and Video-to-Video (V2V).

1. Text-to-Video Synthesis


Generating photo-realistic videos presents significant challenges, particularly when it comes to maintaining high fidelity and continuity of motion over extended sequences. Despite these difficulties, recent advancements have utilized diffusion models to enhance the realism of video generation. Text, being a highly intuitive and informative form of instruction, has become a central tool in guiding video synthesis, leading to the development of Text-to-video (T2V) generation models. This approach focuses on creating high-quality videos based on text descriptions, acting as a conditional input for the video generation process.

To address the challenges in text-to-video synthesis, existing methods primarily extend Text-to-image models by incorporating temporal modules, such as temporal convolutions and temporal attention, to establish temporal correlations between video frames. A notable example is the work by Ho et al. [89], who introduced Video Diffusion Models (VDM). This model extends text-to-image diffusion models to video generation by training jointly on both image and video data. Their approach utilizes a U-Net-based architecture, which integrates joint image-video denoising losses, ensuring temporal coherence by conditioning on both past and future frames, thus resulting in smoother transitions and more consistent motion. Building on this foundation, Ho et al. [90] proposed Imagen Video, a novel approach for generating high-definition videos using diffusion models. Imagen Video employs a cascaded video diffusion model approach, adapting techniques from text-to-image generation, such as a frozen T5 text encoder and classifier-free guidance, to the video domain. It uses a hierarchical approach, beginning with a low-resolution video to capture the overall structure and motion, which is then progressively refined to higher resolutions. Temporal dynamics are managed by conditioning each frame on previous frames, ensuring consistency throughout the video. Super-resolution techniques are subsequently applied to enhance the detail and quality of each frame.
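
The "inflation" of a text-to-image backbone with temporal layers, which most of the models in this subsection share, boils down to attention applied along the frame axis while spatial layers are reused as-is. The module below is a generic sketch of such a temporal-attention block (not the layer of any particular paper); channel and head counts are placeholders.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the time axis only, one common way a text-to-image
    U-Net block is extended for video (illustrative sketch)."""
    def __init__(self, channels: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)  # one sequence per spatial location
        out, _ = self.attn(self.norm(tokens), self.norm(tokens), self.norm(tokens))
        out = (tokens + out).reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
        return out
```

In practice such layers are interleaved with the frozen or jointly trained spatial blocks of the underlying image model.
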
In a different approach, Singer et al. [91] introduced Make-A-Video, which generates videos from textual descriptions without relying on paired text-video data. This methodology builds upon a text-to-image synthesis model and incorporates spatio-temporal layers to extend it into the video domain. The approach integrates pseudo-3D convolutional and attention layers to manage spatial and temporal dimensions efficiently. Additionally, super-resolution networks are employed to improve visual quality, and a frame interpolation network is used to increase the frame rate and smooth out the video output. Meanwhile, Zhou et al. [92] presented MagicVideo, a framework designed to generate high-quality video clips from textual descriptions. Instead of directly modeling the video in visual space, MagicVideo leverages a pre-trained Variational autoencoder (VAE) to map video clips into a low-dimensional latent space, where the distribution of videos' latent codes is learned via a diffusion model. This approach optimizes computational efficiency and improves video synthesis by performing the diffusion process in the latent space. Further pushing the boundaries of video generation, Kondratyuk et al. [5] proposed VideoPoet, an advanced language model for zero-shot video generation. This model integrates the MAGVIT-v2 [93] tokenizer for images and videos and the SoundStream [94] tokenizer for audio, enabling the processing and generation of multimedia content within a unified framework. VideoPoet employs a prefix language model with a decoder-only architecture as its backbone, facilitating the creation of high-quality videos from textual prompts, along with interactive editing capabilities. VideoPoet is trained on a diverse set of tasks without needing paired video-text data, allowing it to learn effectively from video-only examples. It can generate videos based on textual descriptions, animate static images, apply styles [95] to videos through optical flow and depth prediction, and even extend video sequences by iteratively predicting subsequent frames.

In another innovative approach, Girdhar et al. [96] introduced EMU VIDEO, a two-stage Text-to-video generation model: first, it generates an image from text, and then it produces a video using both the text and the generated image. This method simplifies video prediction by leveraging a pretrained text-to-image model and freezing spatial layers while adding new temporal layers for video generation. EMU VIDEO efficiently achieves high-resolution video generation, maintaining the conceptual and stylistic diversity learned from large image-text datasets. Similarly, Wang et al. [97] proposed LaVie, a cascaded framework for Video Latent Diffusion Models (V-LDMs) conditioned on text descriptions. LaVie is composed of three networks: a base T2V model for generating short, low-resolution key frames, a Temporal interpolation (TI) model for increasing the frame rate and enriching temporal details, and a Video super-resolution (VSR) model for enhancing the visual quality and spatial resolution of the videos. The base T2V model modifies the original 2D UNet to handle spatio-temporal distributions and utilizes joint fine-tuning with both image and video data to prevent catastrophic forgetting, resulting in significant video quality improvements. The TI model uses a diffusion UNet to synthesize new frames, enhancing video smoothness and coherence, while the VSR model adapts a pre-trained image upscaler with additional temporal layers, enabling efficient training and high-quality video generation.

Further developments include the work by Menapace et al. [98], who proposed a method to generate high-resolution videos by modifying the Efficient diffusion model (EDM) [99] framework for high-dimensional inputs and developing a scalable transformer architecture inspired by Far-reaching interleaved transformers (FITs) [100]. They adjust the EDM framework to handle high SNR in videos with a scaling factor for optimal denoising. This method addresses the scarcity of captioned video data by jointly training the model on both images and videos, allowing for more effective learning of temporal dynamics. The video generation uses FITs, transformer models that reduce complexity by compressing inputs with learnable latent tokens and employing cross-attention and self-attention to focus on spatial and temporal information. The approach includes conditioning tokens for text and metadata and uses a cascade model: the first stage generates low-resolution videos, and the second stage refines them into high-resolution outputs. During training, variable noise levels are introduced to the second-stage inputs to improve upsampling quality, aiming for effective high-quality video generation. In addressing data scarcity, Chen et al. [101] designed VideoCrafter2, a model that improves spatio-temporal consistency in video diffusion models through a data-level disentanglement strategy. This approach separates motion aspects from appearance features, leveraging low-quality videos for motion learning and high-quality images for appearance learning. This design strategy eases a targeted fine-tuning process with high-quality images, with the aim of significantly increasing the visual fidelity of the generated content without compromising the precision of motion dynamics. Importantly, synthetic images with complex concepts are used for fine-tuning, rather than real images, to enhance the concept composition ability of video models.


Furthermore, Ma et al. [102] introduced Latte, a simple and general video diffusion method that extends Latent diffusion models (LDMs) for video generation by employing a series of transformer blocks to process latent space representations of video data obtained from a pre-trained variational autoencoder. Latte specifically addresses the inherent disparities between spatial and temporal information in videos by decomposing these dimensions, allowing for more efficient processing. The method includes four efficient Transformer-based model variants, designed to manage the large number of tokens extracted from input videos, thereby improving the overall performance and scalability of video generation. Li et al. [103] introduced VideoGen, a text-to-video generation method that produces high-definition videos with strong frame fidelity and temporal consistency using reference-guided latent diffusion. In their approach, an off-the-shelf T2I model like Stable diffusion (SD) generates a high-quality image from a text prompt, which then serves as a reference for video generation. This process involves a cascaded latent diffusion module conditioned on both the reference image and text prompt, followed by a flow-based temporal upsampling step that enhances temporal resolution. Finally, a video decoder maps the latent video representations into high-definition videos, improving visual fidelity and reducing artifacts while focusing on learning video dynamics. The training process benefits from high-quality unlabeled video data, using the first frame of a ground-truth video as the reference image to enhance motion smoothness and realism.

Building on the VQ-VAE architecture, the authors of [104] proposed GODIVA, an open-domain text-to-video model pre-trained on the HowTo100M [105] dataset. This model generates videos in an auto-regressive manner using a three-dimensional sparse attention mechanism. Initially, a VQ-VAE auto-encoder represents continuous video pixels as discrete video tokens. Subsequently, the three-dimensional sparse attention model utilizes language input alongside these discrete video tokens to generate videos, effectively considering temporal, column, and row information. Similarly, Ding et al. [106] advanced the field by introducing CogVideo, a 9B-parameter transformer built upon the pretrained text-to-image model CogView2 [42] for video generation. CogVideo employs a multi-frame-rate hierarchical training strategy, which aligns text with video clips by controlling frame generation intensity and ensuring accurate alignment between text and video content. This is achieved by prepending text prompts with frame rate descriptions, which significantly enhances generation accuracy, particularly for complex semantic movements. Additionally, CogVideo's dual-channel attention mechanism improves the coherence of generated videos by focusing on both textual and visual cues simultaneously. This approach allows CogVideo to efficiently adapt a pretrained model for video synthesis without the need for costly full retraining.
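
The autoregressive token-based pipeline shared by GODIVA, CogVideo and (below) VideoGPT can be summarised in a short sketch: a VQ-VAE turns frames into discrete codebook indices, and a causal transformer then predicts those indices one at a time conditioned on the text. Both `transformer` and the token counts below are placeholders, not real APIs.

```python
import torch

@torch.no_grad()
def generate_video_tokens(transformer, text_tokens, num_video_tokens, codebook_size):
    """Autoregressive sketch of VQ-token video generation: a causal transformer
    predicts the next discrete video token given the text prompt and all
    previously generated tokens. `transformer` maps a token sequence to
    next-token logits and stands in for the real model."""
    seq = text_tokens  # (1, L_text) conditioning prefix of text token ids
    for _ in range(num_video_tokens):
        logits = transformer(seq)[:, -1, :codebook_size]      # next-token distribution
        next_tok = torch.multinomial(logits.softmax(-1), 1)   # sample one codebook index
        seq = torch.cat([seq, next_tok], dim=1)
    return seq[:, text_tokens.shape[1]:]  # discrete tokens, later decoded by the VQ-VAE decoder
```
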
Expanding on the capabilities of earlier models, Wu et al. [107] developed NUWA, a unified multimodal pre-trained model designed for generating and manipulating visual data, including images and videos, across various visual synthesis tasks. NUWA utilizes a 3D transformer encoder-decoder framework to process 1D text, 2D images, and 3D videos. This model introduces a 3D nearby attention (3DNA) mechanism that efficiently handles visual data, reduces computational complexity, and enables high-quality synthesis with notable zero-shot capabilities. Further advancing this work, Wu et al. [108] introduced NUWA-Infinity, a groundbreaking model for infinite visual synthesis capable of generating high-resolution images or long-duration videos of arbitrary size. The model features an autoregressive-over-autoregressive generation mechanism, with a global patch-level model managing inter-patch dependencies and a local token-level model handling intra-patch dependencies. To optimize efficiency, NUWA-Infinity incorporates a Nearby context pool (NCP) to reuse previously generated patches, minimizing computational costs while maintaining robust dependency modeling. Additionally, an Arbitrary direction controller (ADC) enhances flexibility by determining optimal generation orders and learning position embeddings tailored for diverse synthesis tasks. NUWA-Infinity thus transcends the limitations of fixed-size approaches, enabling comprehensive and efficient content creation on a variable scale. In contrast to these approaches, Yan et al. [109] proposed VideoGPT, a simpler and more efficient architecture for scaling likelihood-based generative modeling to natural videos. By employing VQ-VAE with 3D convolutions and axial self-attention, VideoGPT learns downsampled discrete latent representations of raw videos. These representations are then autoregressively modeled by a GPT-like architecture with spatio-temporal position encodings to generate videos. This method involves training a VQ-VAE with an encoder that downsamples space-time and a decoder that upsamples it, sharing spatio-temporal embeddings across attention layers. Furthermore, a prior over the VQ-VAE latent codes is learned using an Image-GPT-like architecture with dropout for regularization, which enables conditional sample generation via cross attention and conditional norms. Blattmann et al. [110] introduced a novel approach to efficient high-resolution video generation through Video LDMs, by adapting pre-trained image diffusion models into video generators. They achieve this by temporal fine-tuning with alignment layers, which maintains computational efficiency. Initially, an LDM is pre-trained on images and then transformed into a video generator by adding a temporal dimension and fine-tuning on video sequences. Additionally, diffusion model upsamplers are temporally aligned for consistent video super-resolution, allowing the efficient training of high-resolution, long-term consistent video generation models using pre-trained image LDMs with added temporal alignment.

Building on these advancements, Chen et al. [111] introduced two diffusion models for high-quality video generation: T2V and Image-to-video (I2V). The T2V model, based on SD 2.1, incorporates temporal attention layers to ensure temporal consistency and employs a joint image and video training strategy. The VideoCrafter T2V model further leverages a Latent Video Diffusion Model (LVDM) with a video VAE and a video latent diffusion model, where the VAE reduces sample dimensions to improve efficiency. Video data is encoded into a compressed latent representation, processed through a diffusion model with noise added at each timestep, before being decoded by the VAE to generate the final video. He et al. [112] expanded on the concept of video generation by introducing a hierarchical LVDM framework that extends videos beyond the training length. Their method addresses performance degradation with conditional latent perturbation and unconditional guidance. Their lightweight video diffusion models use a low-dimensional 3D latent space, significantly outperforming pixel-space models with limited computational resources. By compressing videos into latents using a video autoencoder and utilizing a unified video diffusion model for both unconditional and conditional generation, their approach generates videos autoregressively and improves coherence and quality over extended lengths with hierarchical diffusion.


Fig. 7. SimDA [116] architecture.

To further advance video generation, Wang et al. [113] proposed ModelScope Text-to-Video (ModelScopeT2V), a simple yet effective baseline for video generation. This model introduces two key technical contributions: a spatio-temporal block to model temporal dependencies in text-to-video generation, and a multi-frame training strategy with both image-text and video-text paired datasets to enhance semantic richness. ModelScopeT2V evolves from a text-to-image model (Stable Diffusion) and includes spatio-temporal blocks to ensure consistent frame generation and smooth transitions, adapting to varying frame numbers during training and inference. In the realm of scalable and efficient video generation, Gupta et al. [114] proposed W.A.L.T, a simple yet scalable and efficient transformer-based framework for latent video diffusion models. Their approach consists of two stages: an autoencoder compresses images and videos into a lower-dimensional latent space, allowing for efficient joint training on combined datasets. Subsequently, the transformer employs window-restricted self-attention layers that alternate between spatial and spatio-temporal attention, reducing computational demands and supporting joint image-video processing. This method facilitates high-resolution, temporally consistent video generation from textual descriptions, offering an innovative approach to T2V synthesis. Villegas et al. [115] contributed to the field by proposing Phenaki, a unique C-ViViT encoder-decoder structure for generating variable-length videos from textual inputs. This model compresses video data into compact tokens, allowing for the production of coherent and detailed videos. By utilizing a bidirectional masked transformer to translate text tokens into video tokens, the model can generate long, temporally coherent videos from both open-domain and sequential prompts. It also improves video token compression by 40% by exploiting temporal redundancy, enhancing reconstruction quality and accommodating variable video lengths, while the causal variation of ViViT manages temporal and spatial dimensions in an auto-regressive manner.

Previous methods of text-to-video generation face high computational costs with pixel-based VDMs or struggle with text-video alignment with latent-based VDMs. To marry the strengths and alleviate the weaknesses of pixel-based and latent-based VDMs, Zhang et al. [117] proposed Show-1, a hybrid model that combines both pixel-based and latent-based VDMs to overcome the limitations of previous methods. By employing pixel-based VDMs to create low-resolution videos with strong text-video correlation, and then using latent-based VDMs to upsample these to high resolution, Show-1 ensures precise text-video alignment, natural motion, and high visual quality with reduced computational cost. Khachatryan et al. [118] built upon the Stable diffusion T2I model to develop Text2Video-Zero, a zero-shot T2V synthesis model. This approach enriches latent codes with motion dynamics to ensure temporal consistency and employs a cross-frame attention mechanism to maintain object appearance and identity across frames. Although Text2Video-Zero enables high-quality, temporally consistent video generation from textual descriptions without additional training, leveraging existing pre-trained T2I models, there is still potential for improvement: it struggles to generate longer videos with sequences of actions.

Furthermore, Weng et al. [119] introduced ART•V, an efficient framework for autoregressive video generation using diffusion models. ART•V generates frames sequentially, conditioned on previous frames, by focusing on simple, continuous motions between adjacent frames, which helps to avoid the complexity of long-range motion modeling. This approach retains the high-fidelity generation capabilities of pre-trained image diffusion models with minimal modifications and can produce long videos from diverse prompts, such as text and images. To address the common issue of drifting in autoregressive models, ART•V incorporates a masked diffusion model that draws information from reference images rather than relying solely on network predictions, thereby reducing inconsistencies. By conditioning on the initial frame, ART•V enhances global coherence, which is particularly useful for generating long videos. The framework also employs a T2I-Adapter for conditional generation, ensuring high fidelity with minimal changes to the pre-trained model, matching the inference speed of one-shot models, and supporting larger batch sizes during training. In summary, ART•V effectively reduces drifting issues in video generation by incorporating masked diffusion, anchored conditioning, and noise augmentation to better align training with testing. Shi et al. [120] introduced BIVDiff, a training-free video synthesis framework that integrates frame-wise video generation, mixed inversion, and temporal smoothing. This framework bridges the gap between specific image diffusion models (e.g., ControlNet, Instruct Pix2Pix) and general text-to-video diffusion models (e.g., VidRD, ZeroScope). The process begins with frame generation using an image diffusion model, followed by Mixed Inversion to adjust latent distributions, which balances temporal consistency with the open-generation capability of video diffusion models. Finally, video diffusion models are applied for temporal smoothing. This method effectively addresses issues of temporal consistency and task generalization that are common in previous training-free approaches.

Finally, Xing et al. [116] proposed a parameter-efficient video diffusion model called Simple Diffusion Adapter (SimDA), see Fig. 7, which fine-tunes the large T2I model (i.e., Stable Diffusion) for enhanced video generation. SimDA generates videos from textual prompts through efficient one-shot fine-tuning of pre-trained Stable Diffusion models, focusing on a parameter-efficient approach by fine-tuning only 24 million out of the 1.1 billion parameters. The model employs an adapter with two learnable fully connected layers, incorporating spatial adapters to capture appearance transferability and temporal adapters to model temporal information, utilizing GELU activations and depth-wise 3D convolutions. Additionally, SimDA introduces Latent-shift attention (LSA) to replace the original spatial attention, enhancing temporal consistency without adding new parameters. More recently, Qing et al. [121] presented HiGen, a diffusion-based model that improves video generation by decoupling spatial and temporal factors at both the structure and content levels. At the structural level, HiGen splits the T2V task into spatial reasoning, which involves generating spatially coherent priors from text, and temporal reasoning, which creates temporally coherent motions from these priors using a unified denoiser. On the content side, HiGen extracts cues for motion and appearance changes from input videos to guide training, thereby enhancing temporal stability and allowing for flexible content variations. Despite its strengths, HiGen faces challenges in generating detailed objects and accurately modeling complex actions due to computational and data quality limitations.
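
A hedged sketch of the adapter idea behind parameter-efficient approaches such as SimDA: a small bottleneck module is added next to each frozen block of the pre-trained backbone, and only these few weights are trained. The dimensions and placement are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Two learnable fully connected layers with a GELU in between, added as a
    residual branch so the frozen pre-trained path is left untouched."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

def trainable_parameters(backbone: nn.Module, adapters: nn.ModuleList):
    """Only adapter weights are optimised; `backbone` stands for the pre-trained
    diffusion model the adapters are attached to."""
    backbone.requires_grad_(False)
    return adapters.parameters()
```
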
As we have seen in this section, for the generation of video from text the main approaches are the application of T2I techniques together with temporal modules, attention mechanisms, transformers and autoencoders. However, in recent years many researchers have been focusing on diffusion models, which are becoming more and more widely used and are expected to increase in popularity in the coming years.


2. Image-to-Video Synthesis

Generating videos from static images poses significant challenges, particularly in preserving temporal consistency and achieving realistic motion across frames. Despite these difficulties, advancements in image-to-video synthesis have leveraged sophisticated modeling techniques to transform still images into dynamic video sequences. This area has become increasingly important for various applications, ranging from content creation to enhanced video editing tools.

Recent methods in image-to-video synthesis focus on generating high-quality videos by incorporating temporal dynamics into the transformation process. Techniques like temporal modeling and attention mechanisms are employed to ensure smooth transitions between frames, thus maintaining coherence and realism in the generated videos. A noteworthy contribution to this field is the work by Wu et al. [122], which introduces LAMP, a few-shot-based tuning framework for Text-to-video generation, leveraging a first-frame-attention mechanism to transfer information from the initial frame to subsequent ones. This approach, which focuses on fixed motion patterns, is constrained in its ability to generalize across diverse scenarios. LAMP utilizes an off-the-shelf text-to-image model for content generation while emphasizing motion learning through expanded pre-trained 2D convolution layers and modified attention blocks for temporal-spatial motion learning. A first-frame-conditioned pipeline ensures high video quality by retaining the initial frame's content and applying noise to subsequent frames during training. During inference, high-quality first frames generated by SD-XL enhance video performance. Despite its promise, LAMP faces challenges with complex motions and background stability, suggesting areas for future improvement. Guo et al. [123] introduced the I2V-Adapter, a lightweight and plug-and-play solution designed for text-guided Image-to-video generation. The key innovation of this adapter lies in its cross-frame attention mechanism, which preserves the identity of the input image by propagating the unnoised image to subsequent noised frames. This approach ensures compatibility with pretrained Text-to-video models, maintaining their weights unchanged while seamlessly integrating the adapter. By introducing minimal trainable parameters, the I2V-Adapter not only reduces training costs but also ensures smooth compatibility with community-driven models and tools. Moreover, the authors incorporated a Frame Similarity Prior, which provides adjustable control coefficients to balance motion amplitude and video stability, thereby enhancing both the controllability and diversity of the generated videos.

Furthermore, Zhang et al. [124] proposed MoonShot, a video generation model that leverages both image and text as conditional inputs. MoonShot addresses limitations in controlling visual appearance and geometry by employing the Multimodal video block (MVB) as its core component. This module integrates spatial-temporal layers for comprehensive video feature representation and utilizes a decoupled cross-attention layer to condition both image and text inputs effectively. Notably, MoonShot reuses pre-trained weights from text-to-image models, allowing for the integration of pre-trained image ControlNet modules to achieve geometry control without necessitating additional training. The model's architecture, which includes spatial-temporal U-Net layers and decoupled multimodal cross-attention layers, ensures high-quality frame generation and temporal consistency. As a result, MoonShot is versatile, supporting tasks like image animation and video editing without the need for fine-tuning, while also enabling geometry-controlled generation through the effective integration of ControlNet modules. Gong et al. [125] proposed AtomoVideo, a high-fidelity Image-to-video generation framework that transforms product images into engaging promotional videos. AtomoVideo achieves superior motion intensity and consistency compared to existing methods and can also perform Text-to-video generation by combining advanced text-to-image models. The approach involves using a pre-trained T2I model with added temporal convolution and attention modules, training only the temporal layers, and injecting image information at two positions: low-level details via VAE encoding and high-level semantics via CLIP image encoding and cross-attention. Long video frames are predicted iteratively, using initial frames to generate subsequent ones. The framework is trained using Stable Diffusion 1.5 and a 15M internal dataset, employing zero terminal SNR and v-prediction techniques for stability. During inference, classifier-free guidance with image and text prompts significantly enhances the stability of the generated output.
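
Classifier-free guidance with two conditions, as used in several of these image-plus-text models, is usually implemented by combining three denoiser evaluations. The sketch below shows one common formulation; the surveyed models may weight or order the terms differently, and `eps_model` is a placeholder for the denoiser, with `None` standing for the dropped (null) condition.

```python
import torch

def dual_cfg(eps_model, z_t, t, img_cond, txt_cond, s_img=1.5, s_txt=7.5):
    """Combine an unconditional, an image-conditioned and a fully conditioned
    noise prediction into a single guided estimate."""
    e_uncond = eps_model(z_t, t, None, None)
    e_img = eps_model(z_t, t, img_cond, None)
    e_full = eps_model(z_t, t, img_cond, txt_cond)
    return e_uncond + s_img * (e_img - e_uncond) + s_txt * (e_full - e_img)
```

Raising `s_img` pushes the sample toward the reference image, while `s_txt` controls how strongly the text prompt shapes the generated motion and content.
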
Other researchers have explored diffusion models for the creation of videos from images. For example, Shi et al. [126] proposed Motion-I2V, a novel framework for consistent and controllable text-guided image-to-video generation. Unlike previous methods, Motion-I2V factorizes the process into two stages with explicit motion modeling. The first stage involves a diffusion-based motion field predictor to deduce pixel trajectories of the reference image. The second stage introduces motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models, effectively propagating reference image features to synthesized frames guided by predicted trajectories. By training a sparse trajectory ControlNet for the first stage, Motion-I2V enables precise control over motion trajectories and regions, also supporting zero-shot Video-to-video translation. Although Motion-I2V provides fine-grained control of I2V generation through sparse trajectory guidance, region-specific animation and zero-shot Video-to-video translation, it is limited in handling occlusions, brightness uniformity and complex motion.

Expanding on the idea of temporal consistency, Ren et al. [127] proposed ConsistI2V, a diffusion-based method for I2V generation, designed to enhance visual consistency by using spatiotemporal attention over the first frame to maintain spatial and motion coherence. They introduced FrameInit, an inference-time noise initialization strategy that uses the low-frequency band from the first frame to stabilize video generation, which supports applications such as long video generation and camera motion control. The approach leverages cross-frame attention mechanisms and local window temporal layers to achieve fine-grained spatial conditioning and temporal smoothness. ConsistI2V's architecture, based on a U-Net structure adapted with temporal layers, employs a latent diffusion model to generate videos that closely align with the first frame and follow the textual description. To address motion consistency and efficiency, Shen et al. [128] proposed a novel approach to Conditional image-to-video (cI2V) generation by disentangling RGB pixels into spatial content and temporal motions. Using a 3D-UNet diffusion model, they predict temporal motions, including motion vectors and residuals, to improve consistency and efficiency. The approach begins with Decouple-Based Video Generation (D-VDM) to predict differences between consecutive frames and is further refined with Efficient Decouple-Based Video Generation (ED-VDM), which separates content and temporal information using motion vectors and residuals extracted via a codec. The model employs Gaussian noise and a diffusion model to learn the video distribution score and generate a video clip from the initial frame and text condition. The approach includes a Decoupled Video Diffusion Model using DDPM to estimate video distribution scores and a ResNet bottleneck module to encode the first frame, improving spatial and temporal representation alignment. Efficient representation is achieved using I-frames and P-frames, with compression via a Latent Diffusion autoencoder, optimizing video generation through a learned joint distribution of motion vectors and residuals.


Fig. 8. Animate Anyone [129] architecture: a denoising U-Net with spatial, cross and temporal attention that iteratively denoises a video clip, guided by a ReferenceNet fed with the VAE-encoded reference image, a CLIP encoder, and a Pose Guider driven by the pose sequence.

Maintaining temporal coherence while preserving detailed information about the characters is difficult in image-to-video synthesis for character animation. Hu et al. [129] proposed a novel framework using diffusion models for character animation (see Fig. 8), addressing the challenges of maintaining temporal consistency with detailed character information in image-to-video synthesis. They designed ReferenceNet to merge intricate appearance features from a reference image via spatial attention, and introduced a Pose Guider to ensure controllability and continuity in character movements, along with an effective temporal modeling approach for smooth inter-frame transitions. The method extends Stable Diffusion (SD) by reducing computational complexity through latent space modeling and includes an autoencoder. The network architecture includes ReferenceNet for appearance feature extraction, a Pose Guider for motion control, and a temporal layer for continuity of motion. The training strategy consists of two stages: first, training on individual video frames without the temporal layer, and second, introducing and training the temporal layer using a 24-frame video clip. Despite its advancements, the model faces limitations in generating stable hand movements, handling unseen parts during character movement, and operational efficiency due to DDPM. Moreover, Xu et al. [130] proposed MagicAnimate, a novel diffusion-based human image animation framework that integrates temporal consistency modeling, precise appearance encoding, and temporal video fusion to synthesize temporally consistent human animation of arbitrary length. They address the challenges of existing methods, which struggle with maintaining temporal consistency and preserving reference identity, by developing a video diffusion model that encodes temporal information with temporal attention blocks and an innovative appearance encoder that retains intricate details of the reference image. MagicAnimate employs a simple video fusion technique to ensure smooth transitions in long animations by averaging overlapping frames. The framework processes animations segment-by-segment to manage memory constraints while leveraging a sliding window method to improve transition smoothness and consistency across segments. This comprehensive approach enables MagicAnimate to produce high-fidelity, temporally consistent animations that faithfully preserve the appearance of the reference image throughout the entire video.

In cases where no motion clue is provided, videos are generated stochastically, constrained solely by the spatial information in the input image. Dorkenwald et al. [131] proposed an approach to I2V synthesis by framing it as an invertible domain transfer problem implemented through a Conditional invertible neural network (cINN). To bridge the domain gap between images and videos, they introduced a probabilistic residual representation, ensuring that only complementary information to the initial image is captured. The method allows sampling and synthesizing novel future video progressions from the same start frame. They utilized a separate conditional variational encoder-decoder to compute a compact video representation, facilitating the learning process. Their model captures the interplay between images and videos, explaining video dynamics with a single image and residual information, and supports controlled video synthesis by incorporating additional factors such as motion direction. However, this kind of stochastic video generation can only handle short dynamic patterns in the distribution. Ni et al. [132] proposed a method for cI2V generation that synthesizes videos from a single image and a given condition, such as an action label. They introduced Latent flow diffusion models (LFDM), which generate an optical flow sequence in the latent space to warp the initial image, thereby improving the preservation of spatial details and motion continuity. The method involves a two-stage training process: an unsupervised Latent flow auto-encoder (LFAE) to estimate latent optical flow between video frames, and a conditional 3D U-Net-based Diffusion model (DM) to produce temporally-coherent latent flow sequences based on the image and condition. During inference, the image is encoded to a latent map, the condition to an embedding, and the trained DM generates latent flow and occlusion map sequences. These sequences warp the latent map to create a new latent map sequence, which is then decoded into video frames. The proposed method, with its decoupled training strategy and efficient operation in a low-dimensional latent flow space, reduces computational cost and complexity while ensuring easy adaptation to new domains.
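
The core operation in flow-based generation of this kind is warping a latent map with a predicted flow field. The sketch below shows that step in isolation (occlusion handling and the diffusion-based flow predictor are omitted, and it is not the LFDM code).

```python
import torch
import torch.nn.functional as F

def warp_latent(z: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a latent map with a per-location displacement field.

    z:    (B, C, H, W) latent of the initial frame
    flow: (B, 2, H, W) displacements (dx, dy) in latent-grid units
    """
    b, _, h, w = z.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(z.device)  # (2, H, W) sampling coordinates
    coords = base.unsqueeze(0) + flow                          # where each output location samples from
    coords[:, 0] = 2 * coords[:, 0] / (w - 1) - 1              # normalise x to [-1, 1]
    coords[:, 1] = 2 * coords[:, 1] / (h - 1) - 1              # normalise y to [-1, 1]
    grid = coords.permute(0, 2, 3, 1)                          # (B, H, W, 2) as grid_sample expects
    return F.grid_sample(z, grid, align_corners=True)
```

Applying this warp to the encoded first frame with each flow map in the generated sequence, and then decoding, yields the video frames.
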
Wang et al. [133] proposed a high-fidelity image-to-video generation method, named DreamVideo, which addresses issues of low fidelity and flickering in existing methods by employing a frame retention branch in a pre-trained video diffusion model. The approach preserves image details by perceiving the reference image through convolution layers and integrating these features with noisy latents. The model incorporates double-condition classifier-free guidance, allowing a single image to generate videos of different actions through varying prompts, enhancing controllable video generation. DreamVideo's architecture includes a primary T2V model and an Image Retention block that infuses image control signals into the U-Net structure. During inference, the model combines text and image inputs to generate contextually consistent videos using CLIP text embeddings and a U-Net-based generative process. Additionally, the Two-Stage Inference method extends video length and creates varied content by using the final frame of one video as the initial frame for the next, showcasing the model's strong image retention and video generation capabilities.


Zhang et al. [134] proposed I2VGen-XL, a method utilizing two stages of cascaded diffusion models to achieve high semantic consistency and spatiotemporal continuity in video synthesis. The approach addresses challenges in semantic accuracy, clarity, and continuity by decoupling semantic and qualitative factors, using static images as guidance. The base stage ensures semantic coherence and preserves content at low resolution with two hierarchical encoders: a fixed CLIP encoder for high-level semantics and a learnable content encoder for low-level details. The refinement stage enhances video resolution and refines details using a brief text input and a separate video diffusion model. Training involves initializing the base model with pre-trained SD2.1 parameters and moderated updates, while the refinement model undergoes high-resolution training and fine-tuning on high-quality videos. Inference employs a noising-denoising process and DDIM/DPM-solver++ to generate high-resolution videos from low-resolution outputs.

To create more controllable videos, various motion cues like predefined directions and action labels are used. Blattmann et al. [135] proposed an approach for generating videos from static images by learning natural object dynamics through local pixel manipulations. Their generative model learns from videos of moving objects without needing explicit information about physical manipulations and infers object dynamics in response to user interactions, understanding the relationships between different object parts. The goal is to predict object deformation over time from a static image and a local pixel shift, using two encoding functions: an object encoder for the current object state and an interaction encoder for the pixel shift. They utilize a hierarchical recurrent model to understand complex object dynamics, predicting a sequence of object states in response to the pixel shift. Object dynamics are modeled using a flexible prediction function based on Recurrent Neural Networks (RNNs), with higher-order dynamics captured by introducing a hierarchy of RNN predictors operating on different spatial scales. The decoder generates individual image frames from the predicted object states using a hierarchical image-to-sequence UNet structure. Instead of ground-truth interactions, dense optical flow displacement maps are used to simulate training pokes, minimizing the perceptual distance between predicted and actual video frames. Training involves pretraining the encoders and decoder to reconstruct image frames, then refining the model to predict object states and synthesize video sequences. Their interactive I2V synthesis model allows users to specify the desired motion through the manual poking of a pixel.

In addition, Menapace et al. [136] proposed a novel framework for the Playable video generation (PVG) task, which generates videos from the first frame and a sequence of discrete actions. While the PVG task reduces user burden by not requiring detailed motion information, it struggles with generating videos involving complex motions. An unsupervised learning approach is adopted that allows users to control video generation by selecting discrete actions at each time step, similar to video games. The framework, named Clustering for Action Decomposition and DiscoverY (CADDY), learns semantically consistent actions and generates realistic videos based on user input using a self-supervised encoder-decoder architecture driven by a reconstruction loss on the generated video. CADDY discovers distinct actions via clustering during the generation process, employing an encoder-decoder with a discrete bottleneck layer to capture frame transitions without needing action label supervision or a predefined number of actions. The action network estimates action label posterior distributions by decomposing actions into discrete labels and continuous components, ensuring meaningful action labels by preventing direct encoding of environment changes in the variability embeddings.

The generation of dynamic videos from static images presents a trend very similar to that of the previous section, Text-to-Video Synthesis, where we can see how attention mechanisms, autoencoders and diffusion models stand out. As we can see, GANs are not as frequent as in synthetic image generation. This approach to video generation can raise more ethical concerns than the previous one, as it can use images of real people and generate videos that can potentially harm them, whereas in the previous section the content is generated completely from scratch.

3. Video-to-Video Synthesis

Video-to-video (V2V) synthesis is an advanced field focused on generating realistic video sequences by transforming or translating visual information from one video domain to another. The main goal is to create high-quality, temporally consistent videos that adhere to specific input conditions, such as text, pose, style, or semantic maps. Recent advancements in this area have introduced several techniques to enhance the quality, efficiency, and consistency of video synthesis, thus pushing the boundaries of what is possible in video generation.

Wang et al. [137] proposed a three-stage framework for human pose transfer in videos, focusing on transferring dance poses from a source person in one video to a target person in another. The process begins with the extraction of frames and pose masks from both source and target videos. Subsequently, a model synthesizes frames of the target person in the desired dance pose, followed by a refinement phase to enhance the quality of these frames. The model comprises several key components, including pose extraction and normalization, GAN-based synthesis using a Cross-domain correspondence network (CoCosNet), and a coarse-to-fine strategy with two GANs for detailed face reconstruction and smooth frame sequences. Their approach involves visualizing keypoints to create pose skeleton labels, adjusting for differences in body proportions, learning the translation from the pose domain to the image domain, and matching features for coherent synthesis. Although their method outperforms existing approaches, it still encounters challenges with large pose variations and domain generalization, which suggests potential areas for future improvement. Furthermore, Zhuo et al. [138] introduced Fast-Vid2Vid, a spatial-temporal compression framework designed to reduce computational costs and accelerate inference in Video-to-Video synthesis (Vid2Vid). While traditional Vid2Vid generates photorealistic videos from semantic maps, it suffers from high computational costs due to the network architecture and sequential data streams. Zhuo et al. addressed this by introducing Motion-aware inference (MAI) to compress the input data stream without altering network parameters and developing Spatial-temporal knowledge distillation (STKD) to transfer knowledge from a high-resolution teacher model to a low-resolution student model. Their approach incorporates Spatial knowledge distillation (Spatial KD) for generating high-resolution frames from low-resolution inputs and Temporal knowledge distillation (Temporal KD) to maintain temporal coherence in sparse video sequences. Additionally, they utilize a part-time student generator for sparse frame synthesis and a fast motion compensation method for interpolating intermediate frames, thereby reducing computational load while maintaining visual quality.
on user input using a self-supervised encoder-decoder architecture Further advancing the field, Yang et al. [139] introduced a zero-shot
driven by a reconstruction loss on the generated video. CADDY text-guided video-to-video translation framework that adapts image

- 189 -
International Journal of Interactive Multimedia and Artificial Intelligence, Vol. 9, Nº1

models for video applications. This framework is composed of key Adding to the discussion of temporal consistency, Liang et al.
frame translation and full video translation. Key frames are generated [144] introduced FlowVid, a V2V synthesis framework that ensures
using an adapted diffusion model with hierarchical cross-frame temporal consistency across frames by leveraging spatial conditions
constraints to ensure coherence in shapes, textures, and colors. These and temporal optical flow clues from the source video. Unlike previous
frames are then propagated to the rest of the video using temporal- methods, FlowVid uses optical flow as a supplementary reference to
aware patch matching and frame blending, achieving both global style handle imperfections in flow estimation. The model warps optical
and local texture temporal consistency without requiring re-training flow from the first frame and uses it in a diffusion model, enabling the
or optimization. A key innovation of this approach is the use of optical propagation of edits made to the first frame throughout subsequent
flow for dense cross-frame constraints, ensuring consistency across frames. FlowVid extends the U-Net architecture to include a temporal
different stages of diffusion sampling. However, the method’s reliance dimension and is trained using joint spatial-temporal conditions, such
on accurate optical flow can lead to artifacts if the flow is incorrect, as depth maps and flow-warped videos, to maintain frame consistency.
and significant appearance changes may disrupt temporal consistency, During generation, the model edits the first frame with prevalent
limiting the ability to create unseen content without user intervention. Image-to-image (I2I) models and propagates these edits using a trained
Following the trend of previous researchers but focusing on zero- model, incorporating global color calibration and self-attention feature
shot techniques, Wang et al. [140] presented vid2vid-zero, a zero- integration to preserve structure and motion, thus achieving effective
shot video editing method that leverages pre-trained image diffusion video synthesis with high temporal consistency. In a similar pursuit
models without requiring video-specific training. Their method of enhancing temporal coherence, Wu et al. [145] proposed Fairy, a
introduces a null-text inversion module for text-to-video alignment, a minimalist yet robust adaptation of image-editing diffusion models
cross-frame modeling module for temporal consistency, and a spatial for video editing. Fairy improves temporal consistency and synthesis
regularization module to preserve the fidelity of the original video. fidelity through anchor-based cross-frame attention, which propagates
Vid2vid-zero addresses the issue of flickering in frame-wise image diffusion features across frames. To handle affine transformations,
editing by ensuring temporal consistency through a Spatial-temporal Fairy employs a unique data augmentation strategy, enhancing the
attention (ST-Attn) mechanism, which balances bi-directional temporal model’s equivariance and consistency. The anchor-based model
information and spatial alignment using pre-trained diffusion models. samples K anchor frames to extract and propagate diffusion features,
While effective in video editing tasks, the method’s reliance on pre- ensuring consistency by aligning similar semantic regions across
trained image models limits its capacity to edit actions in videos frames. While Fairy excels in maintaining temporal consistency, its
due to the absence of temporal and motion priors. Expanding on the strong focus on this aspect reduces its accuracy in rendering dynamic
idea of zero-shot video editing, Qi et al. [141] proposed FateZero, a visual effects, such as lightning or flames.
zero-shot text-based editing method for real-world videos that does Lastly, several other methods offer significant contributions to the
not require per-prompt training or user-specific masks. To achieve video-to-video synthesis domain. Ku et al. [146] proposed AnyV2V, see
consistent video editing, FateZero utilizes techniques based on pre- Fig. 9, a training-free video editing framework that simplifies video
trained models, capturing intermediate attention maps during DDIM editing into two steps: editing the first frame with any image editing
inversion to retain structural and motion information and fusing these model and using an image-to-video generation model to create the
maps during editing. A blending mask, derived from cross-attention edited video through temporal feature injection. AnyV2V is compatible
features, minimizes semantic leakage, while the reformed self-attention with various image editing tools, allowing for diverse edits such as style
mechanism in the denoising UNet enhances frame consistency. Despite transfer, subject-driven editing, and identity manipulation, without
its impressive performance, FateZero faces challenges in generating the need for fine-tuning. The framework uses DDIM inversion for
entirely new motions or significantly altering shapes. structural guidance and feature injection to maintain consistency in
Other authors have opted for the use of diffusion models, due to their appearance and motion, enabling accurate and flexible video editing.
performance in similar tasks. Molad et al. [142] proposed Dreamix, a Additionally, it supports long video editing by handling videos beyond
text-driven video editing method that uses a text-conditioned video the training frame lengths of current I2V models, outperforming
diffusion model (VDM). Dreamix preserves the original video’s fidelity existing methods in user evaluations and standard metrics. Ouyang
by initializing with a degraded version of the input video and then fine- et al. [147] introduced I2VEdit, a video editing solution designed to
tuning the model. This mixed fine-tuning technique enhances motion extend the capabilities of image editing tools to videos. This approach
editability by incorporating individual frames with masked temporal achieves this by propagating single-frame edits throughout an entire
attention. Dreamix achieves text-guided video editing by inverting video using a pre-trained Image-to-video model. Notably, I2VEdit
corruptions, downsampling the input video, corrupting it with noise, adapts to the extent of edits, preserving visual and motion integrity
and then upscaling it using cascaded diffusion models aligned with while handling various types of edits, including global, local, and
the text prompt. This approach effectively preserves low-resolution moderate shape changes. The method’s core processes, coarse motion
details while synthesizing high-resolution outputs. Focusing on extraction and appearance refinement, play crucial roles in ensuring
motion guidance, Hu et al. [143] introduced VideoControlNet, a consistency. Coarse motion extraction captures basic motion patterns
motion-guided video-to-video translation framework using a diffusion through a motion LoRA and employs skip-interval cross-attention to
model with ControlNet. Inspired by video codecs, VideoControlNet mitigate quality degradation in long videos.
leverages motion information to maintain content consistency and Meanwhile, appearance refinement uses fine-grained attention
prevent redundant regeneration. The first frame (I-frame) is generated matching for precise adjustments and incorporates Smooth area random
using the diffusion model with ControlNet, mirroring the structure of perturbation (SARP) to enhance inversion sampling. To achieve its
the input frame. Key frames (P-frames) are then generated using the results, I2VEdit segments the source video into clips, processes each clip
motion-guided P-frame generation (MgPG) module, which employs for motion and appearance consistency, and refines appearances using
motion information for consistency and inpaints occluded areas using EDM [99] inversion and attention matching. Building on this, Ouyang
the diffusion model. The remaining frames (B-frames) are efficiently et al. [148] further proposed Content deformation field (CoDeF), a
interpolated using the motion-guided B-frame interpolation (MgBI) novel video representation, emphasizing its application in Video-to-
module. This framework produces high-quality, consistent videos by video translation. CoDeF introduces a canonical content field for static
utilizing advanced inpainting methods alongside motion information. content aggregation and a temporal deformation field for recording

- 190 -
Regular Issue

I2V Model (Frozen)

Original DDIM
video Inversion Original
denoised

First frame CNN features

Spatial-attention

Temporal-attention

Black-Box Image
Editing model

Condition Edited Edited Video


Prompt first frame denoised
Face Image
Subject Image
Style Image
... I2V Model (Frozen)

Fig. 9. AnyV2V [146] framework.

frame transformations. This approach optimizes the reconstruction technique is the diffusion model, as we have seen in other sections of
of videos while preserving essential semantic details, such as object this survey. This is in line with expectations due to all the attention
shapes. In the context of Video-to-video translation, CoDeF employs they are receiving in recent years. However, we can also see that other
ControlNet on the canonical image, which significantly enhances methodologies such as GANs or attention mechanisms are also used.
temporal consistency and texture quality compared to state-of-the- We have also noted that several papers use a zero-shot approach to
art zero-shot video translations using generative models. By avoiding address the problem.
the need for time-intensive inference models, this process becomes
more efficient. The canonical image, optimized through CoDeF, serves
4. Text-Image-to-Video Synthesis
as a basis for applying image algorithms, ensuring consistent effect Text-image-video synthesis (TI2V) is a growing field of research
propagation across the entire video via the temporal deformation field. focused on generating dynamic video content from static images
and text descriptions. Given a single image I and text prompt T, text-
A different approach to video editing with VideoSwap was presented
image-to-video generation aims to synthesize I new frames to yield
by Gu et al. [149], focusing on customized video subject swapping.
a realistic video, I = 〈I0, I1, ..., IM〉 y starting from the given frame I0 and
Unlike methods relying on dense correspondences, VideoSwap utilizes
satisfying the text description T . This field aims to bridge the gap
semantic point correspondences, allowing the replacement of the
between different modalities to create coherent and contextually
main subject in a video with a target subject of a different shape and
accurate videos. Several approaches have been developed to address
identity, all while preserving the original background. The approach
the challenges in this domain, ranging from aligning visual and
includes encoding the source video, applying DDIM inversion, and
textual information to ensuring temporal consistency and control
using semantic points to guide the subject’s motion trajectory. The
over generated content. Hu et al. [151] proposed a novel video
process also involves extracting and embedding semantic points,
generation task called Text-Image-to-Video (TI2V) generation, which
registering these points for motion guidance, and enabling user
creates videos from a static image and a text description, focusing
interactions to refine motion and shape alignment. Recently, Bai et
on controllable appearance and motion. They introduced the Motion
al. [150] proposed UniEdit, a tuning-free framework for video motion
Anchor-based video GEnerator (MAGE) to address key challenges
and appearance editing. This framework leverages a pre-trained Text-
such as aligning appearance and motion from different modalities and
to-video generator in an inversion-then-generation pipeline. UniEdit
handling text description uncertainties. MAGE uses a Motion anchor
addresses content preservation by using temporal and spatial self-
(MA) structure to store aligned appearance-motion representations
attention layers to encode inter-frame and intra-frame dependencies.
and incorporates explicit conditions and implicit randomness to
Additionally, it introduces auxiliary reconstruction and motion-
enhance diversity and control. The framework employs a VQ-VAE
reference branches to inject the desired source and motion features
encoder-decoder architecture for visual token representation and uses
into the main editing path. For content preservation, the auxiliary
three-dimensional axial transformers to recursively generate frames.
reconstruction branch injects attention features into the spatial self-
Training involves a supervised learning approach to approximate the
attention layers. Motion injection, on the other hand, is achieved
conditional distribution of video frames based on the initial image
by guiding the main path with a motion-reference branch during
and text. The motion anchor aligns text-described motion with visual
denoising, utilizing temporal attention maps for alignment with the
features, ensuring consistent and diverse video output through auto-
target prompt. In appearance editing, UniEdit maintains structural
regressive frame generation.
consistency by implementing spatial structure control while omitting
the motion-reference branch. Despite its robust capabilities, UniEdit Complementing this, Guo et al. [152] proposed AnimateDiff, a
faces challenges, particularly when addressing motion and appearance practical framework for animating personalized T2I models without
editing simultaneously. requiring model-specific tuning. The core of the framework is a
plug-and-play motion module, trained to learn transferable motion
In this section we have analyzed the latest research related to video-
priors from real-world videos, which can be integrated into any
to-video synthesis. Within this field we have seen how the most used

- 191 -
International Journal of Interactive Multimedia and Artificial Intelligence, Vol. 9, Nº1

personalized T2I model. The training process involves three stages: Recently, Ni et al. [157] proposed TI2V-Zero, a zero-shot, tuning-free
fine-tuning a domain adapter to align with the target video dataset, method for text-conditioned Image-to-video (TI2V) generation that
introducing and optimizing a motion module for motion modeling, leverages a pretrained T2V diffusion model. This approach avoids
and using MotionLoRA, a lightweight fine-tuning technique, to costly training, fine-tuning, or additional modules by using a "repeat-
adapt the pre-trained motion module to new motion patterns with and-slide" strategy to condition video generation on a provided image,
minimal data and training cost. AnimateDiff effectively addresses ensuring temporal continuity through a DDPM inversion strategy and
the problem of animating personalized T2Is while preserving their resampling techniques. The method uses a 3D-UNet-based denoising
visual quality and domain knowledge, demonstrating the adequacy of network and modulates the reverse denoising process to generate
Transformer architecture for modeling motion priors and offering an videos frame-by-frame, preserving visual coherence and consistency,
efficient solution for users who desire specific motion effects without thus enabling the synthesis of long videos while maintaining high
bearing the high costs of pre-training. In contrast, Yin et al. [153] visual quality.
proposed NUWA-XL, a novel "Diffusion over Diffusion" architecture In this section where we have analyzed the techniques to generate
for generating extremely long videos. Unlike traditional methods that videos from static images and textual descriptions, we have seen again
generate videos sequentially, leading to inefficiencies and a training- a main focus, which are the diffusion models, i.e. a trend is observed,
inference gap, NUWA-XL uses a "coarse-to-fine" process where a which seems to show that it will be the most used technique in the
global diffusion model generates keyframes and local models fill in coming years. In addition, we also continue to observe other approaches
between, allowing parallel generation. The architecture incorporates such as attention mechanisms or autoencoders. The greatest danger of
Temporal KLVAE to compress videos into low-dimensional latent this set of techniques, like the previous one, is that they can use images
representations and Mask temporal diffusion (MTD) to handle both of people to create complete videos, which can cause serious damage.
global and local diffusion processes using masked frames. Although However, not all applications of these techniques are negative.
NUWA-XL is currently validated on cartoon data due to the lack of
open-domain long video datasets, it shows promise in overcoming 5. Multi-Modal Video Generation
data challenges and improving efficiency, albeit requiring substantial Multi-Modal Video Generation (MMVG) refers to a versatile field in
GPU resources for parallel inference. which video content is synthesized based on different forms of input,
Esser et al. [154] proposed a structure and content-guided video such as text, images, or existing videos. Although models like Sora and
diffusion model that edits videos based on user descriptions. They Genie can accept various types of input, they typically process one
resolved conflicts between content and structure by training on modality at a time—either generating videos from text descriptions,
monocular depth estimates with varying detail levels and introduced animating static images, or transforming existing video footage. These
a novel guidance method for temporal consistency through joint approaches leverages the strengths of different data modalities to
video and image training. The approach extends latent diffusion produce highly realistic and contextually coherent videos. The core
models to video by incorporating temporal layers into a pre-trained objective of MMVG is to create coherent, high-fidelity, temporal
image model, adding 1D convolutions and self-attentions to residual consistent videos by leveraging the strengths of each input type.
and transformer blocks. The encoder downsamples images to a latent Recent advancements in this field have led to the development of
code, improving efficiency, while depth maps and CLIP embeddings sophisticated models capable of interpreting and synthesizing complex
are used for structure and content conditioning, respectively. This scenes by concurrently analyzing textual descriptions, visual cues,
approach allows full control over temporal, content, and structure and pre-existing video footage. These models push the boundaries of
consistency without requiring per-video training or pre-processing, video generation, offering versatile applications in content creation,
showing improved temporal stability and user preference over entertainment, and beyond.
related methods. Expanding on the concept of control, Yin et al. More recently, OpenAI [6] introduced Sora, a diffusion model that
[155] proposed DragNUWA, an open-domain diffusion-based video represents a significant advancement in T2V generation by training
generation model that integrates text, image, and trajectory inputs a model from scratch rather than fine-tuning pre-trained models.
to provide fine-grained control over video content from semantic, Drawing from transformer architecture scalability, Sora replaces the
spatial, and temporal perspectives. They address the limitations conventional U-Net with a transformer-based structure, effectively
of current methods, which focus on only one type of control managing large-scale video data for complex generative tasks. Sora can
and struggle with complex trajectory handling, by introducing generate high-fidelity videos up to a minute long, maintaining visual
advanced trajectory modeling techniques: a Trajectory sampler quality and narrative consistency across multiple shots. It leverages
(TS) for arbitrary trajectories, Multiscale fusion (MF) for controlling a patch-based approach, turning visual data into spacetime patches,
trajectories at different granularities, and an Adaptive training which enhances its ability to handle videos and images of varying
(AT) strategy for generating consistent videos. DragNUWA can durations, resolutions, and aspect ratios. Sora excels in linguistic
generate realistic and contextually consistent videos by leveraging comprehension, accurately following detailed prompts to generate
the combined inputs of text, images, and trajectories during both coherent video content. However, it faces challenges in rendering
training and inference. realistic interactions and comprehending complex scenes with
Further enhancing controllability, Wang et al. [156] proposed multiple active elements. Despite these limitations, Sora’s capabilities
VideoComposer, a system for enhancing controllability in video in video-to-video editing, image animation, and extending generated
synthesis through the use of temporal conditions like motion vectors. videos mark a significant step toward building general-purpose
They introduced a Spatio-temporal condition encoder (STC-encoder) simulators of the physical world. Bruce et al. [7] introduced Genie, a
to integrate spatial and temporal dependencies, ensuring inter-frame generative interactive environment model trained unsupervised from
consistency. The system decomposes videos into textual, spatial, and unlabelled Internet videos. Genie uses spatiotemporal transformers,
temporal conditions, and uses a latent diffusion model to recompose a novel video tokenizer, and a causal action model to create diverse,
videos based on these inputs. Textual conditions provide coarse- action-controllable virtual worlds from various inputs such as text,
grained visual content, while spatial conditions offer structural and images, and sketches. It generates video frames autoregressively,
stylistic guidance. Temporal conditions, including motion vectors enabling interaction on a frame-by-frame basis without ground-truth
and depth sequences, allow detailed control of temporal dynamics. action labels.

- 192 -
Regular Issue

TABLE IV. Comprehensive Overview of a Few Synthetic Video Generation Techniques

Models Year Technique Target Outcome Data Used Open Source


Make-A-Video [91] 2023 Transformer-based Text-to-video synthesis Various No
Video Diffusion [89] 2023 Diffusion-based High-quality video synthesis Video datasets No
VideoPoet [5] 2023 Transformer-based Generate poetic video narratives Web-collected dataset No
Godiva [104] 2023 GAN-based Generate dynamic video content High-resolution video datasets No
CogVideo [106] 2023 Transformer-based Extend CogView into video Diverse text and video datasets Yes
NUWA [107] 2023 Transformer-based Synthesize coherent video clips Diverse content from web datasets No
NUWA-Infinity [108] 2023 Transformer-based Generate endless video streams Extended NUWA dataset No
VideoGPT [109] 2023 GPT-based Utilize GPT architecture Various video datasets Yes
Video LDMs [110] 2024 Latent Diffusion Models Implement latent space techniques Various No
Text-to-Video (T2V)
2023 Transformer-based Synthesize video from static images Diverse image and video datasets No
[158]
ModelScope Text-to- Large-scale web-collected video
2024 Transformer-based Scalable text-to-video model Yes
Video [113] datasets
W.A.L.T [114] 2023 Diffusion Models Enhance video synthesis Various No
C-ViViT [115] 2023 VAE-based Create detailed videos from categories Category-labeled video datasets No
Text2Video-Zero [118] 2023 Zero-Shot Learning Generate videos without explicit training General video datasets Yes
ART•V [119] 2024 AI Rendered Textures Artistic video creation Artistic style datasets No
BIVDiff [120] 2023 Bi-directional Diffusion Bidirectional control over video generation Various Yes
Simple Diffusion
2024 Diffusion Models Simplify diffusion processes Various Yes
Adapter [116]
HiGen [121] 2024 Hierarchical Generation Layered approach to video scenes Multi-layer video datasets Yes

TABLE V. Overview of Techniques for Detecting AI-Generated Videos

Authors Year Technique Target Outcome Data Used Open Source


Synthetic video detection by
Vahdati et al. [159] 2024 Detect Al-generated synthetic videos Synth-vid-detect No
forensic trace analysis
He et al. [160] 2024 Temporal defects analysis Identify temporal defects in Al-generated videos ExposingAI-Video No
Detail Mamba for spatial-
Chen et al. [162] 2024 Enhance detection of Al-generated videos GenVideo Yes
temporal artifacts detection
Detect Al-generated videos using motion
Bai et al. [163] 2024 Spatio-temporal CNN analysis GVD Yes
discrepancies
Ma et al. [164] 2024 Temporal artifact focus Focus on temporal artifacts in video detection GVF Yes
Integrate motion and visual appearance for fake
Ji et al. [165] 2024 Dual-Branch 3D Transformer GenVidDet No
video detection
Diffusion-generated video Capture spatial and temporal features in RGB
Liu et al. [167] 2024 TOINR No
detection frames and DIRE values

As we can see, this section, multimodal video generation, is the least are markedly different from those produced by image generators. This
explored of all the approaches analyzed, see Table IV, and possibly the issue is not due to the degradation effects of H.264 compression but
most complex, since we not only have to generate the visual part of the rather to the distinct characteristics of video generation. Therefore,
videos, but also the audio. In addition, we must ensure that both are their findings underscore the urgent need for detection methods
matched and do not generate easily detectable artifacts. The techniques tailored specifically to synthetic video content. Table V provides an
analyzed in this field are diffusion models and transformers. Possibly overview of the techniques used for detecting AI-generated videos,
this area will be explored in more detail in the coming years. highlighting key approaches and their application to various datasets.
Despite the growing concerns, research into detecting synthetic
B. Detection of AI-Generated Videos videos has been relatively limited. Video generation technology is still
In the rapidly evolving landscape of Generative AI (Gen AI), in its early stages compared to image generation, and as a result, fewer
significant progress has been made in developing techniques to detect detection methods are available. However, recent efforts have started
AI-generated synthetic images. Given that a video can be viewed as a to address this gap (see Fig. 10).
sequence of images, one might reasonably expect that synthetic image One early approach comes from, He et al. [160] who proposed a novel
detectors would also be effective at identifying AI-generated synthetic detection method for identifying AI-generated videos by analyzing
videos. Surprisingly, Vahdati et al. [159] reveal that current synthetic temporal defects at both local and global levels. The method is based
image detectors fail to reliably detect synthetic videos. Their study on the assumption that AI-generated videos exhibit different temporal
demonstrates that the forensic traces left by synthetic video generators dependencies compared to real videos due to their distinct capturing

- 193 -
International Journal of Interactive Multimedia and Artificial Intelligence, Vol. 9, Nº1

Local and global defects He et al. [160]


Temporal
Artifacts
Temporal artifacts,
DeCoF [164]
excluding spatial artifacts

Al-Generated Videos Detection


Spatial-temporal
DeMamba [162]
inconsistencies
Spatial and
Temporal Artifacts Spatial and temporal features DIVID [167]

Spatial anomalies and


AIGVDet [163]
temporal inconsistencies

Motion Motion with


DuB3D [165]
Discrepancies visual appearance

Fig. 10. AI-Generated videos detection methods overview.

and generation processes. Real videos, which are captured by cameras, ResNet sub-detectors to identify anomalies in the spatial and optical
have high temporal redundancy, whereas AI-generated videos control flow domains. The spatial detector examines the abnormality of
frame continuity in the latent space, leading to defects at different spatial pixel distributions within single RGB frames, while the optical
spatio-temporal scales. To address local motion information, the flow detector captures temporal inconsistencies via optical flow. The
method uses a frame predictor trained on real videos to measure inter- model uses RGB frames and optical flow maps as inputs, with the
frame motion predictability. Fake videos show larger prediction errors two-branch ResNet50 encoder detecting abnormalities and a decision-
because they have less temporal redundancy. Temporal aggregation is level fusion binary classifier combining this information for the final
employed to maintain long-range information and reduce the impact prediction. AIGVDet effectively leverages motion discrepancies for
of diverse spatio-temporal details. The aggregated error map is then comprehensive spatio-temporal analysis to detect AI-generated
processed by a 2D encoder to obtain local motion features. For global videos. Ma et al. [164] found that detectors based on spatial artifacts
appearance variation, the method extracts visual features using a lack generalizability. Hence, they proposed DeCoF, a detection model
pre-trained BEiT v2 [161] image encoder. These features are fed into that focuses on temporal artifacts and eliminates the impact of spatial
a transformer to model temporal variations, identifying abnormal artifacts during feature learning. DeCoF is the first method to use
appearance changes across frames. Finally, a channel attention-based temporal artifacts by decoupling them from spatial artifacts, mapping
fusion module combines the local motion and global appearance video frames to a feature space where inter-feature distance is inversely
features to enhance detection reliability. This module adjusts channel correlated with image similarity, and detecting anomalies from inter-
significance to extract more generalized forensic clues. frame inconsistency. The method reduces computational complexity
Furthermore, Chen et al. [162] proposed a plug-and-play module and memory requirements, needing only to learn anomalies between
named Detail Mamba (DeMamba), designed to enhance the detection features. However, DeCoF may experience significant performance
of AI-generated videos by identifying spatial and temporal artifacts. degradation or be inapplicable in the face of tampered video, such as
DeMamba builds upon the Mamba framework to explore both Deepfake and malicious editing.
local and global spatial-temporal inconsistencies, addressing the Traditional video detection models often overlook specific
limitation of models that consider only one aspect, either spatial or characteristics of downstream tasks, particularly in fake video
temporal. Using vision encoders like CLIP and XCLIP, it encodes detection where motion discrepancies between real and generated
video frames into a sequence of features, groups them spatially, and videos are significant, as generators tend to excel in appearance
applies the DeMamba module to model intra-group consistency. modeling but struggle with accurate motion representation. Ji et al.
Aggregated features from different groups help determine video [165] proposed the Dual-Branch 3D Transformer (DuB3D) to address
authenticity. The DeMamba module introduces a novel approach to this issue by integrating motion information with visual appearance
spatial consolidation by splitting features into zones along height using a dual-branch architecture that fuses raw spatio-temporal data
and width, performing a 3D scan for spatial-temporal input. Unlike and optical flow. The spatial-temporal branch processes original
previous mechanisms, DeMamba’s continuous scan aligns spatial frames to capture spatial-temporal information and identify anomalies,
tokens sequentially, enhancing the model’s ability to capture complex while the optical flow branch uses GMFlow [166] to estimate and
relationships. For classification, DeMamba averages input features to capture motion information, and these features are combined using
obtain global features and pools processed features into local features, a Multi-layer perceptron (MLP) for classification. Built on the Video
concatenating them with the global ones for classification via a simple Swin Transformer backbone, DuB3D effectively enhances fake video
MLP, ensuring robust video authenticity detection. detection by emphasizing motion modeling and demonstrating strong
Based on the assumption that low-quality videos show abnormal generalization across various video types. More recently, Liu et al. [167]
textures and physical rule violations, while high-quality videos, proposed a novel approach for DIffusion-generated VIdeo Detection
indistinguishable to the naked eye, often manifest temporal (DIVID). DIVID uses CNN+LSTM architectures to capture both spatial
discontinuities in optical flow maps, Bai et al. [163] proposed and temporal features in RGB frames and DIRE values. Initially, the
an effective AI-generated video detection (AIGVDet) scheme CNN is fine-tuned on original RGB frames and DIRE values, followed
by capturing forensic traces with a two-branch spatio-temporal by training the LSTM network based on the CNN’s feature extraction.
Convolutional Neural Network (CNN). This scheme employs two This two-phase training enhances detection accuracy for both in-

- 194 -
Regular Issue

TABLE VI. AI-Generated Image Detection Datasets

Dataset Year Content Real Source Generator #Real #Generated Available


LSUN Bed [168] 2022 Bedroom LSUN GAN/DM 420,000 510,000 
DFF [169] 2023 Face IMDB-WIKI DM 30,000 90,000 
RealFaces [170] 2023 Face - DM - 25,800 
DiffusionForensics [79] 2023 General LSUN ImageNet DM 134,000 481,200 
Synthbuster [76] 2023 General Raise-1k DM - 9,000 
DDDB [171] 2023 Art LAION-5B DM 64,479 73,411 
MSCOCO
DE-FAKE [172] 2023 General DM - 191 946 
Flickr30k
AI-Gen [173] 2023 General ALASKA DM 20,000 40,000 
Various sources including AFHQ,
ArtiFact [174] 2023 General GAN/DM 964,989 1,531,749 
CelebAHQ, COCO, etc.
AutoSplice [175] 2023 General Visual News DM 2,273 3,621 
Various sources including AFHQ, CelebAHQ, LSUN,
HiFi-IFDL [176] 2023 General GAN/DM ~ 600,000 1,300,000 
Youtube face etc.
M3DSYNTH [177] 2023 CT LIDC-IDRI GAN/DM 1,018 8,577 
DIF [74] 2023 General Laion-5B GAN/DM 168,600 168,600 
News: The Guardian, BBC,
DGM4 [178] 2023 General GAN/DM 77,426 152,574 
USA TODAY, Washington Post
COCOFake [179] 2023 General COCO DM ~ 1,200,000 ~ 1,200,000 
DiFF [180] 2024 Face VoxCeleb2 CelebA DM 23,661 537,466 
CIFAKE [181] 2024 General CIFAR-10 DM 60,000 60,000 
GenImage [182] 2024 General ImageNet GAN/DM 1,331,167 1,350,000 
Fake2M [183] 2024 General CC3M GAN/DM - 2,300,000 
WildFake [184] 2024 General Various sources including COCO, FFHQ Laion-5B, etc. GAN/DM 1,013,446 2,680,867 

domain and out-domain videos. Diffusion Reconstruction Error techniques and models. This will allow the development of robust
(DIRE) is calculated as the absolute difference between an original models capable of being applied in real situations.
image and its reconstructed version from a pre-trained diffusion
model, capturing signals of diffusion-generated images. By training A. Image Datasets
the CNN+LSTM with DIRE and RGB frame features, DIVID improves In this section, we highlight some of the key image datasets that
detection accuracy for AI-generated videos. have significantly contributed to state-of-the-art AI-generated
Detecting AI-generated videos is an emerging challenge, distinct imagery. These datasets not only differ in size and content but
from synthetic image detection due to unique forensic traces in video also cater to various research needs, from general-purpose image
content. While promising methods have begun to address this gap, generation to specialized tasks like AI-generated images detection
leveraging spatio-temporal analysis and novel fusion techniques, the and multimodal learning. For a detailed comparison, refer to Table VI,
field is still evolving, see Table V. Continued innovation is essential to which summarizes the features and scope of these datasets.
stay ahead of rapidly advancing video generation technologies. Conceptual Captions 12M (CC12M) [185] is a large-scale dataset
of 12.4 million image-text pairs derived from the Conceptual Captions
3M (CC3M) dataset [186]. CC12M was created by relaxing some of the
V. Datasets
filters used in CC3M to increase the recall of potentially useful image-
One of the most important aspects of DL model development is alt-text pairs. The relaxed filters allow for more diverse and extensive
the availability of quality datasets. These datasets have to have some data, though this results in a slight drop in precision. Unlike CC3M,
fundamental properties to be able to create robust models: to be CC12M does not perform hypernymization or digit substitution,
representative, intra-class variability, balance between classes and a except for substituting person names to protect privacy. This dataset’s
minimum quality. This will allow us to create suitable new generative larger scale and diversity make it well-suited for vision-and-language
and detection models. In this section we will focus on image and video pre-training tasks.
datasets generated with AI. WIT [187] introduced to facilitate multimodal, multilingual
The development of AI-generated images relies heavily on the learning, contains 37.5 million entity-rich image-text examples and
availability of diverse and comprehensive datasets. These datasets 11.5 million unique images across 108 Wikipedia languages. It serves
provide the essential training material for models to learn from, as a pre-training dataset for multimodal models, particularly useful
enabling them to generate realistic and varied images. Ranging from for tasks like image-text retrieval. WIT stands out due to its large size,
large-scale collections of image-text pairs to datasets specifically multilingual nature with over 100 languages, diverse concepts, and a
designed for detecting synthetic content, these resources play a challenging real-world test set. It combines high-quality image-text
pivotal role in advancing the field. Regarding detection, we need pairs from curated datasets like Flickr30K and MS-COCO with the
representative and varied datasets that include different generation scalability of extractive datasets. WIT’s creation involved filtering

- 195 -
International Journal of Interactive Multimedia and Artificial Intelligence, Vol. 9, Nº1

low-information associations and ensuring image quality. The dataset RealFaces [170] consists of 25,800 images generated using Stable
provides multiple text types per image (reference, attribution, and alt- Diffusion, incorporating prompts for photorealistic human faces.
text), offers extensive cross-lingual text pairs, and supports contextual It includes 431 images filtered by an NSFW filter, mainly depicting
understanding with 120 million contextual texts. women and young people.
RedCaps [188] is a large-scale dataset introduced in 2021, consisting Deepart Detection Database (DDDB) [171] is designed for
of 12 million image-text pairs collected from Reddit. This dataset detecting deepfake art. It includes high-quality conventional art
includes images and captions depicting a variety of objects and scenes, from LAION-5B and deepfake art from models like Stable Diffusion,
sourced from a manually curated set of subreddits to ensure diverse DALL-E 2, Imagen, Midjourney, and Parti. Conart images are sourced
yet focused content. The data collection process involves three steps: from LAION-5B, while deeparts are generated using state-of-the-
subreddit selection, image post filtering, and caption cleaning. Images art models or collected from social media. DDDB consists of 64,479
are primarily photographs from 350 selected subreddits, excluding any conventional art images (conart) and 73,411 deepfake art images
NSFW, banned, or quarantined content. Filtering techniques are used (deepart). It supports research in deepart detection, continuously
to maintain high-quality captions and mitigate privacy and harmful updating to incorporate new deeparts and addressing privacy and
stereotypes, resulting in a robust and extensive dataset. storage constraints.
Laion-5b [189] is a large-scale vision-language dataset derived SynthBuster [76]. Due to the scarcity of diffusion model-
from Common Crawl, containing nearly 6 billion image-text pairs. generated images, SynthBuster addresses this by providing a new
Images with alt-text were extracted and processed to remove low- dataset with images from models like Stable Diffusion 1.3, 1.4, 2, and
quality and malicious content. Filtering based on cosine similarity with XL, Midjourney, Adobe Firefly, and DALL·E 2 and 3. While synthetic
OpenAI’s ViT-B/32 CLIP model reduced the dataset size significantly. images are generated from text, SynthBuster uses the existing Raise-
The dataset is divided into three subsets: 2.32 billion English pairs, 1k database of real images, which is a varied subset of the Raise [192]
2.26 billion multilingual pairs, and 1.27 billion pairs with undetected dataset, as a guideline for the generated image. Original images are not
languages. Metadata includes image URLs, text, dimensions, similarity used as prompts to try to recreate or modify a similar image. They are
scores, and NSFW tags. only used as a guideline to create the new prompt for the presentation,
DiffusionDB [190] is the first large-scale prompt dataset totaling to ensure that the resulting image is broadly in the same category
6.5TB, containing 14 million images generated by Stable Diffusion as the original image. For each of the 1000 images, descriptions are
using 1.8 million unique prompts. Constructed by collecting images generated using the Midjourney descriptor [3] and CLIP Interrogator
shared on the Stable Diffusion public Discord server. Most prompts [193]. Then, these descriptions were used as the basis for manually
are between 6 to 12 tokens long, with a significant spike at 75 tokens, writing a text prompt to generate a photo-realistic image loosely based
indicating many users exceed the model’s limit. 98.3% of the prompts on the original image.
are in English, with the rest covering 34 other languages. DiffusionDB DE-FAKE [172] is designed for detecting AI-generated images.
provides unique research opportunities in prompt engineering, Real images are sourced from the MSCOCO and Flickr30k datasets.
explaining large generative models, and detecting deepfakes, serving To create a corresponding set of fake images, prompts from these real
as an important resource for studying prompts in text-to-image images were used to generate 191,946 synthetic images through four
generation and designing next-generation human-AI interaction tools. different image generation models: Stable Diffusion, Latent Diffusion,
DiffusionForensics [79] is a dataset designed for evaluating GLIDE, and DALLE-2.
diffusion-generated image detectors. It includes 42,000 real images from AI-Gen [173] dataset consists of 20,000 uncompressed 256 × 256
LSUN-Bedroom, 50,000 from ImageNet, and 42,000 from CelebA-HQ. PG images from the ALASKA [194] database, which are used to
Generated images are produced by various models, with unconditional construct the T2I dataset. Specific spots and objects are extracted
models like ADM, DDPM, iDDPM, and PNDM generating 42,000 from these Photographs (PG) images, and 5,000 prompts are generated
images each from LSUN-Bedroom. Text-to-image models LDM, SD- with ChatGPT. Two AI systems, DALL·E2 [195] and DreamStudio,
v1, SD-v2, and VQ-Diffusion also generate 42,000 images each, while are used to generate four images per prompt, creating two databases:
IF, DALLE-2, and Midjourney produce fewer images. For ImageNet, DALL·E2 [195] and DreamStudio [196]. Each database contains 20,000
50,000 images each are generated by a conditional model ADM and Photographs (PG) images and corresponding T2I images. The images
a text-to-image model SD-v1. CelebA-HQ includes 42,000 images are resized to 256 × 256, 128 × 128, and 64 × 64, and JPEG compression
generated by SD-v2 and smaller sets by IF, DALLE-2, and Midjourney. is applied with a quality factor between 75 and 95. The datasets are
LSUN Bedroom [168] dataset contains images center-cropped to divided into training (12,000 pairs), validation (3,000 pairs), and testing
256×256 pixels. Samples are either downloaded or generated using (5,000 pairs).
code and pre-trained models from original publications. The dataset AutoSplice [175] is a image dataset containing 5,894 manipulated
includes samples from ten models (e.g. ProGAN, Diff-StyleGAN2, Diff- and authentic images, designed to aid in developing generalized
ProjectedGAN, DDPM, IDDPM,LDM). For each model, 51,000 images detection methods. The dataset consists of 3,621 images generated
were sampled, and the real part is sourced from lsun bedroom dataset by locally or globally manipulating real-world image-caption pairs
[191]. from the Visual News dataset. The DALL-E2 generative model was
DeepFakeFace (DFF) [169] is a dataset designed to evaluate used to create synthetic images based on text inputs. AutoSplice
deepfake detectors, featuring 120,000 images, with 30,000 real images construction involved pre-processing with object detection and text
sourced from the IMDB-WIKI dataset and 90,000 fake images. To parsing, human annotations to select and modify object descriptions,
generate these fake images, three models were used: Stable Diffusion and post-processing to filter out images with visual artifacts. The final
v1.5, Stable Diffusion Inpainting, and InsightFace, each producing dataset includes 3,621 high-quality manipulated images and 2,273
30,000 images. The dataset includes high-resolution images of 512 authentic images, with versions in both lossless and gently lossy JPEG
× 512 pixels. Real images were matched by gender and age, using compression formats.
prompts like "name, celebrity, age" for generation. Discrepancies in ArtiFact [174] is a large-scale dataset designed to evaluate the
facial bounding boxes were corrected using the RetinaFace detector to generalizability and robustness of synthetic image detectors by
ensure accuracy before generating deepfakes. incorporating diverse generators, object categories, and real-world

- 196 -
Regular Issue

impairments. It includes 2,496,738 images, with 964,989 real and B. Video Datasets
1,531,749 fake images. The dataset covers multiple categories such In this section, we review key video datasets that have been pivotal
as Human/Human Faces, Animal/Animal Faces, Places, Vehicles, and in advancing state-of-the-art AI models. These resources Offer diverse
Art, sourced from 8 source datasets (e.g., COCO, ImageNet, AFHQ, video-text pairs, high-resolution clips, and specialized content, each
Landscape) . It features images synthesized by 25 distinct methods, contributing uniquely to the progress of Al-driven video technology.
including 13 GANs (e.g., StyleGAN3, StyleGAN2, ProGAN), 7 Diffusion For a detailed comparison, refer to Table VII, which summarizes the
models (e.g., DDPM, Latent Diffusion, LaMA), and 5 other generators characteristics and scope of these datasets.
(e.g., CIPS, Palette). To ensure real-world applicability, images undergo
impairments like random cropping, resizing, and JPEG compression YT-Tem-180M [198] was collected from 6 million public YouTube
according to IEEE VIP Cup 2022 standards. videos, totaling 180 million clips, and annotated by ASR. It includes
diverse content such as instructional lifestyle vlogs, and auto-
CIFAKE [181] consists of 120,000 images, split evenly between real suggested videos on topics. Videos were filtered to exclude those an
and synthetic images. The real images are taken from the CIFAR-10 English ASR track, over 20 minutes long, in 'ungrounded" categories, or
[197] dataset, comprising 60,000 32x32 RGB images across ten classes: with thumbnails to contain objects. Each video was split into segments
airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck, of an image frame and corresponding spoken words, resulting in 180
with 50,000 images used for training and 10,000 for testing. The million segments.
synthetic images are generated using the CompVis Stable Diffusion
model (version 1.4), which is trained on subsets of the LAION-5B WebVid-2M [199] is a large-scale video-text pretraining dataset
[189] dataset. The generation process involves reverse diffusion from consisting of 2.5 million video-text pairs. The average length of each
noise to create 6,000 images per class, mimicking the CIFAR-10 [197] video is 18.0 seconds, and the average caption length is 12.0 words.
dataset. Similar to the real images, 50,000 synthetic images are used for The raw descriptions for each video are collected from the Alt-text
training and 10,000 for testing, with labels indicating their synthetic HTML attribute associated with web images. This dataset was scraped
nature. from the web using a method similar to Google Conceptual Captions
(CC3M), which includes over 10% of images that are video thumbnails.
GenImage [182] is designed to evaluate detectors’ ability to WebVid-2M captions are manually generated, well-formed sentences
distinguish between AI-generated and real images. It includes 2,681,167 images, with 1,331,167 real images from ImageNet and 1,350,000 fake images generated using eight models: BigGAN, GLIDE, VQDM, Stable Diffusion V1.4, Stable Diffusion V1.5, ADM, Midjourney, and Wukong. The images are balanced across ImageNet's 1000 classes, with specific allocations for training and testing. Each model generates a nearly equal number of images per class, ensuring no overlap in real images. The dataset features high variability and realism, particularly in animals and plants, providing a robust basis for developing detection models.

Fake2M [183] is a large-scale collection of over 2 million AI-generated images. These images are created using three different models: Stable Diffusion v1.5, IF, and StyleGAN3. The dataset aims to investigate whether models can distinguish AI-generated images from real ones.

DiFF [180] comprises over 500,000 images synthesized using thirteen distinct generation methods under four conditions, leveraging 30,000 textual and visual prompts to ensure high fidelity and semantic consistency. The dataset includes pristine images from 1,070 celebrities, curated from sources like VoxCeleb2 and CelebA, totaling 23,661 images. Prompts, derived from these pristine images, include original and modified textual prompts as well as visual prompts. The dataset covers four categories of diffusion models: Text-to-Image (T2I), Image-to-Image (I2I), Face Swapping (FS), and Face Editing (FE), employing methods like Midjourney, Stable Diffusion XL, DreamBooth, DiffFace, and others to generate the forged images.

WildFake [184] is designed to assess the generalizability and robustness of fake image detectors. Developed with diverse content from open-source websites and generative models, it provides a comprehensive set of high-quality fake images. It includes images from DMs, GANs, and other generators, with categories such as "Early" and "Latest" models. The dataset also features nine kinds of DM generators and various fine-tuning strategies for SD-based generators. Images were collected using a generation pipeline from platforms like Civitai and Midjourney, ensuring a representative sample of real-world quality. Real images were sourced from datasets like COCO, FFHQ, and Laion-5B. WildFake contains 3,694,313 images, with 1,013,446 real and 2,680,867 fake images, split into training and testing sets in a 4:1 ratio.
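Most of these image corpora share the same organisation: real and fake images grouped per generator (and, where applicable, per class), combined under a fixed train/test ratio such as the 4:1 split used by WildFake. The snippet below is a minimal sketch, assuming a hypothetical folder layout (real/&lt;class&gt;/ and fake/&lt;generator&gt;/&lt;class&gt;/) and illustrative paths, of how such a real-versus-fake split could be assembled for training a detector; it is not part of any released toolkit, and the class balance itself comes from how the datasets were built.

```python
import random
from pathlib import Path

def build_split(root: str, train_ratio: float = 0.8, seed: int = 0):
    """Gather real/fake images (hypothetical layout) and apply a 4:1 train/test split."""
    rng = random.Random(seed)
    samples = []  # (path, label) with label 0 = real, 1 = fake
    for path in Path(root, "real").rglob("*.png"):
        samples.append((path, 0))
    for path in Path(root, "fake").rglob("*.png"):
        samples.append((path, 1))
    rng.shuffle(samples)
    cut = int(train_ratio * len(samples))  # train_ratio = 0.8 corresponds to a 4:1 split
    return samples[:cut], samples[cut:]

train_set, test_set = build_split("dataset_root")
print(len(train_set), len(test_set))
```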
aligned with the video content, contrasting with the HowTo100M [105] dataset, which contains incomplete sentences from continuous narration that may not be temporally aligned with the video.

CATER-GEN-v1 [151] is a synthetic dataset set in a 3D environment, derived from CATER [210], featuring two objects (cone and snitch) and a large table plane. It includes four atomic actions: "rotate", "contain", "pick-place", and "slide", with each video containing one or two actions. Descriptions are generated using predefined templates, with a resolution of 256x256 pixels. The dataset includes 3,500 training pairs and 1,500 testing pairs.

CATER-GEN-v2 [151] is a more complex version of CATER-GEN-v1, containing 3 to 8 objects per video, each with randomly chosen attributes from five shapes, three sizes, nine colors, and two materials. The actions are the same as in CATER-GEN-v1, but descriptions are designed to create ambiguity by omitting certain attributes. The video resolution is 256x256 pixels, and the dataset includes 24,000 training pairs and 6,000 testing pairs.

Internvid [202] is a video-centric multimodal dataset created for large-scale video-language learning, featuring high temporal dynamics, diverse semantics, and strong video-text correlations. It includes 7 million YouTube videos with an average duration of 6.4 minutes, covering 16 topics. Videos were collected based on popularity and action-related queries, ensuring diversity by including various countries and languages. Each video is segmented into clips, resulting in 234 million clips from 2s to more than 30s duration, which were captioned using a multiscale method focusing on common objects and actions. InternVid emphasizes high resolution, with 85% of videos at 720P, and provides comprehensive multimodal data including audio, metadata, and subtitles. The dataset is notable for its action-oriented content, containing significantly more verbs compared to other datasets, and includes 7.1 million interleaved video-text data pairs for in-context learning.

FlintstonesHD [153] is a densely annotated long video dataset created to promote the development of long video generation. The dataset is built from the original Flintstones cartoon, containing 166 episodes with an average of 38,000 frames per episode, each at a resolution of 1440 × 1080 pixels.


TABLE VII. Video Datasets. Datasets marked with an asterisk (*) are used in AI-generated video detection.

| Dataset | Year | Source | Size | Domain | Resolution | Text | Avg len (sec) | Duration (hrs) | Unique Features |
| YT-Tem-180M [198] | 2021 | YouTube, HowTo100M | 180M Videos, 180M Text | Open | - | ASR | - | - | Filters to exclude non-English ASR and visually «ungrounded» categories |
| WebVid-2M [199] | 2021 | Web | 2.5M Videos, 2.5M Text | Open | 360p | Manual | 18.0 | 13K | Manually generated captions, aligned with video content |
| WebVid-10M [199] | 2021 | Web | 10M Videos, 10M Text | Open | 360p | Alt-Text | 18.0 | 52K | Manually generated captions, aligned with video content |
| CATER-GEN-v1 [151] | 2022 | Synthetic 3D objects | 5K Videos, 5K Text | Geometric | 256p | Predefined template | - | - | Synthetic, simple scenes with atomic actions |
| CATER-GEN-v2 [151] | 2022 | Synthetic 3D objects | 30K Videos, 30K Text | Geometric | 256p | Predefined template | - | - | Increased complexity with more objects and attributes |
| CelebV-HQ [200] | 2022 | Web | 35,666 Videos | Face | 512p | Manual | 3 to 20 | 65 | High-quality, detailed text descriptions |
| HD-VILA-100M [201] | 2022 | YouTube | 103M Videos, 103M Text | Open | 720p | ASR | 13.4 | 371.5K | High-quality alignment of videos and transcriptions |
| Internvid [202] | 2023 | YouTube | 7.1M Videos, 234M clips | Open | 360p/512p/720p | Generated | 11.7 | 760.3K | Action-oriented, diverse languages, and high video-text correlation |
| FlintstonesHD [153] | 2023 | Flintstones cartoon | 166 episodes | Cartoon | 1440x1080 | Generated | - | - | Densely annotated for long video generation |
| Celebv-text [203] | 2023 | Web | 70K Videos, 1.4M Text | Face | 512p+ | Semi-auto generated | <5 | 279 | High-quality, detailed text descriptions |
| HD-VG-130M [204] | 2023 | YouTube | 130M Videos, 130M Text | Open | 720p | Generated | ~5.1 | 184K | High-definition, single-scene clips |
| Youku-mPLUG [205] | 2023 | Youku platform | 10M Videos, 10M Text | Open | - | - | 54.2 | 150K | Focused on advancing Chinese multimodal LLMs |
| VidProM [206] | 2024 | Pika Discord | 1.67M prompts, 6.69M Videos | Open | - | Manual | - | - | Extensive prompts with semantic uniqueness |
| MiraData [207] | 2024 | YouTube, Videvo, Pixabay, Pexels, HD-VILA-100M | - | Open | 720p | Generated | 72.1 | 16K | High visual quality, detailed captions |
| GenVideo [162] * | 2024 | Kinetics-400, Youku-mPLUG, MSR-VTT, video generation methods | ~2.31M Videos | Open | - | Automatic | 2 to 6 | - | Balance of real and fake videos across diverse scenes |
| ExposingAI-Video [160] * | 2024 | MSVD, Potat1, Ali-vilab, ZScope, T2V-zero | 2K Videos | Open | - | Automatic | - | - | H.265 compression and quality degradation simulation |
| Synth-vid-detect [159] * | 2024 | MIT, Video-ACID, video generation methods | 18.75K Videos | Open | - | Automatic | - | - | H.265 compression, out-of-distribution test set |
| GVD [163] * | 2024 | GOT, Youtube_vos2, video generation methods | - | Open | - | Automatic | - | - | Collection from various SOTA models |
| GVF [164] * | 2024 | MSVD, MSR-VTT, video generation methods | 964 Videos, 964 Text | Open | - | Automatic | - | - | Diversity in forgery targets, scenes, and behaviors |
| GenVidDet [165] * | 2024 | InternVid, HD-VG-130M, video generation methods | ~2.66M Videos | Open | 256p/512p/720p | Automatic | - | 4442 | Large-scale dataset covering diverse content |
| TOINR [167] * | 2024 | VidVRD, SVD-XT, YouTube, SORA, Pika, GEN-2 | ~2.826K Videos | Open | - | Automatic | - | - | Out-domain testing with various generation tools |
| Panda-70m [208] | 2024 | HD-VILA-100M | 70.8M Videos, 70.8M Text | Open | 720p | Automatic | 8.5 | 166.8K | High-quality captions with significant improvements in downstream tasks |
| VAST-27M [209] | 2024 | HD-VILA-100M | 27M Videos, 297M Text | Open | - | Generated | 5 to 30 | - | Comprehensive with vision, audio, and omni-modality captions |


Unlike existing video datasets, FlintstonesHD addresses issues such as short video lengths, low resolution, and coarse annotations. The image captioning model GIT2 [211] was used to generate dense captions for each frame, with manual filtering to correct errors, thus providing detailed annotations that capture movement and story nuances. This dataset serves as a benchmark for improving long video generation.

Celebv-text [203] is a large-scale facial text-video dataset aimed at providing high-quality video samples with relevant, diverse text descriptions. Constructed through data collection and processing, data annotation, and semi-automatic text generation, it features 70,000 video clips totaling around 279 hours. Videos were sourced from the internet, using queries like human names and movie titles, excluding low-resolution and short clips, and processed to maintain high quality without upsampling or downsampling. Annotations include static attributes like general appearance and light conditions, and dynamic attributes like actions and emotions, with both automatic and manual methods used for accuracy. Texts were generated using a combination of manual descriptions and auto-generated templates based on common grammar structures, resulting in longer and more detailed text descriptions compared to other datasets. CelebV-Text surpasses existing datasets like MM-Vox [212] and CelebV-HQ [200] in scale, resolution, and text-video relevance, offering a comprehensive resource for facial video analysis.

VidProM [206] is a large-scale dataset for text-to-video diffusion models, collected from Pika Discord channels between July 2023 and February 2024. It includes 1,672,243 unique text-to-video prompts, embedded with 3072-dimensional embeddings using OpenAI's text-embedding-3-large API. The dataset includes NSFW probabilities assigned using the Detoxify model, with less than 0.5% of prompts flagged as potentially unsafe. It features 6.69 million videos generated by Pika, VideoCraft2, Text2Video-Zero, and ModelScope, involving significant computational resources. After filtering for semantic uniqueness, VidProM retains 1,038,805 unique prompts. Compared to DiffusionDB, VidProM has 40.6% more semantically unique prompts and supports longer, more complex prompts due to its advanced embedding model. VidProM includes videos generated by four state-of-the-art models, resulting in over 14 million seconds of video content. Its extensive video content and complex prompts, requiring dynamic and temporal descriptions, make it a valuable resource for developing text-to-video generative models.

MiraData [207] is a large-scale text-video dataset with long durations and detailed structured captions. The dataset, finalized through a five-step process, sources videos from YouTube, Videvo, Pixabay, and Pexels to ensure diverse content and high visual quality. From YouTube, 156 high-quality channels were selected, resulting in 68K videos and 173K clips post-processing. Additional videos were sourced from HD-VILA-100M, Videvo (63K), Pixabay (43K), and Pexels (318K). Video clips were split and stitched using models like Qwen-VL-Chat and DINOv2, ensuring semantic coherence and content continuity. MiraData provides five versions of filtered data based on video color, aesthetic quality, motion strength, and NSFW content, ranging from 788K down to 9K clips. Captions were generated using GPT-4V, resulting in dense and structured descriptions with average lengths of 90 and 214 words, respectively. MiraData surpasses previous datasets in visual quality and motion strength, making it ideal for text-to-video generation tasks.

GenVideo [162] is a large-scale dataset developed to evaluate the generalizability and robustness of AI-generated video detection models. The training set contains 2,294,594 video clips, including 1,213,511 real and 1,081,083 fake videos, while the testing set includes 19,588 video clips, with 10,000 real and 8,588 fake videos. The dataset features high-quality fake videos sourced from open-source websites and various pre-trained models, covering a wide range of scenes such as landscapes, people, buildings, and objects. Video durations range from 2 to 6 seconds, with diverse aspect ratios. Real videos are sourced from datasets like Kinetics-400, Youku-mPLUG, and MSR-VTT [213]. Fake videos are generated using diffusion-based models, auto-regressive models, and other methods such as VideoPoet, Emu, Sora, VideoCrafter, latent flow diffusion models, and masked generative video transformers. Additionally, sources include external web scraping and service-based methods like the Pika website. This diverse and comprehensive collection aims to enhance the understanding and detection of AI-generated videos across numerous real-world contexts.

ExposingAI-Video [160] is composed of 1,000 natural videos sourced from the MSVD [214] dataset, paired with 1,000 fake videos generated using four advanced diffusion-based video generators, resulting in 96,000 fake frames. The dataset offers diverse content driven by text prompts, featuring rich motion information distinct from static images. It includes videos generated by models such as ali-vilab, zeroscope, potat1, and a zero-shot text-to-video model, each providing unique configurations. Additionally, the dataset incorporates three video post-processing operations (H.265 ABR compression, H.265 CRF compression, and bit errors) to simulate quality degradation for robustness evaluation.

Synth-vid-detect [159] consists of both real and synthetic videos for training and evaluation. It includes 7,654 real videos for training, 784 for validation, and 1,661 for testing, sourced from the Moments in Time (MIT) [215] and Video-ACID [216] datasets. The synthetic videos, totaling 6,197 for training, 624 for validation, and 1,429 for testing, were generated using Luma, VideoCrafter-v1, CogVideo, and Stable Video Diffusion, with diverse scenes and activities represented. All videos were compressed using H.264 at a constant rate factor of 23. For testing, an exclusive set of prompts and videos was used to avoid overlap with the training data. Additionally, the dataset includes an out-of-distribution, test-only set of 401 synthetic videos generated by Sora, Pika, and VideoCrafter-v2.
The Generated Video Dataset (GVD) [163] includes 11,618 video samples produced by 11 different state-of-the-art generator models. These models generate videos using either T2V or I2V techniques. The dataset was primarily collected from the Discord platform, where users share videos generated by various models. For training and validation, 550 T2V-generated videos from Moonvalley [217] and 550 real videos from the YouTube_vos2 [218] dataset were used. All generated videos not used in training and validation are designated for testing, with real test videos sourced from the GOT [219] dataset.

The GeneratedVideoForensics (GVF) [164] dataset consists of 964 triples, each containing a real video, a corresponding text prompt, and a video generated by one of four different open-source text-to-video generation models: Text2Video-zero, ModelScopeT2V, ZeroScope, and Show-1. These models cover various forgery targets, scenes, behaviors, and actions, ensuring the dataset's diversity. The real videos and prompts were collected from the MSVD [214] and MSR-VTT [213] datasets, with a focus on simulating realistic video distributions across spatial and temporal dimensions. It also includes videos from the most popular commercial models, such as OpenAI's Sora, Pika, Gen-2 and Google's Veo.

GenVidDet [165] is a large-scale video dataset created for AI-generated video detection, comprising over 2.66 million clips with more than 4,442 hours of content. It includes real videos sourced from the InternVid [202] and HD-VG-130M [204] datasets, totaling over 1.46 million clips, and AI-generated videos from the VidProM dataset using four different models, adding approximately 1.12 million clips. Additionally, new AI-generated videos were created using the latest models, such as Open-Sora, StreamingT2V and DynamiCrafter, to enhance the dataset's diversity.


Fig. 11. Overview of trends and challenges in the generation and detection of AI-generated image and video samples. The diagram groups future trends (diffusion models, zero-shot learning, interpretability and transparency, multimodal generation, model robustness) and challenges (temporal consistency, computational requirements, constant adaptation, generalizability, ethical aspects).

The Turns Out I'm Not Real (TOINR) [167] dataset was constructed to evaluate a method using public video generation tools, including Stable Video Diffusion (SVD), Pika, Gen-2, and SORA. The dataset includes 1,000 real video clips from the ImageNet Video Visual Relation Detection (VidVRD) [220] dataset and 1,000 fake video clips generated with SVD-XT [89]. It also comprises additional real and fake clips for out-domain testing: 107 real (VidVRD) and 107 fake clips generated with Pika, 107 real (VidVRD) and 107 fake clips generated with Gen-2, and 207 real and 191 fake clips sourced from YouTube and the SORA website.

HD-VILA-100M [201] is a high-resolution and diversified video-language dataset designed to overcome limitations in existing datasets. Introduced to aid tasks such as text-to-video retrieval and video QA, it comprises 103 million video clip and sentence pairs from 3.3 million videos, totaling 371.5K hours. Sourced from diverse YouTube content, including professional channels like BBC Earth and National Geographic, HD-VILA-100M emphasizes quality and alignment of videos and transcriptions. Only videos with subtitles and 720p resolution were included, resulting in a final set of 3.3 million videos, balanced across 15 categories. For video-text pairing, the dataset utilizes video transcriptions instead of manual annotations, offering richer information. Subtitles, often generated by ASR, were split into complete sentences using an off-the-shelf tool. Sentences were aligned with video clips using Dynamic Time Warping, producing pairs averaging 13.4 seconds in length and 32.5 words per sentence.

HD-VG-130M [204] is a large-scale dataset for text-to-video generation, comprising 130 million text-video pairs from the open domain. Created to address limitations in existing datasets, it features high-definition (720p), widescreen, and watermark-free videos. Collected from YouTube, the videos were processed using PySceneDetect for scene detection, resulting in single-scene clips of less than 20 seconds each. Captions were generated using BLIP-2, ensuring that descriptions, typically around 10 words, are representative of the visual content. Covering 15 categories, HD-VG-130M provides diverse and high-quality data for training video generation models.
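The scene-splitting and captioning pipeline described for HD-VG-130M can be approximated with off-the-shelf tools. The following is a hedged sketch, not the authors' actual code, assuming PySceneDetect for shot boundaries and the BLIP-2 checkpoint Salesforce/blip2-opt-2.7b from Hugging Face transformers to caption a representative frame of each clip; the video path and choice of the middle frame are illustrative.

```python
import cv2
from PIL import Image
from scenedetect import detect, ContentDetector
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def caption_scenes(video_path: str):
    """Split a video into single-scene clips and caption the middle frame of each."""
    scenes = detect(video_path, ContentDetector())  # list of (start, end) timecodes
    cap = cv2.VideoCapture(video_path)
    captions = []
    for start, end in scenes:
        mid = (start.get_frames() + end.get_frames()) // 2
        cap.set(cv2.CAP_PROP_POS_FRAMES, mid)
        ok, frame = cap.read()
        if not ok:
            continue
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        inputs = processor(images=image, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=20)  # short, ~10-word captions
        captions.append(processor.decode(out[0], skip_special_tokens=True).strip())
    cap.release()
    return captions

print(caption_scenes("example_clip.mp4"))
```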
Youku-mPLUG [205] is the first Chinese video-language pretraining dataset, released in 2023 and collected from the Youku video-sharing platform. It comprises 10 million high-quality Chinese video-text pairs filtered from 400 million raw videos, covering 45 diverse categories with an average video length of 54.2 seconds. This dataset was created to advance vision-language pre-training (VLP) and multimodal large language models (LLMs) within the Chinese community. Strict criteria for safety, diversity, and quality were applied, involving multi-level risk detection to eliminate high-risk content and video fingerprinting to ensure a balanced distribution. Additionally, the dataset includes 0.3 million videos for downstream benchmarks, designed to assess video-text retrieval, video captioning, and video category classification tasks.

Panda-70m [208] is a large-scale video dataset created for video captioning, video and text retrieval, and text-driven video generation. It consists of 70 million high-resolution, semantically coherent video clips with captions. The dataset was developed from 3.8 million long videos collected from HD-VILA-100M [201]. To generate accurate captions, a two-stage semantics-aware splitting algorithm was used, followed by multiple cross-modality teacher models to predict candidate captions. A subset of 100,000 videos was manually annotated to fine-tune a retrieval model, which then selected the best captions for the entire dataset. Panda-70M addresses the challenge of collecting high-quality video-text data and shows significant improvements in downstream tasks. The dataset primarily contains vocal-intensive videos such as news, TV shows, and documentaries.

VAST-27M [209] consists of a total of 27 million video clips covering diverse categories, each paired with 11 captions (5 vision, 5 audio, and 1 omni-modality). The average lengths of vision, audio, and omni-modality captions are 12.5, 7.2, and 32.4 words, respectively. The dataset bridges various modalities, including vision, audio, and subtitles in videos. The clips were selected from the HD-VILA-100M dataset [201], ensuring each clip is between 5 and 30 seconds long and contains all three modalities. Vision captions were generated using a model trained on corpora such as MSCOCO, VATEX, MSRVTT, and MSVD [214], while audio captions were generated using the VALOR-1M and WavCaps datasets. An LLM, Vicuna-13b, was used to integrate these captions into a single omni-modality caption. VAST-27M spans over 15 categories, including music, gaming, education, entertainment, and animals. Despite its comprehensiveness, the dataset may inherit biases from the corpora and models used in its creation, highlighting the need for more diverse and larger-scale omni-modality corpora.

VI. Challenges and Future Trends

Throughout this state-of-the-art review we have analysed the most recent approaches and methodologies for the generation and detection of synthetic video and image samples. This has given us a global view of the area, as well as a glimpse of current research trends and the challenges researchers will have to face in the coming years, see Fig. 11.


First of all, we will focus on analysing the trends that will drive research in the area in the coming years, based on the results obtained from this analysis.

1. Sample generation with diffusion models. The diffusion process in these models involves iterating over the input data and gradually refining the generation to fit a target distribution or to achieve the desired effect. As we have observed throughout the different sections related to the generation of samples, whether video or image, diffusion models appear to be predominating over the rest of the generation techniques, such as autoencoders or GANs. Taking into account all the research being carried out in this domain, it would not be surprising if they come to dominate multimedia content generation in the coming years.

2. Zero-Shot Learning. This learning approach is a game changer, as it allows generative models to create content in new domains, even with entirely new features, without needing to be trained with data from those exact situations. This makes it possible, within generative techniques, to generate a wide range of content, even when a large amount of labelled data is not available. But it remains difficult to develop models capable of accurately understanding and generating content in completely new contexts. Regarding detection, zero-shot learning has the potential to help identify AI-generated content in many different data types and formats, even in the absence of huge curated datasets. However, the wide variety of synthetic content creation methods makes it difficult to create perfectly adapted detection models. Further research is needed to determine how to improve the generalisability of these models.

3. Interpretability and Transparency. As the content generated by AI becomes more sophisticated, it becomes increasingly important to ensure that detection models are not only effective, but also easy to understand. Users need to be convinced that the model is making the right decisions, which means that the model needs to provide clear and understandable reasons for why it has identified something as synthetic. In addition, these techniques allow us to understand whether the features that the models use to arrive at the output are adequate or whether the system has deficiencies or biases. Therefore, the application of explainability techniques has many advantages.

4. Multimodal data generation. As we have seen in Section V, multimodal sample generation techniques are the least explored of all. The main reason may be their complexity, as a very precise synchronisation between video and audio has to be achieved. However, it is quite possible that this approach will start to become more relevant, due to the opportunities it presents. Regarding synthetic multimodal data detection techniques, research will be extremely limited until quality datasets are available to train robust models, capable of being applied to real situations.

5. Model robustness. Detection models must be able to withstand various transformations and adversarial attacks, such as image compression, blurring or text paraphrasing, which can significantly degrade detection performance. The ability to withstand such manipulations is crucial for the reliable identification of synthetic content, since these distortions can effectively compromise a model's ability to identify it correctly; overcoming them is therefore essential to ensure that detectors work reliably in all kinds of real-world scenarios (a simple check of this kind is sketched below).
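The robustness trend above is typically assessed by re-scoring a detector on perturbed copies of the same inputs. The snippet below is a minimal sketch of such a check, assuming a hypothetical detect(image) function that returns a fake-probability; the JPEG quality and blur radius are arbitrary illustrative choices, not values prescribed by any of the surveyed works.

```python
import io
from PIL import Image, ImageFilter

def jpeg_compress(image: Image.Image, quality: int = 30) -> Image.Image:
    """Round-trip the image through JPEG to simulate sharing-platform compression."""
    buffer = io.BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return Image.open(buffer).copy()

def evaluate_robustness(image_path: str, detect) -> dict:
    """Compare detector scores on the clean image and on perturbed versions."""
    clean = Image.open(image_path)
    perturbed = {
        "clean": clean,
        "jpeg_q30": jpeg_compress(clean, quality=30),
        "blur_r2": clean.filter(ImageFilter.GaussianBlur(radius=2)),
    }
    return {name: detect(img) for name, img in perturbed.items()}

# Example with a dummy detector that always answers 0.5 (replace with a real model)
print(evaluate_robustness("sample.png", detect=lambda img: 0.5))
```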
Finally, we are going to explore the different challenges that the field of video and image generation is likely to face. This review has highlighted several weaknesses that must be addressed, as they represent significant obstacles for future research in this domain.

1. Temporal Consistency. One of the main problems in the generation of synthetic video samples is the formation of artefacts or inconsistencies between the created frames. Smooth and realistic motion patterns are essential for video sequences; however, generative models may find it difficult to maintain this from frame to frame. In addition, inconsistent frame transitions can lead to visual artifacts such as flicker, which affect the realism of the generated content. Although techniques such as Implicit Neural Representations (INR), interleaving multiple temporal attention layers (see the sketch after this list), full fine-tuning on video datasets, and hierarchical discriminators have shown promise, further research is necessary to achieve smooth and realistic video sequences.

2. Computational Requirements. Video generation and detection involve processing high-dimensional data, which significantly increases the computational requirements for training and inference and can be an obstacle for small organizations. Developing more efficient algorithms and parallelization techniques for video generation is an ongoing challenge.

3. Constant adaptation. As we have seen in this survey, there are two main lines of research: the generation of synthetic samples and their detection techniques. Every day there are new, more sophisticated generation techniques that produce more realistic samples, so new detection models capable of distinguishing these synthetic samples from real ones have to be constantly developed; it is, in effect, a race. The same applies to the development of new, high-quality datasets, which are the starting point of detection systems. Another approach may be the periodic retraining of models; whether to simply re-train a model from scratch or to keep updating it through continuous learning is an open question that researchers are still working on.

4. Generalizability of Detection Models. A key challenge for detection models is to be able to handle new data and new models. Generative AI models (GAIMs) evolve rapidly, and if a detection model is too focused on the specific data it has been trained on, it tends to struggle with new, unseen data and updated models. To remain relevant and effective, detection models must be able to generalise to different datasets and types of generative architectures.

5. Ethical Aspects. The realistic nature of AI-generated content raises serious ethical questions, particularly when it comes to potential misuse. Deepfakes, fake news and other misleading content can cause real harm. To combat this, it is not enough to develop effective detection methods. We also need ethical guidelines, regulations and access controls to prevent AI technology from being used in harmful ways.
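As a concrete illustration of the temporal-attention idea mentioned in challenge 1, the block below is a minimal, self-contained sketch (not taken from any specific model) of a self-attention layer applied along the time axis of per-frame feature vectors, the basic mechanism that many video diffusion models interleave with spatial layers to keep frames consistent.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the time axis of per-frame feature vectors."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim); each frame attends to every other frame
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out  # residual connection keeps the per-frame content intact

# Toy usage: 2 videos, 16 frames, 64-dimensional frame features
layer = TemporalAttention(dim=64)
frames = torch.randn(2, 16, 64)
print(layer(frames).shape)  # torch.Size([2, 16, 64])
```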
VII. Conclusions

Generative AI has witnessed exponential growth in recent years, exemplified by tools like ChatGPT that showcase its advancing capabilities. Multimedia content generation models have achieved remarkable performance across a variety of tasks, offering substantial benefits to domains such as entertainment, education, and cybersecurity. However, these advancements also introduce risks that cannot be ignored. Alongside the development of new generative AI models for producing high-quality multimedia content, there is a critical need to create detection systems that can be effectively applied in real-world situations.

This review aims to address these dual objectives by providing a comprehensive analysis of synthetic image and video generation techniques, as well as the methods used for their detection.


It also examines the principal datasets available in the current state of the art and explores future trends and challenges faced by researchers in the field. By critically evaluating the existing technologies for generating and detecting multimedia content, we seek to define the research directions that should be pursued in the coming years. The insights gathered from this survey are intended to facilitate and stimulate further research on generative AI techniques for multimedia content, ultimately contributing to both the advancement of the field and the mitigation of associated risks.

Acknowledgment

This work has been partially supported by the project PCI2022-134990-2 (MARTINI) of the CHISTERA IV Cofund 2021 program; by MCIN/AEI/10.13039/501100011033/ and European Union NextGenerationEU/PRTR for the XAI-Disinfodemics (PLEC 2021-007681) grant; by the European Commission under IBERIFIER Plus - Iberian Digital Media Observatory (DIGITAL-2023-DEPLOY-04-EDMO-HUBS 101158511); by the TUAI Project (HORIZON-MSCA-2023-DN-01-01, Proposal number: 101168344); by EMIF, managed by the Calouste Gulbenkian Foundation, in the project MuseAI; and by Comunidad Autonoma de Madrid, CIRMA-CM Project (TEC-2024/COM-404). Abdenour Hadid is funded by a TotalEnergies collaboration agreement with Sorbonne University Abu Dhabi.

References

[1] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, M. Chen, "Hierarchical text-conditional image generation with clip latents," arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022.
[2] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, M. Chen, "Glide: Towards photorealistic image generation and editing with text-guided diffusion models," 2022. [Online]. Available: https://arxiv.org/abs/2112.10741.
[3] Midjourney, "Midjourney platform." [Online]. Available: https://www.midjourney.com/home, Accessed: Nov. 07, 2024.
[4] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, et al., "Photorealistic text-to-image diffusion models with deep language understanding," Advances in neural information processing systems, vol. 35, pp. 36479–36494, 2022.
[5] D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang, R. Hornung, H. Adam, H. Akbari, Y. Alon, V. Birodkar, et al., "Videopoet: A large language model for zero-shot video generation," arXiv preprint arXiv:2312.14125, 2023.
[6] OpenAI, "Sora: Video generation models as world simulators," OpenAI, 2024. [Online]. Available: https://openai.com/index/sora/, Accessed: Nov. 07, 2024.
[7] J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, et al., "Genie: Generative interactive environments," in Proceedings of the 41st International Conference on Machine Learning, vol. 235 of Proceedings of Machine Learning Research, 21–27 Jul 2024, pp. 4603–4623, PMLR.
[8] G. Madaan, S. K. Asthana, J. Kaur, "Generative ai: Applications, models, challenges, opportunities, and future directions," Generative AI and Implications for Ethics, Security, and Data Management, pp. 88–121, 2024.
[9] X. Zhao, X. Zhao, "Application of generative artificial intelligence in film image production," Computer-Aided Design & Applications, vol. 21, pp. 29–43, 2024, doi: 10.14733/cadaps.2024.S27.29-43.
[10] Á. Huertas-García, H. Liz, G. Villar-Rodríguez, A. Martín, J. Huertas-Tato, D. Camacho, "Aida-upm at semeval-2022 task 5: Exploring multimodal late information fusion for multimedia automatic misogyny identification," in Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), 2022, pp. 771–779.
[11] N. Anantrasirichai, D. Bull, "Artificial intelligence in the creative industries: a review," Artificial intelligence review, vol. 55, no. 1, pp. 589–656, 2022.
[12] H. Choi, Generative AI Art Exploration and Image Generation Fine Tuning Techniques. PhD dissertation, California Institute of the Arts.
[13] A. Doe, B. Smith, C. White, "Gans for medical image synthesis: A comprehensive review," Medical Image Analysis, vol. 78, p. 102345, 2023.
[14] U. Mittal, S. Sai, V. Chamola, et al., "A comprehensive review on generative ai for education," IEEE Access, vol. 12, pp. 142733–142759, 2024.
[15] H. S. Mavikumbure, V. Cobilean, C. S. Wickramasinghe, D. Drake, M. Manic, "Generative ai in cyber security of cyber physical systems: Benefits and threats," in 2024 16th International Conference on Human System Interaction (HSI), 2024, pp. 1–8, IEEE.
[16] S. Oh, T. Shon, "Cybersecurity issues in generative ai," in 2023 International Conference on Platform Technology and Service (PlatCon), 2023, pp. 97–100, IEEE.
[17] H. Liz-Lopez, M. Keita, A. Taleb-Ahmed, A. Hadid, J. Huertas-Tato, D. Camacho, "Generation and detection of manipulated multimodal audiovisual content: Advances, trends and open challenges," Information Fusion, vol. 103, p. 102103, 2024.
[18] A. Giron, J. Huertas-Tato, D. Camacho, "Multimodal analysis for identifying misinformation in social networks," in The 2024 World Congress on Information Technology Applications and Services, 2024, World IT Congress 2024.
[19] K. Shiohara, T. Yamasaki, "Detecting deepfakes with self-blended images," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18720–18729.
[20] A. Martín, A. Hernández, M. Alazab, J. Jung, D. Camacho, "Evolving generative adversarial networks to improve image steganography," Expert Systems with Applications, vol. 222, p. 119841, 2023.
[21] Á. Huertas-García, A. Martín, J. Huertas-Tato, D. Camacho, "Camouflage is all you need: Evaluating and enhancing transformer models robustness against camouflage adversarial attacks," IEEE Transactions on Emerging Topics in Computational Intelligence, 2024.
[22] T. Zhang, "Deepfake generation and detection, a survey," Multimedia Tools and Applications, vol. 81, no. 5, pp. 6259–6276, 2022.
[23] S. Tyagi, D. Yadav, "A detailed analysis of image and video forgery detection techniques," The Visual Computer, vol. 39, no. 3, pp. 813–833, 2023.
[24] Z. Jia, Z. Zhang, L. Wang, T. Tan, "Human image generation: A comprehensive survey," ACM Computing Surveys, 2022.
[25] A. Figueira, B. Vaz, "Survey on synthetic data generation, evaluation methods and gans," Mathematics, vol. 10, no. 15, p. 2733, 2022.
[26] T. T. Nguyen, Q. V. H. Nguyen, D. T. Nguyen, D. T. Nguyen, T. Huynh-The, S. Nahavandi, T. T. Nguyen, Q.-V. Pham, C. M. Nguyen, "Deep learning for deepfakes creation and detection: A survey," Computer Vision and Image Understanding, vol. 223, p. 103525, 2022.
[27] A. Bauer, S. Trapp, M. Stenger, R. Leppich, S. Kounev, M. Leznik, K. Chard, I. Foster, "Comprehensive exploration of synthetic data generation: A survey," arXiv preprint arXiv:2401.02524, 2024.
[28] P. Cao, F. Zhou, Q. Song, L. Yang, "Controllable generation with text-to-image diffusion models: A survey," arXiv preprint arXiv:2403.04279, 2024.
[29] I. Joshi, M. Grimmer, C. Rathgeb, C. Busch, F. Bremond, A. Dantcheva, "Synthetic data in human analysis: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 7, pp. 4957–4976, 2024, doi: 10.1109/TPAMI.2024.3362821.
[30] P. Cao, F. Zhou, Q. Song, L. Yang, "Controllable generation with text-to-image diffusion models: A survey," 2024. [Online]. Available: https://arxiv.org/abs/2403.04279.
[31] T. Zhang, Z. Wang, J. Huang, M. M. Tasnim, W. Shi, "A survey of diffusion based image generation models: Issues and their solutions," 2023. [Online]. Available: https://arxiv.org/abs/2308.13142.
[32] A. Sauer, T. Karras, S. Laine, A. Geiger, T. Aila, "Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis," in International conference on machine learning, 2023, pp. 30105–30118, PMLR.
[33] M. Kang, J.-Y. Zhu, R. Zhang, J. Park, E. Shechtman, S. Paris, T. Park, "Scaling up gans for text-to-image synthesis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10124–10134.
[34] H. Ku, M. Lee, "Textcontrolgan: Text-to-image synthesis with controllable generative adversarial networks," Applied Sciences, vol. 13, no. 8, p. 5098, 2023.
[35] M. Tao, B.-K. Bao, H. Tang, C. Xu, "Galip: Generative adversarial clips for text-to-image synthesis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14214–14223.


[36] Y. A. Ahmed, A. Mittal, "Unsupervised co-generation of foreground-background segmentation from text-to-image synthesis," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, vol. 12, 2024, pp. 5058–5069.
[37] Y. Xu, Y. Zhao, Z. Xiao, T. Hou, "Ufogen: You forward once large scale text-to-image generation via diffusion gans," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 8196–8206.
[38] J. Shi, C. Wu, J. Liang, X. Liu, N. Duan, "Divae: Photorealistic images synthesis with denoising diffusion decoder," arXiv preprint arXiv:2206.00386, 2022.
[39] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., "Learning transferable visual models from natural language supervision," in International conference on machine learning, 2021, pp. 8748–8763, PMLR.
[40] H. Chang, H. Zhang, J. Barber, A. Maschinot, J. Lezama, L. Jiang, M.-H. Yang, K. Murphy, W. T. Freeman, M. Rubinstein, Y. Li, D. Krishnan, "Muse: Text-to-image generation via masked generative transformers," 2023. [Online]. Available: https://arxiv.org/abs/2301.00704.
[41] M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, et al., "Cogview: Mastering text-to-image generation via transformers," Advances in neural information processing systems, vol. 34, pp. 19822–19835, 2021.
[42] M. Ding, W. Zheng, W. Hong, J. Tang, "Cogview2: Faster and better text-to-image generation via hierarchical transformers," Advances in Neural Information Processing Systems, vol. 35, pp. 16890–16902, 2022.
[43] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–10695.
[44] A. Razzhigaev, A. Shakhmatov, A. Maltseva, V. Arkhipkin, I. Pavlov, I. Ryabov, A. Kuts, A. Panchenko, A. Kuznetsov, D. Dimitrov, "Kandinsky: an improved text-to-image synthesis with image prior and latent diffusion," arXiv preprint arXiv:2310.03502, 2023.
[45] J. Yang, J. Feng, H. Huang, "Emogen: Emotional image content generation with text-to-image diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 6358–6368.
[46] H. Li, C. Shen, P. Torr, V. Tresp, J. Gu, "Self-discovering interpretable diffusion latent directions for responsible text-to-image generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 12006–12016.
[47] J. Ho, T. Salimans, "Classifier-free diffusion guidance," 2022. [Online]. Available: https://arxiv.org/abs/2207.12598.
[48] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, R. Rombach, "Sdxl: Improving latent diffusion models for high-resolution image synthesis," arXiv preprint arXiv:2307.01952, 2023.
[49] Z. Xue, G. Song, Q. Guo, B. Liu, Z. Zong, Y. Liu, P. Luo, "Raphael: Text-to-image generation via large mixture of diffusion paths," Advances in Neural Information Processing Systems, vol. 36, 2024.
[50] G. DeepMind, "Imagen 2." http://tinyurl.com/3pakj3mk, 2023.
[51] J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al., "Improving image generation with better captions," Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, vol. 2, no. 3, p. 8, 2023.
[52] L. Chen, W. Zhao, L. Xu, "Augmented cyclegan for enhanced image-to-image translation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2345–2354.
[53] Y. Wang, K. Liu, H. Zhang, "Dualgan++: Robust and efficient image-to-image translation," IEEE Transactions on Image Processing, vol. 32, pp. 678–690, 2023.
[54] M. Li, E. Johnson, R. Wang, "Cut++: Enhanced contrastive unpaired translation for image synthesis," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2023, pp. 3456–3465.
[55] T. Nguyen, W. Huang, S. Lee, "Spade++: Spatially-adaptive gans for high-resolution image synthesis," Pattern Recognition, vol. 122, pp. 108–119, 2022.
[56] S. Kim, D. Park, M. Lee, "Self-supervised image translation gan for high-quality synthetic image generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4567–4576.
[57] H. Zhang, Y. Wang, K. Liu, "Unified multimodal gan for diverse image-to-image translation," IEEE Transactions on Neural Networks and Learning Systems, vol. 35, pp. 234–245, 2024.
[58] M. Lee, S. Kim, D. Park, "Zero-shot gans: Generating images without extensive labeled data," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, pp. 567–578, 2024.
[59] T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehtinen, T. Aila, "Alias-free generative adversarial networks," in Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), 2021.
[60] T. Karras, T. Aila, S. Laine, J. Lehtinen, "Progressive growing of gans for improved quality, stability, and variation," arXiv preprint arXiv:1710.10196, 2017.
[61] J. Smith, J. Doe, A. Brown, "Efficientgan: Reducing the computational cost of gans while preserving image quality," Journal of Machine Learning Research, vol. 23, pp. 1234–1256, 2022.
[62] E. Johnson, R. Wang, M. Li, "High-resolution image synthesis with latent diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1204–1213.
[63] D. Torbunov, Y. Huang, H. Yu, J. Huang, S. Yoo, M. Lin, B. Viren, Y. Ren, "Uvcgan: Unet vision transformer cycle-consistent gan for unpaired image-to-image translation," in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2023, pp. 702–712.
[64] W. Harvey, S. Naderiparizi, F. Wood, "Conditional image generation by conditioning variational auto-encoders," arXiv preprint arXiv:2102.12037, 2022.
[65] A. Razavi, A. van den Oord, O. Vinyals, "Hierarchical variational autoencoders for high-resolution image synthesis," Nature, vol. 570, pp. 234–239, 2022.
[66] A. Vahdat, J. Kautz, "Nvae: A deep hierarchical variational autoencoder," arXiv preprint arXiv:2007.03898, 2022.
[67] J.-Y. Zhu, T. Park, A. A. Efros, "Stylevae: Variational autoencoders with style transfer for image synthesis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, pp. 2345–2356, 2023.
[68] H. Kim, A. Mnih, "Factorized hierarchical variational autoencoders for disentangled representation learning," Journal of Machine Learning Research, vol. 24, pp. 3456–3465, 2023.
[69] D. E. Diamantis, P. Gatoula, D. K. Iakovidis, "Endovae: Generating endoscopic images with a variational autoencoder," in 2022 IEEE 14th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), 2022, pp. 1–5, IEEE.
[70] R. Dos Santos, J. Aguilar, "A synthetic data generation system based on the variational-autoencoder technique and the linked data paradigm," Progress in Artificial Intelligence, pp. 1–15, 2024.
[71] S. An, J.-J. Jeon, "Distributional learning of variational autoencoder: Application to synthetic data generation," in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 57825–57851, Curran Associates, Inc.
[72] G. Parmar, K. Kumar Singh, R. Zhang, Y. Li, J. Lu, J.-Y. Zhu, "Zero-shot image-to-image translation," in ACM SIGGRAPH 2023 Conference Proceedings, 2023, pp. 1–11.
[73] A. Brock, J. Donahue, K. Simonyan, "Large scale gan training for high fidelity natural image synthesis," 2019. [Online]. Available: https://arxiv.org/abs/1809.11096.
[74] S. Sinitsa, O. Fried, "Deep image fingerprint: Towards low budget synthetic image detection and model lineage analysis," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 4067–4076.
[75] N. Poredi, D. Nagothu, Y. Chen, "Ausome: authenticating social media images using frequency analysis," in Disruptive Technologies in Information Sciences VII, vol. 12542, 2023, pp. 44–56, SPIE.
[76] Q. Bammey, "Synthbuster: Towards detection of diffusion model generated images," IEEE Open Journal of Signal Processing, vol. 5, pp. 1–9, 2023, doi: 10.1109/OJSP.2023.3337714.
[77] T. Alzantot, C. Shou, M. Farag, Z. J. Wang, S. Pandey, M. Esmaili, "Wavelet-packets for deepfake image analysis and detection," Machine Learning, vol. 111, no. 11, pp. 1–25, 2022, doi: 10.1007/s10994-022-06225-5.
[78] N. Zhong, Y. Xu, Z. Qian, X. Zhang, "Rich and poor texture contrast: A simple yet effective approach for ai-generated image detection," arXiv preprint arXiv:2311.12397, 2023.


[79] Z. Wang, J. Bao, W. Zhou, W. Wang, H. Hu, H. Chen, H. Li, "Dire for diffusion-generated image detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22445–22455.
[80] R. Ma, J. Duan, F. Kong, X. Shi, K. Xu, "Exposing the fake: Effective diffusion-generated images detection," arXiv preprint arXiv:2307.06272, 2023.
[81] J. Huertas-Tato, A. Martín, J. Fierrez, D. Camacho, "Fusing cnns and statistical indicators to improve image classification," Information Fusion, vol. 79, pp. 174–187, 2022.
[82] P. Lorenz, R. L. Durall, J. Keuper, "Detecting images generated by deep diffusion models using their local intrinsic dimensionality," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 448–459.
[83] L. Guarnera, O. Giudice, S. Battiato, "Level up the deepfake detection: a method to effectively discriminate images generated by gan architectures and diffusion models," arXiv preprint arXiv:2303.00608, 2023.
[84] D. A. Coccomini, A. Esuli, F. Falchi, C. Gennaro, G. Amato, "Detecting images generated by diffusers," PeerJ Computer Science, vol. 10, p. e2127, 2024.
[85] U. Ojha, Y. Li, Y. J. Lee, "Towards universal fake image detectors that generalize across generative models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 24480–24489.
[86] M. Mathys, M. Willi, R. Meier, "Synthetic photography detection: A visual guidance for identifying synthetic images created by ai," arXiv preprint arXiv:2408.06398, 2024.
[87] C. Tan, R. Tao, H. Liu, G. Gu, B. Wu, Y. Zhao, Y. Wei, "C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection," arXiv preprint arXiv:2408.09647, 2024.
[88] M. Keita, W. Hamidouche, H. B. Eutamene, A. Hadid, A. Taleb-Ahmed, "Bi-lora: A vision-language approach for synthetic image detection," Pattern Recognition, 2024. Preprint available at https://github.com/Mamadou-Keita/VLM-DETECT.
[89] J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, D. J. Fleet, "Video diffusion models," Advances in Neural Information Processing Systems, vol. 35, pp. 8633–8646, 2022.
[90] J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, et al., "Imagen video: High definition video generation with diffusion models," arXiv preprint arXiv:2210.02303, 2022.
[91] U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, et al., "Make-a-video: Text-to-video generation without text-video data," arXiv preprint arXiv:2209.14792, 2022.
[92] D. Zhou, W. Wang, H. Yan, W. Lv, Y. Zhu, J. Feng, "Magicvideo: Efficient video generation with latent diffusion models," arXiv preprint arXiv:2211.11018, 2022.
[93] L. Yu, J. Lezama, N. B. Gundavarapu, L. Versari, K. Sohn, D. Minnen, Y. Cheng, A. Gupta, X. Gu, A. G. Hauptmann, et al., "Language model beats diffusion – tokenizer is key to visual generation," arXiv preprint arXiv:2310.05737, 2023.
[94] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, M. Tagliasacchi, "Soundstream: An end-to-end neural audio codec," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021.
[95] J. Huertas-Tato, A. Martín, D. Camacho, "Understanding writing style in social media with a supervised contrastively pre-trained transformer," Knowledge-Based Systems, vol. 296, p. 111867, 2024.
[96] R. Girdhar, M. Singh, A. Brown, Q. Duval, S. Azadi, S. S. Rambhatla, A. Shah, X. Yin, D. Parikh, I. Misra, "Emu video: Factorizing text-to-video generation by explicit image conditioning," arXiv preprint arXiv:2311.10709, 2023.
[97] Y. Wang, X. Chen, X. Ma, S. Zhou, Z. Huang, Y. Wang, C. Yang, Y. He, J. Yu, P. Yang, et al., "Lavie: High-quality video generation with cascaded latent diffusion models," arXiv preprint arXiv:2309.15103, 2023.
[98] W. Menapace, A. Siarohin, I. Skorokhodov, E. Deyneka, T.-S. Chen, A. Kag, Y. Fang, A. Stoliar, E. Ricci, J. Ren, et al., "Snap video: Scaled spatiotemporal transformers for text-to-video synthesis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7038–7048.
[99] T. Karras, M. Aittala, T. Aila, S. Laine, "Elucidating the design space of diffusion-based generative models," Advances in neural information processing systems, vol. 35, pp. 26565–26577, 2022.
[100] T. Chen, L. Li, "Fit: Far-reaching interleaved transformers," arXiv preprint arXiv:2305.12689, 2023.
[101] H. Chen, Y. Zhang, X. Cun, M. Xia, X. Wang, C. Weng, Y. Shan, "Videocrafter2: Overcoming data limitations for high-quality video diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7310–7320.
[102] X. Ma, Y. Wang, G. Jia, X. Chen, Z. Liu, Y.-F. Li, C. Chen, Y. Qiao, "Latte: Latent diffusion transformer for video generation," arXiv preprint arXiv:2401.03048, 2024.
[103] X. Li, W. Chu, Y. Wu, W. Yuan, F. Liu, Q. Zhang, F. Li, H. Feng, E. Ding, J. Wang, "Videogen: A reference-guided latent diffusion approach for high definition text-to-video generation," arXiv preprint arXiv:2309.00398, 2023.
[104] C. Wu, L. Huang, Q. Zhang, B. Li, L. Ji, F. Yang, G. Sapiro, N. Duan, "Godiva: Generating open-domain videos from natural descriptions," arXiv preprint arXiv:2104.14806, 2021.
[105] A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, J. Sivic, "Howto100m: Learning a text-video embedding by watching hundred million narrated video clips," in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 2630–2640.
[106] W. Hong, M. Ding, W. Zheng, X. Liu, J. Tang, "Cogvideo: Large-scale pretraining for text-to-video generation via transformers," arXiv preprint arXiv:2205.15868, 2022.
[107] C. Wu, J. Liang, L. Ji, F. Yang, Y. Fang, D. Jiang, N. Duan, "Nüwa: Visual synthesis pre-training for neural visual world creation," in European conference on computer vision, 2022, pp. 720–736, Springer.
[108] C. Wu, J. Liang, X. Hu, Z. Gan, J. Wang, L. Wang, Z. Liu, Y. Fang, N. Duan, "Nuwa-infinity: Autoregressive over autoregressive generation for infinite visual synthesis," arXiv preprint arXiv:2207.09814, 2022.
[109] W. Yan, Y. Zhang, P. Abbeel, A. Srinivas, "Videogpt: Video generation using vq-vae and transformers," arXiv preprint arXiv:2104.10157, 2021.
[110] A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, K. Kreis, "Align your latents: High-resolution video synthesis with latent diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22563–22575.
[111] H. Chen, M. Xia, Y. He, Y. Zhang, X. Cun, S. Yang, J. Xing, Y. Liu, Q. Chen, X. Wang, et al., "Videocrafter1: Open diffusion models for high-quality video generation," arXiv preprint arXiv:2310.19512, 2023.
[112] Y. He, T. Yang, Y. Zhang, Y. Shan, Q. Chen, "Latent video diffusion models for high-fidelity long video generation," arXiv preprint arXiv:2211.13221, 2022.
[113] J. Wang, H. Yuan, D. Chen, Y. Zhang, X. Wang, S. Zhang, "Modelscope text-to-video technical report," arXiv preprint arXiv:2308.06571, 2023.
[114] A. Gupta, L. Yu, K. Sohn, X. Gu, M. Hahn, L. Fei-Fei, I. Essa, L. Jiang, J. Lezama, "Photorealistic video generation with diffusion models," arXiv preprint arXiv:2312.06662, 2023.
[115] R. Villegas, M. Babaeizadeh, P.-J. Kindermans, H. Moraldo, H. Zhang, M. T. Saffar, S. Castro, J. Kunze, D. Erhan, "Phenaki: Variable length video generation from open domain textual descriptions," in International Conference on Learning Representations, 2022.
[116] Z. Xing, Q. Dai, H. Hu, Z. Wu, Y.-G. Jiang, "Simda: Simple diffusion adapter for efficient video generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7827–7839.
[117] D. J. Zhang, J. Z. Wu, J.-W. Liu, R. Zhao, L. Ran, Y. Gu, D. Gao, M. Z. Shou, "Show-1: Marrying pixel and latent diffusion models for text-to-video generation," arXiv preprint arXiv:2309.15818, 2023.
[118] L. Khachatryan, A. Movsisyan, V. Tadevosyan, R. Henschel, Z. Wang, S. Navasardyan, H. Shi, "Text2video-zero: Text-to-image diffusion models are zero-shot video generators," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15954–15964.
[119] W. Weng, R. Feng, Y. Wang, Q. Dai, C. Wang, D. Yin, Z. Zhao, K. Qiu, J. Bao, Y. Yuan, et al., "Art-v: Auto-regressive text-to-video generation with diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7395–7405.
[120] F. Shi, J. Gu, H. Xu, S. Xu, W. Zhang, L. Wang, "Bivdiff: A training-free framework for general-purpose video synthesis via bridging image and video diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7393–7402.


[121] Z. Qing, S. Zhang, J. Wang, X. Wang, Y. Wei, Y. Zhang, C. Gao, N. Sang, "Hierarchical spatio-temporal decoupling for text-to-video generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 6635–6645.
[122] R. Wu, L. Chen, T. Yang, C. Guo, C. Li, X. Zhang, "Lamp: Learn a video motion pattern for few-shot-based video generation," arXiv preprint arXiv:2310.10769, 2023.
[123] X. Guo, M. Zheng, L. Hou, Y. Gao, Y. Deng, C. Ma, W. Hu, Z. Zha, H. Huang, P. Wan, et al., "I2v-adapter: A general image-to-video adapter for video diffusion models," arXiv preprint arXiv:2312.16693, 2023.
[124] D. J. Zhang, D. Li, H. Le, M. Z. Shou, C. Xiong, D. Sahoo, "Moonshot: Towards controllable video generation and editing with multimodal conditions," arXiv preprint arXiv:2401.01827, 2024.
[125] L. Gong, Y. Zhu, W. Li, X. Kang, B. Wang, T. Ge, B. Zheng, "Atomovideo: High fidelity image-to-video generation," arXiv preprint arXiv:2403.01800, 2024.
[126] X. Shi, Z. Huang, F.-Y. Wang, W. Bian, D. Li, Y. Zhang, M. Zhang, K. C. Cheung, S. See, H. Qin, et al., "Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling," in ACM SIGGRAPH 2024 Conference Papers, 2024, pp. 1–11.
[127] W. Ren, H. Yang, G. Zhang, C. Wei, X. Du, S. Huang, W. Chen, "Consisti2v: Enhancing visual consistency for image-to-video generation," arXiv preprint arXiv:2402.04324, 2024.
[128] C. Shen, Y. Gan, C. Chen, X. Zhu, L. Cheng, T. Gao, J. Wang, "Decouple content and motion for conditional image-to-video generation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, 2024, pp. 4757–4765.
[129] L. Hu, "Animate anyone: Consistent and controllable image-to-video synthesis for character animation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8153–8163.
[130] Z. Xu, J. Zhang, J. H. Liew, H. Yan, J.-W. Liu, C. Zhang, J. Feng, M. Z. Shou, "Magicanimate: Temporally consistent human image animation using diffusion model," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1481–1490.
[131] M. Dorkenwald, T. Milbich, A. Blattmann, R. Rombach, K. G. Derpanis, B. Ommer, "Stochastic image-to-video synthesis using cinns," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3742–3753.
[132] H. Ni, C. Shi, K. Li, S. X. Huang, M. R. Min, "Conditional image-to-video generation with latent flow diffusion models," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 18444–18455.
[133] C. Wang, J. Gu, P. Hu, S. Xu, H. Xu, X. Liang, "Dreamvideo: High-fidelity image-to-video generation with image retention and text guidance," arXiv preprint arXiv:2312.03018, 2023.
[134] S. Zhang, J. Wang, Y. Zhang, K. Zhao, H. Yuan, Z. Qin, X. Wang, D. Zhao, J. Zhou, "I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models," arXiv preprint arXiv:2311.04145, 2023.
[135] A. Blattmann, T. Milbich, M. Dorkenwald, B. Ommer, "Understanding object dynamics for interactive image-to-video synthesis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5171–5181.
[136] W. Menapace, S. Lathuiliere, S. Tulyakov, A. Siarohin, E. Ricci, "Playable video generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10061–10070.
[137] H. Wang, M. Huang, D. Wu, Y. Li, W. Zhang, "Supervised video-to-video synthesis for single human pose transfer," IEEE Access, vol. 9, pp. 17544–17556, 2021.
[138] L. Zhuo, G. Wang, S. Li, W. Wu, Z. Liu, "Fast-vid2vid: Spatial-temporal compression for video-to-video synthesis," in European Conference on Computer Vision, 2022, pp. 289–305, Springer.
[139] S. Yang, Y. Zhou, Z. Liu, C. C. Loy, "Rerender a video: Zero-shot text-
[141] C. Qi, X. Cun, Y. Zhang, C. Lei, X. Wang, Y. Shan, Q. Chen, "Fatezero: Fusing attentions for zero-shot text-based video editing," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15932–15942.
[142] E. Molad, E. Horwitz, D. Valevski, A. R. Acha, Y. Matias, Y. Pritch, Y. Leviathan, Y. Hoshen, "Dreamix: Video diffusion models are general video editors," arXiv preprint arXiv:2302.01329, 2023.
[143] Z. Hu, D. Xu, "Videocontrolnet: A motion-guided video-to-video translation framework by using diffusion model with controlnet," arXiv preprint arXiv:2307.14073, 2023.
[144] F. Liang, B. Wu, J. Wang, L. Yu, K. Li, Y. Zhao, I. Misra, J.-B. Huang, P. Zhang, P. Vajda, et al., "Flowvid: Taming imperfect optical flows for consistent video-to-video synthesis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8207–8216.
[145] B. Wu, C.-Y. Chuang, X. Wang, Y. Jia, K. Krishnakumar, T. Xiao, F. Liang, L. Yu, P. Vajda, "Fairy: Fast parallelized instruction-guided video-to-video synthesis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8261–8270.
[146] M. Ku, C. Wei, W. Ren, H. Yang, W. Chen, "Anyv2v: A plug-and-play framework for any video-to-video editing tasks," arXiv preprint arXiv:2403.14468, 2024.
[147] W. Ouyang, Y. Dong, L. Yang, J. Si, X. Pan, "I2vedit: First-frame-guided video editing via image-to-video diffusion models," arXiv preprint arXiv:2405.16537, 2024.
[148] H. Ouyang, Q. Wang, Y. Xiao, Q. Bai, J. Zhang, K. Zheng, X. Zhou, Q. Chen, Y. Shen, "Codef: Content deformation fields for temporally consistent video processing," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8089–8099.
[149] Y. Gu, Y. Zhou, B. Wu, L. Yu, J.-W. Liu, R. Zhao, J. Z. Wu, D. J. Zhang, M. Z. Shou, K. Tang, "Videoswap: Customized video subject swapping with interactive semantic point correspondence," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7621–7630.
[150] J. Bai, T. He, Y. Wang, J. Guo, H. Hu, Z. Liu, J. Bian, "Uniedit: A unified tuning-free framework for video motion and appearance editing," arXiv preprint arXiv:2402.13185, 2024.
[151] Y. Hu, C. Luo, Z. Chen, "Make it move: controllable image-to-video generation with text descriptions," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18219–18228.
[152] Y. Guo, C. Yang, A. Rao, Z. Liang, Y. Wang, Y. Qiao, M. Agrawala, D. Lin, B. Dai, "Animatediff: Animate your personalized text-to-image diffusion models without specific tuning," arXiv preprint arXiv:2307.04725, 2023.
[153] S. Yin, C. Wu, H. Yang, J. Wang, X. Wang, M. Ni, Z. Yang, L. Li, S. Liu, F. Yang, et al., "Nuwa-xl: Diffusion over diffusion for extremely long video generation," arXiv preprint arXiv:2303.12346, 2023.
[154] P. Esser, J. Chiu, P. Atighehchian, J. Granskog, A. Germanidis, "Structure and content-guided video synthesis with diffusion models," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7346–7356.
[155] S. Yin, C. Wu, J. Liang, J. Shi, H. Li, G. Ming, N. Duan, "Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory," arXiv preprint arXiv:2308.08089, 2023.
[156] X. Wang, H. Yuan, S. Zhang, D. Chen, J. Wang, Y. Zhang, Y. Shen, D. Zhao, J. Zhou, "Videocomposer: Compositional video synthesis with motion controllability," Advances in Neural Information Processing Systems, vol. 36, 2024.
[157] H. Ni, B. Egger, S. Lohit, A. Cherian, Y. Wang, T. Koike-Akino, S. X. Huang, T. K. Marks, "Ti2v-zero: Zero-shot image conditioning for text-to-video diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9015–9025.
[158] C. Nash, J. Carreira, J. Walker, I. Barr, A. Jaegle, M. Malinowski, P. Battaglia, "Transframer: Arbitrary frame prediction with generative models," arXiv preprint arXiv:2203.09494, 2022.
guided video-to-video translation,” in SIGGRAPH Asia 2023 Conference [159] D. S. Vahdati, T. D. Nguyen, A. Azizpour, M. C. Stamm, “Beyond deepfake
Papers, 2023, pp. 1–11. images: Detecting ai- generated videos,” in Proceedings of the IEEE/CVF
[140] W. Wang, Y. Jiang, K. Xie, Z. Liu, H. Chen, Y. Cao, X. Wang, C. Shen, Conference on Computer Vision and Pattern Recognition, 2024, pp. 4397–
“Zero-shot video editing using off-the-shelf image diffusion models,” 4408.
arXiv preprint arXiv:2303.17599, 2023. [160] P. He, L. Zhu, J. Li, S. Wang, H. Li, “Exposing ai- generated videos: A

- 205 -
International Journal of Interactive Multimedia and Artificial Intelligence, Vol. 9, Nº1

benchmark dataset and a local- and-global temporal defect based [184] Y. Hong, J. Zhang, “Wildfake: A large-scale challenging dataset for ai-
detection method,” arXiv preprint arXiv:2405.04133, 2024. generated images detection,” arXiv preprint arXiv:2402.11843, 2024.
[161] Z. Peng, L. Dong, H. Bao, Q. Ye, F. Wei, “Beit v2: Masked image modeling with [185] S. Changpinyo, P. Sharma, N. Ding, R. Soricut, “Conceptual 12m: Pushing
vector-quantized visual tokenizers,” arXiv preprint arXiv:2208.06366, 2022. web-scale image-text pre- training to recognize long-tail visual concepts,”
[162] H. Chen, Y. Hong, Z. Huang, Z. Xu, Z. Gu, Y. Li, J. Lan, H. Zhu, J. Zhang, in Proceedings of the IEEE/CVF conference on computer vision and pattern
W. Wang, et al., “Demamba: Ai- generated video detection on million- recognition, 2021, pp. 3558–3568.
scale genvideo benchmark,” arXiv preprint arXiv:2405.19707, 2024. [186] P. Sharma, N. Ding, S. Goodman, R. Soricut, “Conceptual captions:
[163] J. Bai, M. Lin, G. Cao, “Ai-generated video detection via spatio-temporal A cleaned, hypernymed, image alt-text dataset for automatic image
anomaly learning,” arXiv preprint arXiv:2403.16638, 2024. captioning,” in Proceedings of the 56th Annual Meeting of the Association for
[164] L. Ma, J. Zhang, H. Deng, N. Zhang, Y. Liao, H. Yu, “Decof: Generated video Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2556–2565.
detection via frame consistency,” arXiv preprint arXiv:2402.02085, 2024. [187] K. Srinivasan, K. Raman, J. Chen, M. Bendersky, M. Najork, “Wit:
[165] L. Ji, Y. Lin, Z. Huang, Y. Han, X. Xu, J. Wu, C. Wang, Z. Liu, “Distinguish Wikipedia-based image text dataset for multimodal multilingual machine
any fake videos: Unleashing the power of large-scale data and motion learning,” in Proceedings of the 44th international ACM SIGIR conference
features,” arXiv preprint arXiv:2405.15343, 2024. on research and development in information retrieval, 2021, pp. 2443–2449.
[166] H. Xu, J. Zhang, J. Cai, H. Rezatofighi, D. Tao, “Gmflow: Learning optical [188] K. Desai, G. Kaul, Z. Aysola, J. Johnson, “Redcaps: Web-curated
flow via global matching,” in Proceedings of the IEEE/CVF conference on image-text data created by the people, for the people,” arXiv preprint
computer vision and pattern recognition, 2022, pp. 8121–8130. arXiv:2111.11431, 2021.
[167] Q. Liu, P. Shi, Y.-Y. Tsai, C. Mao, J. Yang, “Turns out i’m not real: Towards [189] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M.
robust detection of ai-generated videos,” arXiv preprint arXiv:2406.09601, 2024. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al., “Laion-5b:
[168] J. Ricker, S. Damm, T. Holz, A. Fischer, “Towards the detection of diffusion An open large-scale dataset for training next generation image- text
model deepfakes,” arXiv preprint arXiv:2210.14571, 2022. models,” Advances in Neural Information Processing Systems, vol. 35, pp.
[169] H. Song, S. Huang, Y. Dong, W.-W. Tu, “Robustness and generalizability 25278–25294, 2022.
of deepfake detection: A study with diffusion models,” arXiv preprint [190] Z. J. Wang, E. Montoya, D. Munechika, H. Yang, Hoover, D. H. Chau,
arXiv:2309.02218, 2023. “Diffusiondb: A large-scale prompt gallery dataset for text-to-image
[170] L. Papa, L. Faiella, L. Corvitto, L. Maiano, I. Amerini, “On the use of stable generative models,” arXiv preprint arXiv:2210.14896, 2022.
diffusion for creating realistic faces: From generation to detection,” in [191] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, J. Xiao, “Lsun:
2023 11th International Workshop on Biometrics and Forensics (IWBF), Construction of a large-scale image dataset using deep learning with
2023, pp. 1–6, IEEE. humans in the loop,” arXiv preprint arXiv:1506.03365, 2015.
[171] Y. Wang, Z. Huang, X. Hong, “Benchmarking deepart detection,” arXiv [192] D.-T. Dang-Nguyen, C. Pasquini, V. Conotter, G. Boato, “Raise: A raw
preprint arXiv:2302.14475, 2023. images dataset for digital image forensics,” in Proceedings of the 6th ACM
multimedia systems conference, 2015, pp. 219–224.
[172] Z. Sha, Z. Li, N. Yu, Y. Zhang, “De-fake: Detection and attribution of fake
images generated by text-to-image generation models,” in Proceedings [193] Clip-interrogator, “Clip-interrogator,” 2022. Available: https://github.
of the 2023 ACM SIGSAC Conference on Computer and Communications com/pharmapsychotic/ clip-interrogator.
Security, 2023, pp. 3418–3432. [194] ALASKA, “Alaska.” https://alaska.utt.fr/. Accessed: 2024-08-04.
[173] Z. Xi, W. Huang, K. Wei, W. Luo, P. Zheng, “Ai- generated image [195] OpenAI, “Dall·e 2.” https://openai.com/product/dall-e-2. Accessed: 2024-
detection using a cross-attention enhanced dual-stream network,” in 2023 08-04.
Asia Pacific Signal and Information Processing Association Annual Summit [196] DreamStudio, “Dreamstudio.” https://beta. dreamstudio.ai/generate.
and Conference (APSIPA ASC), 2023, pp. 1463– 1470, IEEE. Accessed: 2024-08-04.
[174] M. A. Rahman, B. Paul, N. H. Sarker, Z. I. A. Hakim, S. A. Fattah, “Artifact: [197] A. Krizhevsky, G. Hinton, et al., “Learning multiple layers of features
A large-scale dataset with artificial and factual images for generalizable from tiny images.” https://www.cs.utoronto.ca/~kriz/learning-features-
and robust synthetic image detection,” in 2023 IEEE International 2009-TR.pdf, 2009.
Conference on Image Processing (ICIP), 2023, pp. 2200–2204, IEEE. [198] R. Zellers, X. Lu, J. Hessel, Y. Yu, J. S. Park, J. Cao, A. Farhadi, Y. Choi,
[175] S. Jia, M. Huang, Z. Zhou, Y. Ju, J. Cai, S. Lyu, “Autosplice: A text-prompt “Merlot: Multimodal neural script knowledge models,” Advances in
manipulated image dataset for media forensics,” in Proceedings of the neural information processing systems, vol. 34, pp. 23634– 23651, 2021.
IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, [199] M. Bain, A. Nagrani, G. Varol, A. Zisserman, “Frozen in time: A joint video
pp. 893–903. and image encoder for end- to-end retrieval,” in Proceedings of the IEEE/
[176] X. Guo, X. Liu, Z. Ren, S. Grosz, I. Masi, X. Liu, “Hierarchical fine-grained CVF international conference on computer vision, 2021, pp. 1728–1738.
image forgery detection and localization,” in Proceedings of the IEEE/CVF [200] H. Zhu, W. Wu, W. Zhu, L. Jiang, S. Tang, L. Zhang, Z. Liu, C. C. Loy,
Conference on Computer Vision and Pattern Recognition, 2023, pp. 3155–3165. “Celebv-hq: A large-scale video facial attributes dataset,” in European
[177] G. Zingarini, D. Cozzolino, R. Corvi, G. Poggi, L. Verdoliva, “M3dsynth: conference on computer vision, 2022, pp. 650–667, Springer.
A dataset of medical 3d images with ai-generated local manipulations,” [201] H. Xue, T. Hang, Y. Zeng, Y. Sun, B. Liu, H. Yang, J. Fu, B. Guo, “Advancing
arXiv preprint arXiv:2309.07973, 2023. high-resolution video-language representation with large-scale video
transcriptions,” in Proceedings of the IEEE/CVF Conference on Computer
[178] R. Shao, T. Wu, Z. Liu, “Detecting and grounding multi-modal media
Vision and Pattern Recognition, 2022, pp. 5036–5045.
manipulation,” in 2023 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2023, pp. 6904–6913. [202] Y. Wang, Y. He, Y. Li, K. Li, J. Yu, X. Ma, X. Li, G. Chen, X. Chen, Y.
Wang, et al., “Internvid: A large-scale video-text dataset for multimodal
[179] R. Amoroso, D. Morelli, M. Cornia, L. Baraldi, A. Del Bimbo, R. Cucchiara,
understanding and generation,” arXiv preprint arXiv:2307.06942, 2023.
“Parents and children: Distinguishing multimodal deepfakes from natural
images,” arXiv preprint arXiv:2304.00500, 2023. [203] J. Yu, H. Zhu, L. Jiang, C. C. Loy, W. Cai, W. Wu, “Celebv-text: A large-
scale facial text-video dataset,” in Proceedings of the IEEE/CVF Conference
[180] H. Cheng, Y. Guo, T. Wang, L. Nie, M. Kankanhalli, “Diffusion facial
on Computer Vision and Pattern Recognition, 2023, pp. 14805–14814.
forgery detection,” arXiv preprint arXiv:2401.15859, 2024.
[204] W. Wang, H. Yang, Z. Tuo, H. He, J. Zhu, J. Fu, J. Liu, “Videofactory: Swap
[181] J. J. Bird, A. Lotfi, “Cifake: Image classification and explainable
attention in spatiotemporal diffusions for text-to-video generation,”
identification of ai-generated synthetic images,” IEEE Access, vol. 12, pp.
https://openreview.net/forum?id=dUDwK38MVC, 2023.
15642–15650, 2024, doi: 10.1109/ACCESS.2024.3356122.
[205] H. Xu, Q. Ye, X. Wu, M. Yan, Y. Miao, J. Ye, G. Xu, A. Hu, Y. Shi, G. Xu, et al.,
[182] M. Zhu, H. Chen, Q. Yan, X. Huang, G. Lin, W. Li, Z. Tu, H. Hu, J. Hu, Y.
“Youku-mplug: A 10 million large-scale chinese video-language dataset
Wang, “Genimage: A million- scale benchmark for detecting ai-generated
for pre-training and benchmarks,” arXiv preprint arXiv:2306.04362, 2023.
image,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[206] W. Wang, Y. Yang, “Vidprom: A million-scale real prompt-gallery dataset
[183] Z. Lu, D. Huang, L. Bai, J. Qu, C. Wu, X. Liu, W. Ouyang, “Seeing is not
for text-to-video diffusion models,” arXiv preprint arXiv:2403.06098, 2024.
always believing: benchmarking human and model perception of ai-
[207] X. Ju, Y. Gao, Z. Zhang, Z. Yuan, X. Wang, A. Zeng, Y. Xiong, Q. Xu,
generated images,” Advances in Neural Information Processing Systems,
Y. Shan, “Miradata: A large-scale video dataset with long durations and
vol. 36, 2024.
structured captions,” arXiv preprint arXiv:2407.06358, 2024.

- 206 -
Regular Issue

Hessen Bougueffa
Hessen Bougueffa graduated with a Master’s degree in Telecommunication Systems in 2022. His Master’s thesis, “Deep Neural Networks Improve Radiologists’ Performance in Breast Cancer Screening,” laid a strong foundation in applying advanced computational techniques to real-world problems. He is currently a Ph.D. candidate at the Université Polytechnique Hauts-de-France, where he works on multimodal models for content characterization in collaboration with the Martini Project. His research lies at the crossroads of machine learning and content analysis, exploring how different data types can be combined for enhanced content understanding, and aims to contribute to multimodal learning approaches in artificial intelligence and data science.

Mamadou Keita
Mamadou Keita received his Engineer’s degree in Telecommunications and Computer Networks from the National Institute of Telecommunications and Information Technology in Oran, Algeria, and his Master’s degree in Engineering and Innovation in Images and Networks, with a specialization in Images, from Sorbonne Paris Nord University, France, in 2022. He is currently pursuing a Ph.D. degree in signal processing at the Institute of Electronics, Microelectronics and Nanotechnology, Université Polytechnique Hauts-de-France, Valenciennes, France. His research interests include image quality assessment, object detection and tracking, object segmentation, behavior analysis, medical imaging, and multimedia security.

Wassim Hamidouche
Wassim Hamidouche is a Principal Researcher at the Technology Innovation Institute (TII) in Abu Dhabi, UAE. He also holds the position of Associate Professor at INSA Rennes and is a member of the Institute of Electronics and Telecommunications of Rennes (IETR), UMR CNRS 6164. He earned his Ph.D. degree in signal and image processing from the University of Poitiers, France, in 2010. From 2011 to 2012, he worked as a Research Engineer at the Canon Research Centre in Rennes, France, and he served as a researcher at the IRT b<>com research institute in Rennes from 2017 to 2022. He has published over 180 papers in image processing and computer vision. His research interests include video coding, the design of software and hardware circuits and systems for video coding standards, image quality assessment, and multimedia security.

Abdelmalik Taleb-Ahmed
Abdelmalik Taleb-Ahmed received his PhD in electronics and microelectronics from the Université des Sciences et Technologies de Lille 1 in 1992. He was an Associate Professor in Calais until 2004, when he joined the Université Polytechnique Hauts-de-France, where he is currently a Full Professor and a member of the IEMN DOAE laboratory. His research focuses on computer vision, artificial intelligence, and machine vision, with interests in segmentation, classification, data fusion, pattern recognition, and machine learning, applied to biometrics, video surveillance, autonomous driving, and medical imaging. He has (co-)authored over 225 peer-reviewed papers and (co-)supervised 20 graduate students in these areas. His recent research revolves mainly around enhanced perception and HD mapping in intelligent transportation, digitalization of roads and signaling, e-health and artificial intelligence, pattern recognition, computer vision, and information fusion, with applications in affective computing, biometrics, medical image analysis, and video analytics and surveillance.

Helena Liz-López
Helena Liz-López is an Assistant Professor in the Department of Computer Systems Engineering at the Universidad Politécnica de Madrid (UPM) and a member of the Natural Language Processing and Deep Learning (NLP&DL) research group. She holds a degree in Biology from the Universidad Autónoma de Madrid and a master’s degree in bioinformatics and computational biology from the same university. She obtained her PhD in Computer Science from the Universidad Politécnica de Madrid in 2024 with the distinction of “cum laude”. Her research interests include deep learning, machine learning applications in ecology and medicine, and explainable AI.

Alejandro Martín
Alejandro Martín is an Associate Professor at the Universidad Politécnica de Madrid. His main research interests are deep learning, cybersecurity, and natural language processing. He has been a visiting researcher at the University of Kent and the University of Córdoba. He has also participated in numerous international conferences as a reviewer and organizer, served as a reviewer and Guest Editor for international journals, and taken part in a large number of research projects. He is the PI of several national and international projects focused on the application of AI to detect and track misinformation in social networks.

David Camacho
David Camacho received the Ph.D. degree (with Honors) in Computer Science from the Universidad Carlos III de Madrid in 2001. He is currently a Full Professor with the Computer Systems Engineering Department, Universidad Politécnica de Madrid (UPM), Madrid, Spain, and the Head of the Applied Intelligence and Data Analysis research group at UPM. He has authored or coauthored more than 300
journals, books, and conference papers. His research interests include machine
learning (clustering/deep learning), computational intelligence (evolutionary
computation, swarm intelligence), social network analysis, fake news and
disinformation analysis. He has participated in or led more than 60 research projects (Spanish and European: H2020, DG Justice, ISFP, and Erasmus+) related to the design and application of artificial intelligence methods for data mining and optimization in problems emerging in industrial scenarios, aeronautics, aerospace engineering, cybercrime/cyber intelligence, social network applications, and video games, among others.

Abdenour Hadid
Abdenour Hadid received his Doctor of Science in
Technology degree in electrical and information
engineering from the University of Oulu, Finland, in 2005.
He is currently a Professor holding a Chair of Excellence at the Sorbonne Center for Artificial Intelligence (SCAI). His research interests include computer vision, deep learning, artificial intelligence, the Internet of Things, autonomous driving, and personalized healthcare. He has authored more than 400 papers in international conferences and journals and has served as a reviewer for many international conferences and journals. His work is widely cited by the research community, with more than 25,000 citations and an H-index of 59,
according to Google Scholar. Prof. Hadid was the recipient of the prestigious
“Jan Koenderink Prize” for fundamental contributions in computer vision.
