A SEMINAR REPORT
On
New Age AI: Creating Video from Text / Sora - OpenAI
Submitted to
MALLA REDDY ENGINEERING COLLEGE
In partial fulfillment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
In
Computer Science and Engineering (AIML)
By
Madhav Sai Tirukovela
Regd. No: 20J41A6657
Under the Guidance of
Mr. K. Dileep Reddy
(Assistant Professor, CSE AIML)
Computer Science and Engineering (AIML)
MALLA REDDY ENGINEERING COLLEGE
(An UGC Autonomous Institution, Approved by AICTE, New Delhi & Affiliated to JNTUH,
Hyderabad).
Maisammaguda(H), Medchal - Malkajgiri District, Secunderabad, Telangana State – 500100,
www.mrec.ac.in
MARCH - 2024
MALLA REDDY ENGINEERING COLLEGE
(An UGC Autonomous Institution, Approved by AICTE, New Delhi & Affiliated to JNTUH, Hyderabad).
Maisammaguda(H), Medchal - Malkajgiri District, Secunderabad, Telangana State – 500100,
www.mrec.ac.in
COMPUTER SCIENCE AND ENGINEERING (AIML)
CERTIFICATE
Certified that the seminar work entitled “New Age AI: Creating Video from Text /
Sora - OpenAI” is a bonafide work carried out in the 8th semester by Madhav
Sai Tirukovela in partial fulfilment for the award of Bachelor of Technology in
Computer Science and Engineering (Artificial Intelligence & Machine
Learning) from Malla Reddy Engineering College, during the academic year
2023 – 2024. I wish him success in all future endeavors.
Seminar Coordinator: Mr. K. Dileep Reddy, Assistant Professor
Internal Examiner:
Head of Department: Dr. U. Mohan Srinivas, Professor & HOD
Place:
Date:
ACKNOWLEDGEMENT
We express our sincere thanks to our Principal, Dr. A. Ramaswami Reddy, who took
keen interest and encouraged us in every effort during the research work.
We express our heartfelt thanks to Dr. U. Mohan Srinivas, Professor and HOD,
Department of Computer Science and Engineering (AIML), for his kind attention and
valuable guidance throughout the research work.
We are thankful to our Seminar Coordinator, Mr. K. Dileep Reddy, Assistant
Professor, Department of Computer Science and Engineering (AIML), for his
cooperation during the research work.
We also thank all the teaching and non-teaching staff of the Department for their
cooperation during the project work.
Madhav Sai Tirukovela
CSE AIML
Regd.No.- 20J41A6657
ABSTRACT
The field of artificial intelligence (AI) has witnessed remarkable advancements,
particularly in the realm of natural language processing (NLP) and computer vision.
One of the latest innovations in this domain is the ability of AI systems to create video
content directly from textual input. This groundbreaking capability not only
streamlines the video production process but also opens up new avenues for creativity
and storytelling.
In this paper, we delve into the exciting world of AI-driven video creation from text.
We explore the underlying technologies that make this feat possible, including deep
learning algorithms, neural networks, and multimodal architectures. By analyzing
recent developments and state-of-the-art approaches, we provide insights into the
challenges and opportunities associated with this emerging technology.
Furthermore, we discuss the potential applications and implications of AI-generated
video content across various industries. From personalized marketing videos to
educational tutorials and entertainment media, the ability to translate text into engaging
visuals has transformative potential. We also examine the ethical considerations and
societal impact of AI-driven video creation, highlighting the importance of responsible
AI deployment and algorithmic transparency.
In conclusion, the convergence of AI, NLP, and computer vision is revolutionizing the
way we produce and consume video content. As we navigate this new age of AI-
powered creativity, understanding the capabilities and limitations of text-to-video
technologies is crucial for harnessing their full potential while ensuring ethical and
inclusive practices.
Signature of the Student
Name : Madhav Sai Tirukovela
Regn. No : 20J41A6657
Semester : 8th
Branch : CSE AIML
Date :
TABLE OF CONTENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
1 INTRODUCTION
1.1 Sora (Text-to-Video Model)
1.2 History
2 OBJECTIVES
2.1 Safety
2.2 Research Techniques
2.3 Applications
3 METHODOLOGY
3.1 Inspiration from Large Language Models
3.2 Training
3.2.1 Video Compression Network
3.2.2 Space Time Latent Patches
3.2.3 Scaling Transformers for Video Generation
3.3 Training Approach
3.3.1 Variable Durations, Resolutions, Aspect Ratios
3.3.2 Sampling Flexibility
3.3.3 Improved Framing & Composition
3.3.4 Language Understanding
4 RESULTS AND DISCUSSIONS
4.1 Prompting with Images and Videos
4.2 Image Generation Capabilities
4.3 Emerging Simulation Capabilities
4.3.1 Three-Dimensional Consistency
4.3.2 Long-Range Coherence and Object Permanence
4.3.3 Interacting with the World
4.3.4 Simulating Digital Worlds
5 CONCLUSION
6 FUTURE SCOPE
REFERENCES
LIST OF FIGURES
Fig 1.1  An image generated by Sora given the user prompt
Fig 3.1  Dimensionality Reduction
Fig 3.2  Arranging randomly initialized patches in an appropriately sized grid
Fig 4.1  Input video provided for Video Editing
Fig 4.2  Transformation of styles and environments
Fig 4.3  Image generated by Sora at 2048x2048 resolution
LIST OF TABLES
Table 2.1  Challenges and Ethical Considerations
Table 2.2  Applications of AI-Generated Video Content
CHAPTER 1
INTRODUCTION
In recent years, artificial intelligence (AI) has undergone a profound
transformation, ushering in a new era of innovation and automation across diverse
industries. One of the most exciting developments in this field is the ability of AI
systems to create video content directly from textual input. This groundbreaking
capability represents a convergence of natural language processing (NLP) and
computer vision, promising to revolutionize the way we produce and consume
visual media.
Traditionally, the creation of video content has been a labor-intensive and time-
consuming process, requiring skilled professionals in videography, editing, and
production. However, advancements in deep learning algorithms and neural
networks have enabled AI systems to understand and interpret textual descriptions,
converting them into rich and dynamic visual sequences.
The concept of generating video from text opens up a multitude of possibilities
across various domains, including marketing, education, entertainment, and
communication. Imagine being able to generate personalized video advertisements
based on customer preferences, or transforming written scripts into immersive
educational videos with interactive visuals.
Moreover, AI-driven video creation has the potential to democratize content
production, allowing individuals and organizations with limited resources to
access professional-quality video content generation tools. This democratization of
visual storytelling not only fosters creativity but also promotes inclusivity by
amplifying diverse voices and narratives.
In this paper, we delve into the fascinating world of AI-powered video creation
from text. We explore the underlying technologies, applications, challenges, and
ethical considerations associated with this emerging paradigm. By examining
recent advancements and real-world use cases, we aim to provide a comprehensive
overview of the transformative potential of "New Age AI: Creating Video from
Text."
1.1 SORA (TEXT-TO-VIDEO MODEL)
Sora is a generative artificial intelligence model developed by OpenAI that
specializes in text-to-video generation. The model accepts textual descriptions, known
as prompts, from users and generates short video clips corresponding to those
descriptions. Prompts can specify artistic styles, fantastical imagery, or real-world
scenarios. When creating real-world scenarios, user input may be required to ensure
factual accuracy; otherwise, features may be added erroneously. Sora is praised for its
ability to produce videos with high levels of visual detail, including intricate camera
movements and characters that exhibit a range of emotions. Furthermore, the model
possesses the functionality to extend existing short videos by generating new content
that seamlessly precedes or follows the original clip. As of March 2024, it is
unreleased and not yet available to the public.
1.2 HISTORY OF SORA / OpenAI
Several other text-to-video generating models had been created prior to Sora,
including Meta's Make-A-Video, Runway's Gen-2, and Google's Lumiere, the last of
which, as of February 2024, is also still in its research phase. OpenAI, the company
behind Sora, had released DALL·E 3, the third of its DALL·E text-to-image models,
in September 2023.
The team that developed Sora named it after the Japanese word for sky to signify its
"limitless creative potential". On February 15, 2024, OpenAI first previewed Sora by
releasing multiple clips of high-definition videos that it created, including an SUV
driving down a mountain road, an animation of a "short fluffy monster" next to a
candle, two people walking through Tokyo in the snow, and fake historical footage of
the California gold rush, and stated that it was able to generate videos up to one
minute long. The company then shared a technical report, which highlighted the
methods used to train the model. OpenAI CEO Sam Altman also posted a series of
tweets, responding to Twitter users' prompts with Sora-generated videos of the
prompts.
OpenAI has stated that it plans to make Sora available to the public but that it would
not be soon; it has not specified when. The company provided limited access to a
small "red team", including experts in misinformation and bias, to perform adversarial
testing on the model. The company also shared Sora with a small group of creative
professionals, including video makers and artists, to seek feedback on its usefulness in
creative fields.
Fig 1.1 An image generated by Sora given the user prompt.
CHAPTER 2
OBJECTIVES
Sora, OpenAI’s large video generation model, can produce up to a full minute of
high-quality video. The results of this work indicate that scaling video generation
models is a promising path toward building versatile simulators of the real world.
Sora is a flexible model for visual data: it can create videos and images of varying
durations, resolutions, and aspect ratios, up to a full minute of high-definition video.
Today, Sora is becoming available to red teamers to assess critical areas for harms
or risks. We are also granting access to a number of visual artists, designers, and
filmmakers to gain feedback on how to advance the model to be most helpful for
creative professionals.
Sora is able to generate complex scenes with multiple characters, specific types of
motion, and accurate details of the subject and background. The model
understands not only what the user has asked for in the prompt, but also how those
things exist in the physical world.
The model has a deep understanding of language, enabling it to accurately
interpret prompts and generate compelling characters that express vibrant
emotions. Sora can also create multiple shots within a single generated video that
accurately persist characters and visual style.
The current model has weaknesses. It may struggle with accurately simulating the
physics of a complex scene, and may not understand specific instances of cause
and effect. For example, a person might take a bite out of a cookie, but afterward,
the cookie may not have a bite mark.
The model may also confuse spatial details of a prompt, for example, mixing up
left and right, and may struggle with precise descriptions of events that take place
over time, like following a specific camera trajectory.
Understand the Technology: Explore the underlying technologies such as deep
learning algorithms, neural networks, and multimodal architectures that enable AI
systems to create video content from textual input.
Examine Real-World Applications: Analyze and showcase the diverse
applications of AI-generated video content across industries such as marketing,
education, entertainment, and communication, highlighting specific use cases and
success stories.
Assess Advantages and Limitations: Evaluate the advantages and limitations of
AI-driven video creation compared to traditional methods, including aspects such
as efficiency, cost-effectiveness, scalability, and quality of output.
Discuss Ethical and Societal Implications: Discuss the ethical considerations
and societal impact of AI-generated video content, addressing issues such as
algorithmic bias, data privacy, intellectual property rights, and the role of
responsible AI deployment.
Explore Future Trends and Developments: Predict and discuss potential future
trends, advancements, and innovations in the field of text-to-video AI
technologies, including potential improvements in accuracy, realism, and user
customization options.
Promote Awareness and Education: Raise awareness and educate stakeholders,
including industry professionals, researchers, policymakers, and the general
public, about the capabilities, opportunities, and challenges associated with AI-
powered video creation.
Encourage Collaboration and Innovation: Foster collaboration between AI
researchers, content creators, technology developers, and end-users to drive
innovation, co-create new solutions, and unlock the full creative potential of AI-
driven video content creation.
Provide Practical Guidance: Offer practical guidance, best practices, and
recommendations for organizations and individuals looking to integrate AI-
powered video creation tools into their workflows, ensuring effective
implementation, user engagement, and ethical standards compliance.
2.1 SAFETY
We’ll be taking several important safety steps ahead of making Sora available in
OpenAI’s products. We are working with red teamers — domain experts in areas
like misinformation, hateful content, and bias — who will be adversarially testing
the model.
We’re also building tools to help detect misleading content such as a detection
classifier that can tell when a video was generated by Sora. We plan to include
C2PA metadata in the future if we deploy the model in an OpenAI product.
In addition to us developing new techniques to prepare for deployment, we’re
leveraging the existing safety methods that we built for our products that use
DALL·E 3, which are applicable to Sora as well.
For example, once in an OpenAI product, our text classifier will check and reject
text input prompts that are in violation of our usage policies, like those that request
extreme violence, sexual content, hateful imagery, celebrity likeness, or the IP of
others. We’ve also developed robust image classifiers that are used to review the
frames of every video generated to help ensure that it adheres to our usage policies,
before it’s shown to the user.
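To make the pre- and post-generation checks described above concrete, the following Python sketch shows a toy moderation pipeline: a text classifier gates the prompt, and a frame-level check reviews the output before it is returned. The blocklist and placeholder checks here are illustrative assumptions, not OpenAI’s actual classifiers.

```python
from dataclasses import dataclass
from typing import List

# Minimal sketch of the pre- and post-generation filtering described above.
# The blocklist classifier and the frame check are placeholder stand-ins,
# not OpenAI's actual classifiers.

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

BLOCKED_TERMS = {"extreme violence", "hateful imagery"}  # illustrative only

def classify_prompt(prompt: str) -> Verdict:
    """Toy text classifier: reject prompts containing blocked terms."""
    for term in BLOCKED_TERMS:
        if term in prompt.lower():
            return Verdict(False, f"contains blocked term: {term!r}")
    return Verdict(True)

def classify_frame(frame) -> Verdict:
    """Placeholder frame check; a real system would run an image classifier."""
    return Verdict(True)

def moderate_and_generate(prompt: str, generate_video) -> List:
    verdict = classify_prompt(prompt)
    if not verdict.allowed:
        raise ValueError(f"Prompt rejected: {verdict.reason}")
    frames = generate_video(prompt)                 # caller-supplied generator
    for i, frame in enumerate(frames):
        if not classify_frame(frame).allowed:
            raise ValueError(f"Frame {i} withheld: violates usage policies")
    return frames
```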
We’ll be engaging policymakers, educators and artists around the world to
understand their concerns and to identify positive use cases for this new
technology. Despite extensive research and testing, we cannot predict all of the
beneficial ways people will use our technology, nor all the ways people will abuse
it. That’s why we believe that learning from real-world use is a critical component
of creating and releasing increasingly safe AI systems over time.
CHALLENGE / ETHICAL ISSUE    DESCRIPTION
Algorithmic Bias             Risk of biases in AI models affecting content generation
Privacy Concerns             Protection of user data and sensitive information
Intellectual Property        Copyright and ownership issues related to AI-generated content
Transparency                 Ensuring transparency in AI decision-making processes
Fairness                     Addressing fairness and inclusivity in content generation
Accountability               Establishing accountability for AI-generated content

Table 2.1 Challenges and Ethical Considerations
2.2 RESEARCH TECHNIQUES
Sora is a diffusion model, which generates a video by starting off with one that
looks like static noise and gradually transforms it by removing the noise over
many steps.
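This iterative denoising idea can be illustrated with a short sketch. Below is a minimal DDPM-style sampling loop in Python, assuming a trained denoiser network; the noise schedule, step count, and latent shape are toy values, not Sora’s.

```python
import torch

# Illustrative DDPM-style sampling loop: start from pure noise and repeatedly
# remove predicted noise. The schedule and shapes are toy values, not Sora's.

def sample(denoiser, shape=(1, 16, 4, 32, 32), steps=50, device="cpu"):
    # shape: (batch, frames, channels, height, width) in latent space
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)             # the "static noise" video
    for t in reversed(range(steps)):
        eps = denoiser(x, t)                          # predicted noise at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])  # remove a bit of noise
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                          # denoised latent video
```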
Sora is capable of generating entire videos all at once or extending generated
videos to make them longer. By giving the model foresight of many frames at a
time, we’ve solved a challenging problem of making sure a subject stays the same
even when it goes out of view temporarily.
Similar to GPT models, Sora uses a transformer architecture, unlocking superior
scaling performance.
We represent videos and images as collections of smaller units of data called
patches, each of which is akin to a token in GPT. By unifying how we represent
data, we can train diffusion transformers on a wider range of visual data than was
possible before, spanning different durations, resolutions and aspect ratios.
Sora builds on past research in DALL·E and GPT models. It uses the recaptioning
technique from DALL·E 3, which involves generating highly descriptive captions
for the visual training data. As a result, the model is able to follow the user’s text
instructions in the generated video more faithfully.
In addition to being able to generate a video solely from text instructions, the
model is able to take an existing still image and generate a video from it,
animating the image’s contents with accuracy and attention to small detail. The
model can also take an existing video and extend it or fill in missing frames. Learn
more in our technical report.
Sora serves as a foundation for models that can understand and simulate the real
world, a capability we believe will be an important milestone for achieving AGI.
2.3 APPLICATIONS
Marketing and Advertising:
Personalized Video Ads: AI can generate personalized video advertisements
based on user preferences, browsing history, and demographic data, enhancing
engagement and conversion rates.
Product Demonstrations: Textual descriptions of products or services can be
transformed into dynamic video demonstrations, showcasing features, benefits,
and usage scenarios.
Education and Training:
Interactive Tutorials: Text-based tutorials can be converted into interactive video
lessons with simulations, quizzes, and feedback mechanisms, fostering active
learning and knowledge retention.
Virtual Classrooms: AI-generated video content can simulate classroom
environments, lectures, and educational modules, enabling distance learning and
remote education initiatives.
Entertainment Industry:
AI-Generated Movies and Shows: Entire movies or series can be conceptualized
and visualized based on textual scripts, characters, settings, and plotlines,
offering new avenues for creative storytelling.
Virtual Reality Experiences: Textual descriptions can be transformed into
immersive VR experiences, interactive narratives, and virtual worlds, enhancing
user immersion and entertainment value.
Healthcare and Medical Education:
Medical Simulations: Text-to-video AI can create medical simulations, surgical
procedures, patient case studies, and anatomy tutorials for healthcare
professionals and students.
Patient Education Videos: Complex medical information can be simplified and
visualized through AI-generated video content, aiding patient education,
compliance, and understanding.
Journalism and Media:
Automated News Reports: AI can generate news reports, summaries, and data
visualizations from textual news articles, enhancing newsroom efficiency and
multimedia storytelling.
Data Visualization: Textual data can be transformed into informative and
engaging video infographics, charts, and graphs, aiding in data-driven
storytelling.
Gaming and Interactive Content:
Dynamic Storytelling: Text-based narratives in games can be dynamically
converted into video sequences, enhancing storytelling, character development,
and player immersion.
Interactive Game Elements: AI-generated video content can create interactive
game elements, cutscenes, and visual effects, enriching gameplay experiences.
Virtual Events and Conferences:
Virtual Conferences: AI can create virtual conference environments, keynote
presentations, and interactive sessions based on textual agendas and event
descriptions.
Digital Exhibitions: Textual descriptions of products, services, or artworks can
be transformed into virtual exhibitions, tours, and showcases, enabling online
participation and engagement.
INDUSTRY         APPLICATION
Marketing        Personalized video ads, product demonstrations
Education        Interactive tutorials, virtual classrooms
Entertainment    AI-generated movies, virtual reality experiences
Healthcare       Medical simulations, patient education videos
Journalism       Automated news reports, data visualization
Gaming           Dynamic storytelling, interactive game elements
Virtual Events   Virtual conferences, digital exhibitions

Table 2.2 Applications of AI-Generated Video Content
These applications demonstrate the diverse and transformative impact of AI-
driven video creation from text across industries, highlighting opportunities for
innovation, personalization, and enhanced user experiences.
CHAPTER 3
METHODOLOGY
3.1 INSPIRATION FROM LARGE LANGUAGE MODELS (LLMs)
Source of Inspiration: The approach is inspired by large language models that
achieve generalist capabilities through training on vast amounts of internet-scale
data.
LLM Paradigm: The success of large language models is enabled in part by the use
of tokens, which serve as a unified representation for diverse modalities of text,
including code, math, and various natural languages.
Fig 3.1 Dimensionality Reduction
3.2 TRAINING
The training of Sora involves video compression, extraction of spacetime latent
patches, and scaling transformers for video generation. Let’s break down each
part:
3.2.1 Video Compression Network
Input: Raw video footage.
Objective: Reduce the dimensionality of visual data in videos.
Output: A latent representation that is compressed both temporally (across time)
and spatially (across space).
Training: This network is trained on raw videos to generate a compressed latent
space. This latent space retains essential visual information while reducing
overall complexity.
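As an illustration of what such a compression network might look like, the sketch below uses a small 3D-convolutional autoencoder that downsamples a clip in both time and space. The architecture and layer sizes are assumptions for illustration only; Sora’s actual network is not public.

```python
import torch
import torch.nn as nn

# Toy video compression network: 3D convolutions downsample a raw clip in both
# time and space into a compact latent, and a decoder maps latents back to pixels.
# Layer sizes are illustrative; Sora's actual architecture is not public.

class VideoAutoencoder(nn.Module):
    def __init__(self, latent_channels=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(1, 2, 2), padding=1),               # halve H, W
            nn.SiLU(),
            nn.Conv3d(32, latent_channels, kernel_size=3, stride=(2, 2, 2), padding=1), # halve T, H, W
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 32, kernel_size=4, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(32, 3, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=1),
        )

    def forward(self, video):          # video: (batch, 3, T, H, W)
        latent = self.encoder(video)   # compressed in time and space
        return self.decoder(latent), latent

# Example: a 16-frame 64x64 clip compresses to an 8-frame 16x16 latent grid.
model = VideoAutoencoder()
recon, latent = model(torch.randn(1, 3, 16, 64, 64))
print(latent.shape)   # torch.Size([1, 8, 8, 16, 16])
```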
3.2.2 Space Time Latent Patches
Objective: Extracting meaningful patches from a compressed input video to act
as transformer tokens.
Process: From the compressed video, spacetime patches (considering both
spatial and temporal dimensions) are extracted.
Applicability: This scheme works not only for videos but also for images, since
images are treated as videos with a single frame.
Benefits: The patch-based representation allows Sora to be trained on videos and
images with varying resolutions, durations, and aspect ratios.
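A minimal sketch of this patch extraction step is shown below, assuming a fixed spacetime patch size; note how an image is handled as the single-frame case.

```python
import torch

# Cut a latent video into spacetime patches and flatten each into a token.
# Patch sizes are illustrative; the text above only describes the idea.

def to_spacetime_patches(latent, pt=2, ph=4, pw=4):
    # latent: (channels, T, H, W); pt/ph/pw: patch extent in time, height, width
    c, t, h, w = latent.shape
    assert t % pt == 0 and h % ph == 0 and w % pw == 0
    patches = latent.reshape(c, t // pt, pt, h // ph, ph, w // pw, pw)
    patches = patches.permute(1, 3, 5, 0, 2, 4, 6)        # (nT, nH, nW, c, pt, ph, pw)
    tokens = patches.reshape(-1, c * pt * ph * pw)        # one flat token per patch
    return tokens

latent_video = torch.randn(8, 8, 16, 16)     # e.g. output of the compression network
latent_image = torch.randn(8, 1, 16, 16)     # an image is a one-frame video
print(to_spacetime_patches(latent_video).shape)          # torch.Size([64, 256])
print(to_spacetime_patches(latent_image, pt=1).shape)    # torch.Size([16, 128])
```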
3.2.3 Scaling Transformers for Video Generation
Model Type: Sora is described as a diffusion model and a diffusion transformer.
Training Objective: Sora is trained to predict the original “clean” patches given
input noisy patches and conditioning information (such as text prompts).
Scaling Properties: Transformers, including diffusion transformers, have
demonstrated effective scaling across various domains, such as language
modeling, computer vision, and image generation. This scalability is crucial for
handling diverse data types and complexities.
Inference Control: During inference (generation), the size of generated videos
can be controlled by arranging randomly-initialized patches in an appropriately-
sized grid.
Fig 3.2 Arranging randomly initialized patches in appropriately sized grid.
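The training objective described above can be sketched as a single optimization step: noise the clean patch tokens, then ask the transformer to recover them given the noise level and the text conditioning. The schedule and shapes below are illustrative assumptions, not Sora’s actual hyperparameters.

```python
import torch
import torch.nn.functional as F

# One illustrative training step for a diffusion transformer: noise the clean
# patch tokens, then train the model to recover them given the noise level and
# a text-conditioning embedding. The noise schedule is a toy linear one.

def training_step(model, clean_patches, text_embedding, optimizer, steps=1000):
    # clean_patches: (batch, num_tokens, token_dim); text_embedding: (batch, cond_dim)
    batch = clean_patches.shape[0]
    t = torch.randint(0, steps, (batch,), device=clean_patches.device)
    alpha_bar = 1.0 - t.float() / steps                    # toy schedule in (0, 1]
    alpha_bar = alpha_bar.view(batch, 1, 1)

    noise = torch.randn_like(clean_patches)
    noisy = alpha_bar.sqrt() * clean_patches + (1 - alpha_bar).sqrt() * noise

    pred_clean = model(noisy, t, text_embedding)           # transformer denoiser
    loss = F.mse_loss(pred_clean, clean_patches)           # recover the "clean" patches

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

During inference, the output size is then set simply by choosing the shape of the initial noise grid, as in the sampling sketch shown earlier.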
In summary, Sora integrates a video compression network to create a
compressed latent space, utilizes spacetime latent patches as transformer tokens
for both videos and images, and employs a diffusion transformer for video
generation with scalability across different domains. The model is trained to
handle noisy input patches and predict the original “clean” patches, and it allows
control over the size of generated videos during inference.
3.3 TRAINING APPROACH
There are several aspects of the Sora model’s training approach for image and
video generation, emphasizing the advantages of training on data at its native
size. Here’s an explanation:
3.3.1 Variable Durations, Resolutions, Aspect Ratios
Past Approaches: Traditional methods for image and video generation often
involve resizing, cropping, or trimming videos to a standard size (e.g., 4-second
videos at 256x256 resolution).
Native Size Training Benefits: The Sora model opts to train on data at its native
size, avoiding the standardization of duration, resolution, or aspect ratio.
3.3.2 Sampling Flexibility
Wide Range of Sizes: Sora is designed to sample videos with various sizes,
including widescreen 1920x1080 and vertical 1080x1920, offering flexibility
for creating content for different devices directly at their native aspect ratios.
Prototyping at Lower Sizes: This flexibility allows for quick content
prototyping at lower sizes before generating at full resolution, all using the same
model.
3.3.3 Improved Framing & Composition
Empirical Observation: Training on videos at their native aspect ratios is
empirically found to improve composition and framing.
Comparison to Common Practice: Comparisons with a model that crops all training
videos to be square (a common practice in generative model training) show that
the Sora model tends to have improved framing, avoiding issues where the subject
is only partially in view.
3.3.4 Language Understanding
Text-to-Video Generation Training: Training text-to-video generation
systems requires a large dataset of videos with corresponding text captions.
Re-Captioning Technique: The re-captioning technique from DALL·E 3 is
applied, involving training a highly descriptive captioner model and using it to
produce text captions for all videos in the training set.
Improvements in Fidelity: Training on highly descriptive video captions is
found to improve text fidelity and overall video quality.
GPT for User Prompts: Leveraging GPT, short user prompts are turned into
longer detailed captions, which are then sent to the video model. This enables
Sora to generate high-quality videos that accurately follow user prompts.
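As an illustration of this prompt-expansion step, the sketch below uses the OpenAI Python SDK’s chat completions interface to rewrite a short prompt into a detailed caption. The system instruction and model name are assumptions for illustration, not the actual pipeline used for Sora.

```python
from openai import OpenAI

# Illustrative prompt-expansion step: a GPT model rewrites a short user prompt
# into a detailed caption before it is sent to the video model. The system
# instruction and model name here are assumptions, not OpenAI's actual setup.

client = OpenAI()

def expand_prompt(short_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Rewrite the user's prompt as a long, highly descriptive "
                        "video caption: scene, subjects, lighting, camera motion."},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

detailed_caption = expand_prompt("a corgi surfing at sunset")
# detailed_caption would then be passed to the text-to-video model in place of
# the original short prompt.
```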
In summary, the Sora model’s approach involves training on data at its native
size, providing flexibility in sampling videos of varying sizes, improving
framing and composition, and incorporating language understanding techniques
for generating videos based on descriptive captions and user prompts.
CHAPTER 4
RESULTS AND DISCUSSIONS
4.1 PROMPTING WITH IMAGES & VIDEOS
Sora can also be prompted with other inputs, such as pre-existing images or video. This
capability enables Sora to perform a wide range of image and video editing tasks—creating
perfectly looping video, animating static images, extending videos forwards or backwards in
time, etc.
Animating DALL·E Images
Sora is capable of generating videos provided an image and prompt as input.
Extending Generated Videos
Sora is also capable of extending videos, either forward or backward in time. Below are three
videos that were all extended backward in time starting from a segment of a generated video.
We can use this method to extend a video both forward and backward to produce a seamless
infinite loop.
Video - to - Video Editing
Diffusion models have enabled a plethora of methods for editing images and videos from text
prompts. Below we apply one of these methods, SDEdit,32 to Sora. This technique enables Sora
to transform the styles and environments of input videos zero-shot.
Fig 4.1 Input video provided for Video Editing
Fig 4.2 Transformation of styles and environments
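The SDEdit-style editing shown in Fig 4.1 and Fig 4.2 can be pictured as follows: partially noise the latent of the input video, then denoise it conditioned on the new prompt, so that overall structure is preserved while style and environment change. The sketch below is an illustration using toy components like those in the earlier sketches, not Sora’s internal implementation.

```python
import torch

# SDEdit-style zero-shot video editing (illustrative): noise the source video's
# latent up to an intermediate step, then run the usual denoising loop from that
# step conditioned on the new prompt. Uses toy components, not Sora internals.

def sdedit_video(encoder, denoiser, source_video, new_prompt_embedding,
                 strength=0.6, steps=50):
    latent = encoder(source_video)                       # compress the source video
    betas = torch.linspace(1e-4, 0.02, steps, device=latent.device)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)

    t_start = int(strength * (steps - 1))                # how much structure to discard
    noise = torch.randn_like(latent)
    x = alpha_bars[t_start].sqrt() * latent + (1 - alpha_bars[t_start]).sqrt() * noise

    for t in reversed(range(t_start + 1)):
        eps = denoiser(x, t, new_prompt_embedding)       # conditioned on the new prompt
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(1.0 - betas[t])
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                                             # edited latent, ready to decode
```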
Connecting Videos
We can also use Sora to gradually interpolate between two input videos, creating
seamless transitions between videos with entirely different subjects and scene
compositions.
4.2 IMAGE GENERATION CAPABILITIES
Sora is also capable of generating images. We do this by arranging patches of
Gaussian noise in a spatial grid with a temporal extent of one frame. The model
can generate images of variable sizes—up to 2048 x 2048 resolution.
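A minimal sketch of this setup: build a single-frame grid of Gaussian noise in latent space whose spatial extent matches the desired output resolution, then run the usual denoising loop. The latent channel count and downsampling factor below are assumptions.

```python
import torch

# Images are treated as one-frame videos: arrange Gaussian-noise patches in a
# spatial grid with a temporal extent of one frame, then denoise as usual.
# Latent channel count and downsampling factor are illustrative.

def initial_noise_grid(height_px, width_px, latent_channels=8,
                       spatial_downsample=8, frames=1):
    # The grid size in latent space determines the output resolution.
    h = height_px // spatial_downsample
    w = width_px // spatial_downsample
    return torch.randn(1, latent_channels, frames, h, w)

noise = initial_noise_grid(2048, 2048)      # single-frame grid for a 2048x2048 image
print(noise.shape)                          # torch.Size([1, 8, 1, 256, 256])
# This tensor would be passed to the same denoising loop used for videos.
```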
Fig 4.3 Image generated by Sora of resolution 2048 x 2048
4.3 EMERGING SIMULATION CAPABILITIES
We find that video models exhibit a number of interesting emergent capabilities
when trained at scale. These capabilities enable Sora to simulate some aspects of
people, animals and environments from the physical world. These properties
emerge without any explicit inductive biases for 3D, objects, etc.—they are purely
phenomena of scale.
4.3.1 Three-Dimensional Consistency
Sora can generate videos with dynamic camera motion. As the camera shifts and
rotates, people and scene elements move consistently through three-dimensional
space.
4.3.2 Long-Range Coherence and Object Permanence
A significant challenge for video generation systems has been maintaining
temporal consistency when sampling long videos. We find that Sora is often,
though not always, able to effectively model both short- and long-range
dependencies. For example, our model can persist people, animals and objects
even when they are occluded or leave the frame. Likewise, it can generate multiple
shots of the same character in a single sample, maintaining their appearance
throughout the video.
4.3.3 Interacting with the World
Sora can sometimes simulate actions that affect the state of the world in simple
ways. For example, a painter can leave new strokes along a canvas that persist
over time, or a man can eat a burger and leave bite marks.
4.3.4 Simulating Digital Worlds
Sora is also able to simulate artificial processes; one example is video games. Sora
can simultaneously control the player in Minecraft with a basic policy while also
rendering the world and its dynamics in high fidelity. These capabilities can be
elicited zero-shot by prompting Sora with captions mentioning “Minecraft.”
These capabilities suggest that continued scaling of video models is a promising
path towards the development of highly-capable simulators of the physical and
digital world, and the objects, animals and people that live within them.
CHAPTER 5
CONCLUSION
The evolution of artificial intelligence (AI) has brought forth a transformative era
in video content creation, where text can be seamlessly translated into captivating
visual narratives. The exploration of "New Age AI: Creating Video from Text"
has illuminated the tremendous potential, challenges, and ethical considerations
inherent in this cutting-edge technology.
As we reflect on the discussed objectives, it is evident that AI-driven video
creation holds immense promise across diverse sectors. From personalized
marketing campaigns that resonate with individual preferences to immersive
educational experiences that transcend traditional boundaries, the applications of
text-to-video AI are vast and impactful.
Sora marks a significant stride forward in the realm of AI-generated video content.
Its unique capabilities and user-friendly features open doors for content creators,
educators, and businesses to explore new dimensions in visual storytelling. As
Sora continues to evolve, it holds the promise of transforming the way we
perceive and create videos in the digital landscape.
Looking ahead, the future of AI in video creation promises continued innovation
and refinement. Anticipated advancements in accuracy, realism, and
customization options will further enhance the quality and user experience of AI-
generated video content. Collaboration between researchers, developers, content
creators, and stakeholders will play a pivotal role in driving these advancements
and ensuring ethical AI deployment.
We believe the capabilities Sora has today demonstrate that continued scaling of
video models is a promising path towards the development of capable simulators
of the physical and digital world, and the objects, animals and people that live
within them.
In conclusion, "New Age AI: Creating Video from Text" represents a significant
milestone in the evolution of AI and visual storytelling. By embracing the
potential of AI-driven video creation while upholding ethical standards and
fostering collaboration, we can unlock new realms of creativity, engagement, and
inclusivity in the digital landscape.
CHAPTER 6
FUTURE SCOPE
The future scope for "New Age AI: Creating Video from Text" is promising and
expansive, with several potential avenues for growth, innovation, and impact.
Here are some key areas of future scope:
Enhanced Realism and Immersion: AI algorithms will continue to evolve,
leading to improvements in generating video content that is highly realistic and
immersive. Advances in natural language understanding, computer vision, and
graphics rendering will contribute to more lifelike visuals and seamless
integration of text-based narratives into video formats.
Interactive and Personalized Experiences: Future developments may enable AI
systems to create interactive video content that responds dynamically to viewer
inputs or preferences. Personalization algorithms could tailor video narratives
based on individual user profiles, enhancing engagement and relevance.
Multimodal Fusion: The integration of multiple modalities such as text, audio,
images, and video will enable AI systems to create rich, multimodal storytelling
experiences. This could lead to innovative multimedia presentations, virtual
reality (VR) experiences, and mixed-reality content that blurs the line between
physical and digital worlds.
Cross-Domain Applications: AI-powered video creation from text will find
applications beyond traditional sectors such as marketing and education.
Industries like healthcare, journalism, gaming, and virtual events may leverage
this technology for purposes such as medical education simulations, news
reporting, interactive storytelling in games, and virtual conferences.
Ethical AI and Bias Mitigation: Continued efforts will be made to address
ethical concerns related to AI-generated content, including bias mitigation,
fairness, transparency, and accountability. Development of ethical AI
frameworks, bias detection algorithms, and responsible AI practices will be
crucial for fostering trust and societal acceptance.
Collaborative Creation Platforms: Future platforms may emerge that facilitate
collaborative creation of AI-generated video content, enabling teams of creators,
designers, and AI experts to collaborate seamlessly. These platforms could
integrate version control, real-time editing, and feedback mechanisms to
streamline the content creation workflow.
Education and Skill Development: As AI-driven video creation becomes more
accessible, education and skill development programs will play a vital role in
equipping individuals with the knowledge and expertise to harness this
technology effectively. Training initiatives, online courses, and workshops
focused on AI content creation will empower a new generation of creators and
storytellers.
Regulatory Frameworks and Standards: Governments and regulatory bodies
may develop frameworks and standards to govern the use of AI in video content
creation, ensuring compliance with legal requirements, privacy norms, and ethical
guidelines. Industry collaborations and self-regulatory initiatives will also
contribute to shaping responsible AI practices.
In summary, the future scope for "New Age AI: Creating Video from Text" is
characterized by continuous innovation, interdisciplinary collaborations, ethical
considerations, and a wide range of potential applications across various domains.
Embracing these opportunities while addressing challenges will pave the way for
a dynamic and inclusive AI-powered content creation landscape.
REFERENCES
1. Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhudinov. "Unsupervised learning of video representations using LSTMs." International Conference on Machine Learning. PMLR, 2015.
2. Chiappa, Silvia, et al. "Recurrent environment simulators." arXiv preprint arXiv:1704.02254 (2017).
3. Ha, David, and Jürgen Schmidhuber. "World models." arXiv preprint arXiv:1803.10122 (2018).
4. Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." Advances in Neural Information Processing Systems 29 (2016).
5. Tulyakov, Sergey, et al. "MoCoGAN: Decomposing motion and content for video generation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
6. Clark, Aidan, Jeff Donahue, and Karen Simonyan. "Adversarial video generation on complex datasets." arXiv preprint arXiv:1907.06571 (2019).
7. Brooks, Tim, et al. "Generating long videos of dynamic scenes." Advances in Neural Information Processing Systems 35 (2022): 31769-31781.
8. Yan, Wilson, et al. "VideoGPT: Video generation using VQ-VAE and transformers." arXiv preprint arXiv:2104.10157 (2021).
9. Wu, Chenfei, et al. "Nüwa: Visual synthesis pre-training for neural visual world creation." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.
10. Ho, Jonathan, et al. "Imagen Video: High definition video generation with diffusion models." arXiv preprint arXiv:2210.02303 (2022).
11. Blattmann, Andreas, et al. "Align your latents: High-resolution video synthesis with latent diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
12. Gupta, Agrim, et al. "Photorealistic video generation with diffusion models." arXiv preprint arXiv:2312.06662 (2023).
13. Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).
14. Brown, Tom, et al. "Language models are few-shot learners." Advances in Neural Information Processing Systems 33 (2020): 1877-1901.
15. Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
16. Arnab, Anurag, et al. "ViViT: A video vision transformer." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
17. He, Kaiming, et al. "Masked autoencoders are scalable vision learners." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
18. Dehghani, Mostafa, et al. "Patch n' Pack: NaViT, a Vision Transformer for any aspect ratio and resolution." arXiv preprint arXiv:2307.06304 (2023).
19. Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
20. Kingma, Diederik P., and Max Welling. "Auto-encoding variational Bayes." arXiv preprint arXiv:1312.6114 (2013).
21. Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
22. Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.
23. Nichol, Alexander Quinn, and Prafulla Dhariwal. "Improved denoising diffusion probabilistic models." International Conference on Machine Learning. PMLR, 2021.
24. Dhariwal, Prafulla, and Alexander Quinn Nichol. "Diffusion models beat GANs on image synthesis." Advances in Neural Information Processing Systems. 2021.
25. Karras, Tero, et al. "Elucidating the design space of diffusion-based generative models." Advances in Neural Information Processing Systems 35 (2022): 26565-26577.
26. Peebles, William, and Saining Xie. "Scalable diffusion models with transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
27. Chen, Mark, et al. "Generative pretraining from pixels." International Conference on Machine Learning. PMLR, 2020.
28. Ramesh, Aditya, et al. "Zero-shot text-to-image generation." International Conference on Machine Learning. PMLR, 2021.
29. Yu, Jiahui, et al. "Scaling autoregressive models for content-rich text-to-image generation." arXiv preprint arXiv:2206.10789 (2022).
30. Betker, James, et al. "Improving image generation with better captions." https://cdn.openai.com/papers/dall-e-3.pdf (2023).
31. Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with CLIP latents." arXiv preprint arXiv:2204.06125 (2022).
32. Meng, Chenlin, et al. "SDEdit: Guided image synthesis and editing with stochastic differential equations." arXiv preprint arXiv:2108.01073 (2021).