EDIQUICK: DYNAMIC VIDEO EDITING WITH INTEGRATED MUSIC
Paras Damale, CSE (AIML), KIT’s College of Engineering Kolhapur, Maharashtra, India (damamleparas@gmail.com)
Sidhesh Patil, CSE (AIML), KIT’s College of Engineering Kolhapur, Maharashtra, India (sidheshpatil668@gmail.com)
Rajwardhan Repe, CSE (AIML), KIT’s College of Engineering Kolhapur, Maharashtra, India (reperajwardhan@gmail.com)
Bhavesh Ahuja, CSE (AIML), KIT’s College of Engineering Kolhapur, Maharashtra, India (bhaveshahuja0302@gmail.com)
Ujwala Salunkhe, CSE (AIML), KIT’s College of Engineering Kolhapur, Maharashtra, India
Uma Gurav, CSE (AIML), KIT’s College of Engineering Kolhapur, Maharashtra, India (gurav.uma@kitcoek.in)
Abstract—Video editing plays a pivotal role in transforming raw footage into polished, engaging content by enhancing visual and audio elements. Despite its significance, traditional video editing is frequently time-intensive and requires advanced technical skills. This paper introduces EDIQUICK, an automated video editing solution that integrates audio processing, natural language processing (NLP), and sentiment-driven enhancements to streamline the editing workflow. The system reduces manual effort by implementing silence detection, audio-to-text transcription, sentiment analysis, and context-based image and music incorporation. By automating repetitive tasks such as trimming, synchronization, and visual effect application, EDIQUICK offers a user-friendly and efficient platform for novice and professional video editors. Results demonstrate the system’s ability to produce cohesive and visually compelling content with reduced processing time and enhanced usability.

Index Terms—Automation, Video Editor, Silence Detection, Music Generation

I. INTRODUCTION

Video editing is essential in transforming raw footage into a compelling and professional final product. It enhances the video’s visual quality by incorporating effects and eliminating unnecessary segments. Editing also facilitates the inclusion of graphics, animations, and text overlays, which help create a more dynamic presentation. This is particularly vital in business contexts, as high-quality video editing contributes to building a credible and polished brand image. Additionally, editing ensures the audio and visuals are well-aligned, delivering a seamless viewing experience [1].

Beyond the basics, video editing empowers creators with full control over the narrative structure and style of the content. Techniques such as using multiple camera angles and split-screen views provide diverse perspectives, which is useful for conveying complex concepts. Moreover, adding elements like images, animations, and infographics can present information clearly and engagingly, making video editing indispensable in educational, promotional, and corporate media. Another important aspect is the ability to tailor content for different audiences, including those with accessibility needs, by adding subtitles, captions, and audio descriptions [1] [2].

Despite its importance, manual video editing presents several challenges that can affect efficiency and quality. The process is time-consuming, particularly with longer projects, as editors spend a lot of time cutting, trimming, and applying effects. This can also result in prolonged rendering times for high-resolution videos. Additionally, the complexity of advanced editing software can be daunting for newcomers, requiring significant learning to master. The editing workflow can be disjointed, with multiple tasks, such as color grading, transitions, and audio adjustments, handled across different tools. Errors can occur due to reliance on human skills, such as improperly synced audio or missed effects, leading to inconsistencies. Collaboration also becomes challenging when managing various project versions or sharing large files. Finally, the subjective nature of creative decisions means that the outcome depends heavily on the editor’s experience and artistic vision, which can vary significantly across individuals [1] [3].

Our study addresses some of the key challenges in traditional video editing through an algorithmic solution. One of the main issues we tackle is the time spent on cutting and trimming silent sections of videos. By using an algorithm that operates on audio graphs, we reduce this time significantly, enabling faster processing. This audio-focused approach is also more efficient than handling video data directly. Furthermore, our method simplifies video editing, eliminating the need for complex software knowledge. It integrates various editing steps, such as transitions, color adjustments, and audio synchronization, into a unified process, enhancing user-friendliness. Since the algorithm automates much of the work, the likelihood of human errors like unsynchronized audio or missed effects is minimized. Users still maintain control over specific edits, allowing them to apply custom effects or transitions, while the system handles repetitive tasks. This approach not only streamlines the editing process but also makes video editing more accessible and efficient for both novice and experienced users. The paper is structured as follows: Section 2 reviews related work in the field, Section 3 outlines the proposed approach, Section 4 details the methodology, Section 5 presents the results and discussion, and Section 6 concludes the study.
II. RELATED WORKS

Recent advancements in deep learning and AI offer substantial insights for our proposed video editor. Kalingeri and Grandhe [4] improved music generation by incorporating convolutional layers into LSTM models, which enhances audio processing capabilities and produces smoother outputs. Conner et al. [5] also employed LSTM networks, focusing on sequential learning from musical patterns, though they faced challenges with long-range dependencies and noise reduction. Huang et al. [16] expanded on this by combining CNNs with LSTMs, resulting in a model that better captures local and sequential patterns for generating expressive music, relevant to our sentiment-based background music selection.

In video editing, tools like those developed by Soe [6] have automated basic tasks like summarization but fall short on more intricate editing needs. Our approach bridges this gap, integrating sentiment and entity recognition to enable precise trimming and context-specific enhancements.

Further, Mishra’s [7] Ice Cube Algorithm improves audio-video synchronization in multilingual contexts through adaptive time-stretching, inspiring our methods for aligning audio to the video’s emotional tone. In speech recognition, the Whisper model [4] shows adaptability across dialects, supporting our transcription processes and NLP-driven analysis.

Table I summarizes the key contributions of each referenced work and shows how they relate to the proposed video editing system.

III. PROPOSED WORK

Fig. 1. Workflow diagram

The automated video editor operates through a series of well-defined stages that enable efficient, sentiment-driven video editing by integrating audio processing, natural language processing (NLP), and video effects. Each step of this workflow is outlined as follows.

The process begins with the user uploading a video file through a web-based interface. Once uploaded, the video is temporarily stored on the server and ready for processing. Initially, the audio track is extracted from the video and duplicated, preparing it for various stages of analysis. The extracted audio is then passed to a separation function that isolates specific segments, marking the beginning of the processing pipeline. Following this, the isolated audio is analyzed using a silence detection algorithm. This algorithm identifies silent intervals within the audio track by measuring amplitude levels and locating sections that fall below a certain threshold. The timestamps for these silent segments are recorded and used to trim the video, thus reducing pauses and streamlining the visual flow. Using the timestamps generated by the silence detection algorithm, the trim function cuts out the corresponding silent portions of the video, ensuring that it retains its intended narrative without unnecessary gaps. After trimming, the video is enhanced by applying predefined visual effects through the effects function, adding stylistic consistency. The enhanced, trimmed video is then sent back to the separation function for further audio analysis. In this stage, the audio data is converted into text using an audio-to-text function, creating a written transcription of the video’s spoken content. This transcription serves as a basis for sentiment analysis and further NLP-driven processing. The sentiment analysis function evaluates the transcribed text, identifying the overall emotional tone conveyed in each segment of dialogue or narration. Based on this sentiment, adjustments are made to the background music selection to match the video’s mood, creating a coherent and engaging viewer experience. In parallel, proper nouns identified during transcription are extracted using NLP techniques as in [8]. These extracted entities, such as names, locations, or notable terms, are then passed to an image function that interfaces with the Pixel Image API to retrieve relevant images associated with each entity. These images are incorporated into the video to provide visual context, enriching the narrative with supportive imagery.

The final stage of this methodology combines all processed elements into the final product. The effects-enhanced video, selected background music, and retrieved images are all fed into a concatenate function, which merges the elements seamlessly into one cohesive video. This fully edited video is then exported to the website interface, where the user can access and download the completed product, fully automated from upload to final delivery.
TABLE I
LITERATURE REVIEW

Vasanti Kalingeri, Srikanth Grandhe: Music Generation Using Deep Learning
  Methods used: LSTM networks, fully connected and convolutional layers.
  Performance criteria: Generated more pleasant music compared to other models.
  Limitations: Challenges in capturing long-range dependencies and reducing noise.

Than Htut Soe: AI in Video Editing
  Methods used: AI tools (Silver, Roughcut), shot detection, clip composition, audio synchronization.
  Performance criteria: Can automate tedious editing tasks and generate video summaries.
  Limitations: Lacks flexibility for comprehensive editing; AI struggles with complex tasks.

Yongjie Huang, Xiaofeng Huang, Qikai Cai: Music Generation Based on Convolution-LSTM
  Methods used: CNN-LSTM model combining CNN for feature extraction and LSTM for sequential data.
  Performance criteria: Accuracy of 97.81%; handles complex and expressive music.
  Limitations: Constant frequency distribution of generated music; lacks expressiveness found in real-world music.

Rajwant Mishra: Adaptive Time Stretching AI Algorithm for Video Editing after AI Translation
  Methods used: Ice Cube Algorithm for adaptive time-stretching.
  Performance criteria: Preserves natural sound while fitting translated audio within the original video timing.
  Limitations: Challenges in handling synchronization across multiple translations.

Satvika Nidhya Kiran, Yash Munnalal Gupta: POS Analyzer using NLP for ESL Learners
  Methods used: spaCy NLP library, Streamlit GUI, speech-to-text, text-to-speech.
  Performance criteria: Enhances accessibility for ESL learners in understanding grammar.
  Limitations: GUI needs further enhancements for ease of use; lacks advanced features for detailed grammar teaching.

Muhammad Asadullah, Shibli Nisar: A Silence Removal and Endpoint Detection in Speech Processing
  Methods used: RMS-based method for silence removal, bandpass filtering, frame division for RMS comparison, endpoint detection for voiced speech segments.
  Performance criteria: Achieves 97.2 percent accuracy; significantly reduces bandwidth and processing time by cutting down silent and unvoiced segments.
  Limitations: Potential for further improvement by adding additional features for enhanced accuracy and robustness in varied applications.

Ruilin Xu et al.: "Listening to Sounds of Silence for Speech Denoising"
  Methods used: Deep learning-based model utilizing silent interval detection to isolate noise characteristics; includes convolutional layers, bidirectional LSTMs, and spectrogram processing for noise removal.
  Performance criteria: Outperforms existing methods on denoising metrics such as PESQ, SSNR, and STOI; shows robust generalization across various noise types and languages.
  Limitations: Requires accurate silent-interval detection; trained primarily on English, with possible limitations in generalizing across languages with different phonological structures.

Devandran Govender: Investigating audio classification to automate the trimming of recorded lectures
  Methods used: The pyAudioAnalysis library classifies audio files into speech and non-speech categories, leveraging a Support Vector Machine (SVM) model; the model was trained and tested on 6,862 audio files, with segmentation and classification applied to distinguish speech from background noise.
  Performance criteria: Measured using Accuracy, Precision, Recall, and F-Measure, yielding high results (Accuracy: 97.9%, Precision: 98.7%, Recall: 97.1%, F-Measure: 97.9%).
  Limitations: Did not develop a custom classification system but used an open-source library for audio feature extraction and classification; the model may misclassify audio with high ambient noise, affecting accuracy.

Dave Rodriguez and Bryan J. Brown: "Comparative Analysis of Automated Speech Recognition Technologies for Enhanced Audiovisual Accessibility"
  Methods used: ASR tools evaluated through WER analysis on diverse AV materials; tools include Whisper AI, Microsoft Stream, AWS Transcribe, and Rev API, with a focus on accessibility and accuracy.
  Performance criteria: Whisper AI achieved the highest accuracy with an average WER of 1.3%, followed by Rev API at 2.5%, Microsoft Stream at 3.9%, and AWS Transcribe with the lowest accuracy at 4.5%; results indicate Whisper AI's superior performance across various content types.
  Limitations: Inconsistent performance across audio contexts; racial bias with higher WER for African American speakers due to limited model diversity.

Vlad-George Prundus et al.: "An Inter-Agreement Study of Automatic Emotion Recognition Techniques from Different Inputs"
  Methods used: Two text-based emotion recognition methods, the ParallelDots API and the text2emotion library, were used to identify emotions from text derived from transcriptions; ParallelDots uses a deep learning model, while text2emotion uses a lexicon-based approach.
  Performance criteria: The text-based approaches had fair agreement with each other, achieving a Fleiss Kappa score of 0.2566; ParallelDots had an overall accuracy of 26%, and text2emotion reached 16% on the selected dataset.
  Limitations: Limited accuracy in identifying complex or nuanced emotions, as well as challenges in processing text transcriptions that reflect multiple emotions or subtle expressions.
IV. METHODOLOGY

The methodology for the automated video editor is organized into the following subsections: Data Collection and Input Handling, where the user uploads a video through a web interface; Audio Extraction and Processing, involving the extraction of audio, separation, and silence detection; Video Trimming and Effects Application, which uses silence timestamps to trim the video and apply effects; Audio-to-Text Conversion and Text Analysis, transcribing audio to text, performing sentiment analysis, and extracting proper nouns; Image Retrieval and Background Music Selection, retrieving images based on extracted nouns and selecting music based on sentiment; and Video Assembly and Output Delivery, where the video, images, and background music are combined into a final output that is made available to the user.
A. Input Handling

The video input begins when the user selects a video file through an interactive web interface built with HTML and CSS. This interface is designed to be simple and responsive, ensuring an easy upload experience. After the user uploads the video, the frontend sends the file to the backend using an HTTP request.

On the backend, Django handles the video file by temporarily storing it in a specified directory on the server. The uploaded video is then forwarded to the main processing function, where it undergoes further processing for editing and analysis. This system streamlines the process, efficiently managing the flow of video from user input to preparation for the next stages in the pipeline.
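For illustration, a minimal Django view implementing this upload stage might look as follows. The endpoint name, the temporary directory, and the hand-off to a process_video routine are assumptions made for the sketch rather than details taken from the paper.

```python
# Hypothetical sketch of the upload endpoint (names and paths are illustrative).
from django.core.files.storage import FileSystemStorage
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

UPLOAD_DIR = "/tmp/ediquick_uploads"  # assumed temporary storage directory

@csrf_exempt
def upload_video(request):
    """Receive a video file from the frontend and store it for processing."""
    if request.method != "POST" or "video" not in request.FILES:
        return JsonResponse({"error": "no video file provided"}, status=400)

    fs = FileSystemStorage(location=UPLOAD_DIR)
    saved_name = fs.save(request.FILES["video"].name, request.FILES["video"])
    video_path = fs.path(saved_name)

    # The stored file would next be handed to the main processing pipeline,
    # e.g. process_video(video_path)  (hypothetical helper).
    return JsonResponse({"status": "uploaded", "path": video_path})
```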
B. Audio Extraction and Processing

1) Audio separation: To extract the audio from the video, the VideoFileClip function from the moviepy.editor library is utilized. This function loads the video file and provides access to its audio track. The audio is then isolated and a copy is created, enabling further independent processing of the audio without altering the original video. This extracted audio can be used for various tasks, such as silence detection, transcription, or sentiment analysis, while preserving the integrity of the video content for subsequent editing.
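A minimal sketch of this step is shown below; the input and output file names are placeholders.

```python
# Sketch: extract and duplicate the audio track with moviepy (file names are placeholders).
from moviepy.editor import VideoFileClip

clip = VideoFileClip("input.mp4")          # load the uploaded video
audio = clip.audio                          # access its audio track
audio.write_audiofile("input_audio.wav")    # save an independent copy for analysis
clip.close()
```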
2) Silence Removal and Video Trimming: To implement effective silence detection, an algorithm was developed that utilizes both librosa and numpy for efficient audio processing. First, the audio track is loaded using the librosa.load function, which provides the amplitude values of the audio signal along the y-axis. Each amplitude value is then squared, which helps to amplify the difference between silence (low amplitude) and speech (high amplitude). Squaring the values enhances separation, making noise and silence, represented by smaller values, appear even lower, while boosting values representing speech.

Fig. 2. Graph of audio before squaring
Fig. 3. Graph of audio after squaring

After squaring, an empty array is created to store binary indicators for each audio sample. A threshold of 0.1 is applied to these squared values: samples above the threshold are marked as 1 (indicating speech), while samples below are marked as 0 (indicating silence), as demonstrated in the audio thresholding of [9]. This threshold was determined empirically to balance the detection of silence with speech clarity, minimizing the risk of softer speech being classified as silence.

With this binary array, the algorithm identifies silence intervals by detecting continuous sequences of 0s. If a sequence of 0s exceeds 3 seconds, the algorithm records the start and end indices of that interval, marking it as a silent segment. These indices are stored in an array, providing a clear record of silence intervals within the audio track, following the approach detailed by [9] [10] for segment-based silence detection.

Finally, using moviepy.editor’s VideoFileClip function, the video is trimmed according to these identified silence intervals. The start and end indices are applied to remove sections of the video that correspond to extended silence, thus producing a refined and engaging final video. This approach ensures that unnecessary pauses are minimized, improving continuity and enhancing viewer engagement, as outlined in prior work on video editing automation.
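Putting these steps together, a minimal sketch of the silence-detection and trimming procedure is given below. It is an illustrative reconstruction rather than the authors' exact code, and the file names are placeholders, but the squaring, the 0.1 threshold on squared amplitudes, the 3-second minimum gap, and the moviepy-based cutting follow the description above.

```python
# Sketch of silence detection on squared amplitudes followed by video trimming.
import librosa
import numpy as np
from moviepy.editor import VideoFileClip, concatenate_videoclips

THRESHOLD = 0.1      # applied to squared amplitudes, as described in the text
MIN_SILENCE_S = 3.0  # silences longer than this are removed

y, sr = librosa.load("input_audio.wav", sr=None)
speech = (y ** 2 > THRESHOLD).astype(int)   # 1 = speech, 0 = silence

# Collect (start, end) times of silent runs longer than MIN_SILENCE_S.
silences, run_start = [], None
for i, flag in enumerate(np.append(speech, 1)):   # trailing sentinel closes an open run
    if flag == 0 and run_start is None:
        run_start = i
    elif flag == 1 and run_start is not None:
        if (i - run_start) / sr >= MIN_SILENCE_S:
            silences.append((run_start / sr, i / sr))
        run_start = None

# Keep everything outside the recorded silent intervals and rebuild the video.
video = VideoFileClip("input.mp4")
keep, cursor = [], 0.0
for start, end in silences:
    if start > cursor:
        keep.append(video.subclip(cursor, start))
    cursor = end
if cursor < video.duration:
    keep.append(video.subclip(cursor, video.duration))
trimmed = concatenate_videoclips(keep) if keep else video
trimmed.write_videofile("trimmed.mp4")
```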
C. Effects Application

In the Effect Application phase of the proposed video editing system, a variety of visual effects are applied to enhance the video’s quality and tailor it to user preferences. These effects include black-and-white conversion, color adjustment, fade-in and fade-out transitions, gamma correction, margin addition, and speed modification. To ensure flexibility and user control, some effects are mutually exclusive, such as black-and-white and color adjustments, where users are prompted to select one. The fade-in and fade-out effects create smooth transitions between scenes, while gamma correction adjusts the brightness and contrast for improved visual appeal. The margin effect adds a border around the video, and the speed modification alters the playback rate to either slow down or speed up the video. These effects are applied using the MoviePy library, which provides video manipulation functions such as vfx.blackwhite, vfx.colorx, fadein(), fadeout(), vfx.gamma, and margin(). This methodology offers a user-friendly yet powerful way to apply a diverse set of visual effects, ensuring a high-quality and customized video output.
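The sketch below applies a few of these effects with moviepy 1.x; the parameter values and the particular combination are illustrative. Note that in the moviepy API the gamma and speed effects are exposed as vfx.gamma_corr and vfx.speedx.

```python
# Sketch: applying selected effects with moviepy 1.x (parameter values are illustrative).
from moviepy.editor import VideoFileClip
import moviepy.video.fx.all as vfx

clip = VideoFileClip("trimmed.mp4")

clip = clip.fx(vfx.colorx, 1.2)          # color boost (mutually exclusive with blackwhite)
# clip = clip.fx(vfx.blackwhite)         # alternative: black-and-white conversion
clip = clip.fx(vfx.gamma_corr, 1.1)      # gamma correction
clip = clip.fx(vfx.speedx, 1.05)         # slight speed-up of playback
clip = clip.fx(vfx.fadein, 1).fx(vfx.fadeout, 1)  # smooth transitions at both ends
clip = clip.fx(vfx.margin, 10)           # add a 10-pixel border

clip.write_videofile("effects.mp4")
```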
D. Audio-to-Text Conversion and Text Analysis

The trimmed audio is processed using OpenAI’s Whisper model in an audio-to-text function [11], yielding a result that includes several key attributes essential for detailed transcription analysis. These attributes are: Transcription, which provides a textual representation of the spoken content [12]; Timestamps, marking the exact time each word or phrase appears within the audio, allowing precise alignment; Language Detection, identifying the spoken language, which is beneficial in multilingual datasets; and Segmentation, breaking down long recordings into manageable chunks to improve readability and usability. Combined, these attributes enhance the model’s effectiveness in transcription across diverse audio environments and languages. The transcribed text is then analyzed using sentiment analysis to identify the emotional tone, similar to [13], and natural language processing (NLP) techniques are applied to extract proper nouns [14].
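A compact sketch of this stage is shown below. Whisper's transcribe interface is used as documented; the paper does not name its sentiment-analysis or NLP libraries, so TextBlob and spaCy appear here purely as stand-ins.

```python
# Sketch: transcription with Whisper, plus stand-in sentiment and proper-noun extraction.
import whisper
import spacy
from textblob import TextBlob

model = whisper.load_model("base")
result = model.transcribe("input_audio.wav")   # yields text, segments, language

transcript = result["text"]
language = result["language"]
segments = [(s["start"], s["end"], s["text"]) for s in result["segments"]]

# Emotional tone of each segment (polarity in [-1, 1]) as a simple stand-in metric.
segment_sentiment = [(start, end, TextBlob(text).sentiment.polarity)
                     for start, end, text in segments]

# Named entities / proper nouns to drive image retrieval.
nlp = spacy.load("en_core_web_sm")
doc = nlp(transcript)
proper_nouns = sorted({ent.text for ent in doc.ents
                       if ent.label_ in ("PERSON", "GPE", "LOC", "ORG")})
print(language, proper_nouns[:10])
```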
E. Image Retrieval

The system retrieves relevant images by querying the Pixel Image Dataset API using the keywords and proper nouns identified during the text analysis of the transcribed audio. Each retrieved image is associated with a specific keyword or phrase, and they are stored in an array for later use. The images are tagged with corresponding timestamps from the transcription, ensuring that they are aligned with the appropriate moments in the video. This approach provides a more dynamic and contextually rich experience, as the images visually represent the spoken content in the video.
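As an illustration of the retrieval flow, a hedged sketch is shown below. The endpoint URL, credential, query parameters, and response fields are hypothetical placeholders, since the paper does not document the Pixel Image API's interface; only the overall pattern of one query per extracted keyword, stored alongside its transcript timestamp, follows the description above.

```python
# Hypothetical sketch of keyword-based image retrieval; endpoint and response
# fields are placeholders, not the actual Pixel Image API interface.
import requests

IMAGE_API_URL = "https://example.com/images/search"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                              # placeholder credential

def fetch_image_url(keyword):
    """Return the URL of the first image matching a keyword, or None."""
    resp = requests.get(
        IMAGE_API_URL,
        params={"query": keyword, "per_page": 1},
        headers={"Authorization": API_KEY},
        timeout=10,
    )
    resp.raise_for_status()
    hits = resp.json().get("photos", [])
    return hits[0]["src"] if hits else None

def collect_images(keywords_with_times):
    """Pair each (keyword, timestamp) with a retrieved image URL."""
    return [(t, kw, fetch_image_url(kw)) for kw, t in keywords_with_times]
```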
F. Background Music Selection

The first step involved gathering a comprehensive dataset of instrumental background music. The data was sourced from publicly available MIDI files drawn from online repositories and open-source music databases, with mood annotations based on the MIREX mood-tag taxonomy. These files were chosen to represent a variety of musical genres and moods. The original mood descriptors (Boisterous, Confident, Passionate, Rousing, Rowdy, Amiable/Good-natured, Cheerful, Fun, Rollicking, Sweet, Autumnal, Bittersweet, Brooding, Literate, Poignant, Wistful, Campy, Humorous, Silly, Whimsical, Witty, Wry, Aggressive, Fiery, Intense, Tense-Anxious, Visceral, and Volatile) were reduced to five classes suitable for background music: sad, happy, energetic, neutral, and peaceful.

Each MIDI file consisted of multi-track sequences encoded with information about musical notes. These notes were further serialized into numerical arrays, where each value corresponded to the pitch and duration of a note within a fixed time step. This representation allowed the sequential data to be fed effectively into the machine learning model.

The neural network architecture for the model was based on Long Short-Term Memory (LSTM) networks similar to [15] [16]. The architecture included two LSTM layers, each with 256 neurons, to capture temporal dependencies in the musical sequences. Dropout layers with a dropout rate of 20% were incorporated to mitigate overfitting. A final dense layer with a softmax activation function was used to classify and predict the next note in the sequence. The model was trained with a learning rate of 0.001 and a categorical crossentropy loss function. The training process spanned 50 epochs with a batch size of 64 and employed early stopping to avoid overfitting.

The system accepted a short seed sequence of notes as input to initiate the music generation process. The trained model generated subsequent notes iteratively by predicting the most probable next note based on the input sequence. A sampling strategy, such as argmax or temperature sampling, was used to select notes from the output probabilities. The generated sequence was then converted back into MIDI format for playback, enabling the creation of complete background music tracks.
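The hyperparameters above translate directly into a Keras model; the sketch below is a minimal reconstruction under those settings. The sequence length and note-vocabulary size are assumptions, and the data-preparation helper is hypothetical.

```python
# Sketch of the note-prediction network: two 256-unit LSTM layers, 20% dropout,
# softmax output, Adam at lr=0.001, categorical cross-entropy, 50 epochs,
# batch size 64, early stopping. X has shape (samples, SEQ_LEN, 1); y is one-hot.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

SEQ_LEN, N_NOTES = 100, 128   # assumed sequence length and note vocabulary size

model = Sequential([
    LSTM(256, input_shape=(SEQ_LEN, 1), return_sequences=True),
    Dropout(0.2),
    LSTM(256),
    Dropout(0.2),
    Dense(N_NOTES, activation="softmax"),
])
model.compile(optimizer=Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])

# X, y = load_note_sequences(...)   # hypothetical data-preparation helper
# model.fit(X, y, epochs=50, batch_size=64, validation_split=0.1,
#           callbacks=[EarlyStopping(patience=5, restore_best_weights=True)])

def generate(model, seed, n_steps=200):
    """Iteratively predict notes from a seed (seed must hold at least SEQ_LEN indices)."""
    seq = list(seed)
    for _ in range(n_steps):
        window = np.array(seq[-SEQ_LEN:], dtype="float32").reshape(1, SEQ_LEN, 1)
        seq.append(int(np.argmax(model.predict(window, verbose=0))))  # argmax sampling
    return seq[len(seed):]
```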
G. Video Assembly and Output Delivery

Once all the steps are completed, such as applying video effects, collecting images with appropriate timestamps, and obtaining the sentiment-based background music, the individual elements are passed to a concatenation function. This function combines the video, images, and audio into a seamless final output. Using Django’s backend, the final video is then prepared and stored on the server. The processed video is subsequently made available to the user through the web interface, where they can preview and download the video. Django facilitates this workflow by managing file storage, handling the interaction between the front-end and the back-end, and ensuring the video is delivered to the user in an efficient and seamless manner.
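A minimal sketch of this assembly step with moviepy is shown below; the file names, overlay placement, durations, and volume level are illustrative rather than taken from the system.

```python
# Sketch: overlay retrieved images at their timestamps and mix in background music.
from moviepy.editor import (VideoFileClip, ImageClip, AudioFileClip,
                            CompositeVideoClip, CompositeAudioClip)

video = VideoFileClip("effects.mp4")

# (timestamp, image path) pairs produced by the image-retrieval stage.
overlays = [(5.0, "keyword_1.jpg"), (20.0, "keyword_2.jpg")]
image_clips = [ImageClip(path).set_start(t).set_duration(3)
                              .resize(height=240).set_position(("right", "top"))
               for t, path in overlays]

music = AudioFileClip("generated_music.mp3").volumex(0.3)
if music.duration > video.duration:
    music = music.subclip(0, video.duration)

final = CompositeVideoClip([video, *image_clips])
tracks = [c for c in (video.audio, music) if c is not None]
final = final.set_audio(CompositeAudioClip(tracks))
final.write_videofile("final_output.mp4")
```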
V. RESULTS AND DISCUSSIONS

The results of the EDIQUICK system demonstrate significant improvements in video editing efficiency and content quality. The silence detection module effectively identifies and removes silence intervals from the audio, achieving a 30% reduction in video duration. As shown in the raw audio waveform (Fig. 4), silent regions characterized by flat amplitude were successfully detected and removed, resulting in a trimmed audio waveform (Fig. 5) with smooth transitions and uninterrupted flow. This optimization enhances viewer engagement by eliminating unnecessary pauses.

Fig. 4. Before trimming
Fig. 5. After trimming
Furthermore, the music selection module utilizes sentiment analysis to select background music that aligns with the emotional tone of the video, achieving an 85% accuracy in mood matching. The sentiment analysis model, trained over 30 iterations, achieved a training precision of 94% and a validation accuracy of 90%, with the loss function stabilizing at 0.08. The steepest improvement in accuracy occurred between the fifth and 15th epochs, with a maximum accuracy slope of 0.2 per iteration, reflecting the efficiency of the model in learning emotional patterns.

The final video output exhibits improved flow and reduced pauses, with the original video duration reduced from 4 to 2.5 minutes, as shown in Fig. 5. This improvement streamlines the viewing experience, maintaining the attention of the audience. In addition, the trimmed video incorporates sentiment-driven background music and contextually relevant visual effects, contributing to a cohesive and engaging narrative. The system automates repetitive tasks, such as silence removal and synchronization, significantly reducing manual effort and editing time while enhancing video quality. These results validate the effectiveness of the EDIQUICK system in simplifying video editing workflows, making it a practical and user-friendly tool for content creators.
VI. CONCLUSION
This study presents EDIQUICK, an innovative approach to
automated video editing that addresses key challenges in tra-
ditional workflows, such as time consumption and complexity.
Using audio analysis, sentiment recognition, and AI-driven
video enhancement techniques, the system delivers efficient,
high-quality editing results. The integration of features such as
silence detection, sentiment-based music selection, and NLP-
powered visual context adds significant value, enabling users
to produce professional-grade videos effortlessly. Future work
will focus on expanding the system’s capabilities to support
real-time editing, multilingual transcription, and advanced
personalization to cater to a wider range of applications and
user needs.
REFERENCES

[1] Than Htut Soe, "Semi-automation in video editing," 2022.
[2] Than Htut Soe, "Automation in video editing: Assisted workflows in video editing," in AutomationXP@CHI, 2021.
[3] Haolin Xie and Yu Sun, "An intelligent video editing automate framework using AI and computer vision," Available at SSRN 4177685, vol. 12, no. 12, 2022.
[4] Siyin Wang, Chao-Han Yang, Ji Wu, and Chao Zhang, "Can Whisper perform speech-based in-context learning?," in ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 13421–13425.
[5] Michael Conner, Lucas Gral, Kevin Adams, David Hunger, Reagan Strelow, and Alexander Neuwirth, "Music generation using an LSTM," arXiv preprint arXiv:2203.12105, 2022.
[6] Than Htut Soe and M. Slavkovik, "AI video editing tools: What editors want and how far is AI from delivering them," 2021.
[7] Rajwant Mishra, "Adaptive time stretching AI algorithm for video editing after AI translation," Authorea Preprints, 2024.
[8] Satwika Nindya Kirana and Yash Munnalal Gupta, "Utilising natural language processing to assist ESL learners in understanding parts of speech."
[9] G. Saha, Sandipan Chakroborty, and Suman Senapati, "A new silence removal and endpoint detection algorithm for speech and speaker recognition applications," in Proceedings of the 11th National Conference on Communications (NCC), 2005, pp. 291–295.
[10] Piyush P. Gawali, Dattatray G. Takale, Gopal B. Deshmukh, Shraddha S. Kashid, Parikshit N. Mahalle, Bipin Sule, Patil Rahul Ashokrao, and Deepak R. Derle, "Enhancing speech emotion recognition combining silence elimination," ICT for Intelligent Systems: Proceedings of ICTIS 2024, vol. 4, p. 409.
[11] Dave Rodriguez and Bryan J. Brown, "Comparative analysis of automated speech recognition technologies for enhanced audiovisual accessibility," Code4Lib Journal, 2023.
[12] Yuan Gong, Sameer Khurana, Leonid Karlinsky, and James Glass, "Whisper-AT: Noise-robust automatic speech recognizers are also strong general audio event taggers," arXiv preprint arXiv:2307.03183, 2023.
[13] Lakshay Saini, Prachi Verma, and Sumedha Seniaray, "The sentiment analysis and emotion detection of COVID-19 online education tweets using ML techniques," in AIP Conference Proceedings, vol. 3072. AIP Publishing, 2024.
[14] Punith S. Y., Sanjay H. P., and Sanjana A. Hiremath, "3 way emotion detection."
[15] Sanidhya Mangal, Rahul Modak, and Poorva Joshi, "LSTM based music generation system," arXiv preprint arXiv:1908.01080, 2019.
[16] Yongjie Huang, Xiaofeng Huang, and Qiakai Cai, "Music generation based on convolution-LSTM," Comput. Inf. Sci., vol. 11, no. 3, pp. 50–56, 2018.
Fig. 6. Happy accuracy
Fig. 7. Happy loss
Fig. 8. Neutral accuracy
Fig. 9. Neutral loss
Fig. 10. Peaceful accuracy
Fig. 11. Peaceful loss
Fig. 12. Energetic accuracy
Fig. 13. Energetic loss
Fig. 14. Sad loss
Fig. 15. Sad accuracy