
A Machine Learning based Approach to Video Summarization

Varun Luthra*, Jayanta Basak†, Prof. Santanu Chaudhury*, and K.A.N. Jyothi*
*Electrical Engineering Department, Indian Institute of Technology, Hauz Khas, New Delhi
Email: schaudhury@gmail.com, jyothi1984@gmail.com
†IBM India Research Lab, New Delhi

Abstract—In this paper we propose a novel machine learning based approach to the video summarization problem. A semi-supervised learning algorithm is used to generate the summaries. Manually generated summaries (ideal summaries) serve as the labeled samples for the semi-supervised learning. Both visual and aural feature vectors taken over a set of videos are clustered, and the individual video sequences are represented as vector-quantized time series. A state transition machine based representation is then generated for both the complete class of videos and the labeled samples. A new information-theoretic measure is proposed for the goodness of a generated summary, which reduces the summarization process to finding the sequence of frames for which the value of the goodness measure is maximum.

I. INTRODUCTION

The rapidly growing amount of digital video in today's networks gives rise to the problem of efficient browsing and management of video data. To solve these problems, video summarization, which offers a concise representation of the original video clips by showing the most representative synopsis, is gaining attention. There are two fundamental types of video summaries: the static video abstract, which is a sequence of key frames, and dynamic video skimming, which is a collection of dynamically-composed audio-video sub-clips. In both cases the goal is to find the most interesting or important video segments that capture the essence of the original clips.

Some work has been done in the past in the field of video summarization from various perspectives. For static video summaries, most early work selects key frame images by random or uniform sampling, as in the Mini Video system [Taniguchi 95]. Later work tends to extract key frame images by adapting to the video content. A mosaic based approach is suggested in [Lee 97]. In [Rui 99], the authors analyzed the video structure after video segmentation, and then obtained a tree-structured Video-Table-Of-Contents (V-TOC).

For video skimming generation, in the VA abstract system [Leinhart 97], key movie segments are selected to form a movie trailer. The Informedia system [Kanade 97] selects video segments according to the occurrence of important keywords in the corresponding caption text. Another effort in this field is the attention model by Ma et al. [Ma 02], which aims at bridging the gap between low level features and human perception by analyzing viewers' attention to generate the summary. Sundaram and Chang [Sundaram 01] use Kolmogorov complexity as a measure of video shot complexity, and compute the video summary according to both video shot complexity and additional semantic information under a constrained optimization formulation. Divakaran et al. [Divakaran 03] have used MPEG-7 motion activity and audio descriptors to generate the video summary. Parshin et al. proposed another approach for video summarization using user-defined constraints and preferences in [Parshin 04]. Zhang et al. proposed a different approach to video summarization using graph modelling and motion attention [Zhang 05].

Though a lot of work has been done in the past, not much effort has been put into the development of machine learning based algorithms for the purpose of summarization. Generating a perfect summary for a given video requires a good understanding of the video's semantic content, which is difficult to capture with most existing techniques. Machine learning provides a way to capture the hidden semantic content of a video sequence. We present a semi-supervised learning based algorithm to generate summaries for videos known to belong to a particular class of videos. Our algorithm assumes that the user will have a similar interest pattern over a class of videos. We have tested our technique extensively over two different classes of videos, and subjective Precision-Recall graphs have been drawn to assess the effectiveness of our algorithm.

II. VIDEO SEQUENCE REPRESENTATION

Conventional approaches to video summarization have used frame-based features to generate a representation for the videos. In this work, we have tried to capture the information content present in the transitions between frames. Various aural and visual features have been used here to capture various aspects of similarity.

A. Feature Extraction
1) Audio features: The most common audio classes in videos are speech, silence, music, and combinations of these three. These classes can be well distinguished using the Short Time Energy (STE), Zero Crossing Rate (ZCR) and Fundamental Frequency functions. The Short Time Energy function basically distinguishes speech from music, the Short Time Zero Crossing Rate (STZCR) is used to separate voiced speech from unvoiced speech, and the Short Time Fundamental Frequency (STFF) separates audio into harmonic and nonharmonic classes. In this way all the speech components are well distinguished using these three features.

Both STE and STZCR are calculated for every overlapping window of 511 samples of the audio signal, with an overlap of 35 samples at either end of the window, at a sampling rate of 44100 samples/s. The STFF of the audio segment is estimated over an overlapping window of 2048 samples with an overlap of 284 samples. When no fundamental frequency can be estimated, the STFF is set to zero. Once these features have been extracted, the different audio classes are characterized using the statistical property of variance over overlapping windows of 140 feature samples with an overlap of 40 samples at either end of the window. Thus we obtain a feature with one sample for every second of the audio.
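As an illustration, the following sketch computes STE and ZCR over overlapping windows in the spirit of the description above. The window sizes and overlaps follow the paper, but the hop size (window length minus both overlaps), the exact STE/ZCR formulas, and the helper names are our own assumptions, since the paper does not spell them out.

```python
import numpy as np

def short_time_features(signal, win=511, overlap=35):
    """Short Time Energy and Zero Crossing Rate over overlapping windows.

    Window geometry follows the paper (511-sample windows with a
    35-sample overlap at either end); the hop of win - 2*overlap is an
    assumption, since the paper does not state it explicitly.
    """
    hop = win - 2 * overlap
    ste, zcr = [], []
    for start in range(0, len(signal) - win + 1, hop):
        w = signal[start:start + win].astype(np.float64)
        ste.append(np.sum(w ** 2) / win)              # Short Time Energy
        zcr.append(np.mean(np.abs(np.diff(np.sign(w))) > 0))  # Zero Crossing Rate
    return np.array(ste), np.array(zcr)

def windowed_variance(feature, win=140, overlap=40):
    """Characterize audio classes by the variance of a feature stream over
    overlapping windows of 140 feature samples (overlap 40), as in the text."""
    hop = win - 2 * overlap
    return np.array([np.var(feature[s:s + win])
                     for s in range(0, len(feature) - win + 1, hop)])
```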
2) Visual Features: Color histogram, edge histogram and texture similarities are the three visual features used in the current study. In this work we employ the similarity between the histograms of two consecutive frames as one of the features characterizing the transition. We use the histogram intersection technique to measure the similarity between frames. Let h and g be the two histograms. Then the similarity is defined as

sim(h, g) = \frac{\sum_{i=1}^{n} \min(h(i), g(i))}{\max(|h|, |g|)}   (1)

where h and g are the histograms of the two frames in the video sequence for which the similarity measure has to be computed, |h| and |g| denote the magnitude sum of each histogram, and n is the size of the histograms. The same equation is used to find the color histogram, edge histogram and texture similarity measures between the two frames.
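The following is a direct transcription of equation (1); the histogram inputs are assumed to be 1-D arrays of bin counts.

```python
import numpy as np

def histogram_intersection(h, g):
    """Histogram intersection similarity of equation (1): the overlap of the
    two histograms, normalized by the larger total mass."""
    h = np.asarray(h, dtype=np.float64)
    g = np.asarray(g, dtype=np.float64)
    return np.minimum(h, g).sum() / max(h.sum(), g.sum())

# Example: two 8-bin histograms of consecutive frames (illustrative values).
h = np.array([10, 20, 30, 5, 0, 0, 15, 20])
g = np.array([12, 18, 25, 9, 2, 1, 13, 20])
print(histogram_intersection(h, g))  # in [0, 1]; 1.0 for identical histograms
```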
Following the extraction of the audio and visual features, we use feature fusion to integrate the two into one combined audio-visual feature vector. The audio features (with the exception of the short time fundamental frequency) are produced at a rate of 1 sample/sec, whereas the video features are obtained at 25 samples/sec. In order to bring them both to the same temporal scale, we take the median value of the visual features over every 25 samples. This gives us a feature vector that is generated at a rate of 1 sample/sec. Such a feature vector combines both the audio and the visual cues in the video sequence while taking into account their interdependence.
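A minimal sketch of this temporal alignment and fusion, assuming the visual features arrive as a (frames x dims) array at 25 fps and the audio features as a (seconds x dims) array at 1 Hz; the function name and array layout are our assumptions.

```python
import numpy as np

def fuse_audio_visual(visual, audio, fps=25):
    """Downsample 25 fps visual features to 1 Hz by taking the per-second
    median, then concatenate with the 1 Hz audio features.

    visual: (n_frames, d_v) array; audio: (n_seconds, d_a) array.
    """
    n_seconds = min(len(visual) // fps, len(audio))
    # Median of each block of 25 visual samples -> one sample per second.
    v = np.array([np.median(visual[t * fps:(t + 1) * fps], axis=0)
                  for t in range(n_seconds)])
    return np.hstack([audio[:n_seconds], v])  # combined audio-visual vector
```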
B. Clustering of feature vectors

Once the feature vectors for all the videos in the dataset have been computed, the self-organizing-map (SOM) approach proposed by Kohonen is used to cluster the data. One of the reasons for choosing the Kohonen clustering approach over conventional techniques such as k-means is that the Kohonen technique creates a network that stores information in such a way that any topological relationships within the training set are maintained.

C. Sequential Representation of videos

Once we have clustered the feature vectors for the complete dataset, we generate the sequential representation of the individual videos. For each video, every frame transition is taken and assigned to the cluster to which it belongs. In this way we obtain cluster numbers for all the frame transitions in a video.
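A compact sketch of these two steps, assuming a small self-written 1-D Kohonen SOM rather than the authors' exact configuration (node count, epochs and rates are placeholder values). After training, each transition feature vector is replaced by the index of its winning node, yielding the vector-quantized time series.

```python
import numpy as np

def train_som(data, n_nodes=16, epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal 1-D Kohonen self-organizing map (illustrative only)."""
    rng = np.random.default_rng(seed)
    w = data[rng.choice(len(data), n_nodes, replace=False)]  # init node weights
    grid = np.arange(n_nodes)
    for e in range(epochs):
        lr = lr0 * (1 - e / epochs)                  # decaying learning rate
        sigma = sigma0 * (1 - e / epochs) + 1e-3     # shrinking neighborhood
        for x in data[rng.permutation(len(data))]:
            bmu = np.argmin(np.linalg.norm(w - x, axis=1))  # best matching unit
            h = np.exp(-(grid - bmu) ** 2 / (2 * sigma ** 2))
            w += lr * h[:, None] * (x - w)           # pull neighbors toward x
    return w

def quantize_video(transition_features, w):
    """Map each frame-transition feature vector to its cluster (node) number,
    giving the vector-quantized time series of the video."""
    return np.array([int(np.argmin(np.linalg.norm(w - x, axis=1)))
                     for x in transition_features])
```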
D. State Machine Representation

In our summary generation approach, we attempt to capture the semantics associated with a class of videos, and hence we must have a mathematical representation for the complete class of videos. We therefore formulate the state machine representation for the class and also give complete derivations for the transition probabilities associated with the state machine. In our state machine representation, the states are the various clusters formed using the Kohonen clustering, and we compute the transition probabilities for the state machine using a Bayesian approach. Our state transition machine can be considered a first-order Markov process, in which the probability of transition to the next state depends only on the current state.

1) Computing State Transition Probabilities: From the video sequence representation, we have cluster numbers for each of the transitions in the video sequence. Let us say that the transition from frame 1 to frame 2 belongs to cluster number i, and that the transition from frame 2 to frame 3 belongs to cluster number j. So, when we go from frame 1 to frame 3, we actually have a transition from cluster i to cluster j. In the same way, we count the number of transitions from each cluster to every other cluster for all the videos in the dataset. If N_{ij} is the number of transitions from cluster i to cluster j, then P_{ij}, the transition probability from state i to state j, is given by equation (2):

P_{ij} = \frac{N_{ij} + \alpha_{ij}}{\sum_{j=1}^{m} N_{ij} + \sum_{j=1}^{m} \alpha_{ij}}   (2)

where the \alpha_{ij} are the parameters of the Dirichlet prior.
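The sketch below turns the quantized sequences into the transition matrix of equation (2). The symmetric Dirichlet parameter alpha is an assumed placeholder, since the paper leaves the \alpha_{ij} unspecified.

```python
import numpy as np

def transition_matrix(sequences, m, alpha=1.0):
    """Estimate P_ij of equation (2) from vector-quantized cluster sequences.

    sequences: iterable of 1-D arrays of cluster numbers (one per video).
    m: number of clusters (states). alpha: symmetric Dirichlet prior,
    an assumed default.
    """
    N = np.zeros((m, m))
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):   # count consecutive transitions
            N[a, b] += 1
    A = np.full((m, m), alpha)                # Dirichlet pseudo-counts
    return (N + A) / (N + A).sum(axis=1, keepdims=True)
```

The same routine, applied to the cluster sequences of the ideal summaries, would yield the matrix Q introduced in the next section.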
III. SUMMARY GENERATION

A. Use of Ideal Summaries

By an ideal summary, we mean a manually generated summary that the user feels captures the essence of the video. Such summaries are given as an input to the learning system. These ideal summaries are used to formulate a state transition machine (Q), whose probabilities are estimated in the same way as for the state machine of the complete class of videos. These two state machines are in turn used to measure the goodness of any generated summary. The summary generation task thus reduces to finding the summary for which this goodness measure is maximum. The block diagram of the overall summary generation system is shown in Fig. 1.

Fig. 1. Block diagram of the summarization system (individual video frames pass through feature extraction, Kohonen clustering and vector-quantized time series generation into the state machine generator; ideal frames feed a summary state machine generator; both machines drive the summary generator that outputs the summary frames).
B. Finding the goodness measure

Let us assume an m-frame video for which the summary has to be generated. The given video is known to belong to a class, and we have the matrices P and Q for this class of videos. For the sake of convenience, let us also assume that we have to generate a 4-frame summary (i.e., the essence of the whole video can be seen in just 4 frames). Our task is then to find the indices [p, q, r, s] of the four summary frames. Assume that the cluster numbers associated with these frames are [i, j, k, l]. We define I, the goodness measure for the generated summary, as:

I = q_{c_i c_j} \log \frac{q_{c_i c_j}}{p_{c_i c_{i+1}} \, p_{c_{i+1} c_{i+2}} \cdots p_{c_{j-1} c_j}}
  + q_{c_j c_k} \log \frac{q_{c_j c_k}}{p_{c_j c_{j+1}} \, p_{c_{j+1} c_{j+2}} \cdots p_{c_{k-1} c_k}}
  + q_{c_k c_l} \log \frac{q_{c_k c_l}}{p_{c_k c_{k+1}} \, p_{c_{k+1} c_{k+2}} \cdots p_{c_{l-1} c_l}}   (3)

where c_i is the cluster number associated with frame number i, p_{jk} is the transition probability from cluster number j to cluster number k in the P matrix, and q_{jk} is the transition probability from cluster number j to cluster number k in the Q matrix.

For a summary consisting of a fixed number of frames, this measure essentially evaluates the semantic closeness associated with the frame transitions. The value of the goodness measure is expected to be high for a sequence of frames whose frame transitions are relatively more probable in the given set of example summaries. That is, the sequence of frames whose frame transitions best capture the semantics of the video corresponds to the required summary. The remaining problem is to find the sequence of frames that optimizes the goodness measure. We need an algorithm for the purpose of finding out this sequence of frames; Genetic Algorithms are one such class of algorithms.
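A sketch of equation (3), generalized to a summary of arbitrary length: each consecutive pair of summary frames contributes one term, comparing its Q-probability against the product of P-probabilities along the chain of intermediate transitions. Variable names follow the equation; the generalization beyond the 4-frame example is our reading of the text.

```python
import numpy as np

def goodness(summary_frames, clusters, P, Q, eps=1e-12):
    """Goodness measure I of equation (3).

    summary_frames: sorted frame indices of the candidate summary.
    clusters: cluster number c_t of every frame transition t in the video.
    P, Q: class and ideal-summary transition matrices.
    """
    I = 0.0
    for a, b in zip(summary_frames[:-1], summary_frames[1:]):
        q = Q[clusters[a], clusters[b]]
        # Product of class transition probabilities along the intermediate
        # chain, accumulated in log space for numerical stability.
        log_chain = sum(np.log(P[clusters[t], clusters[t + 1]] + eps)
                        for t in range(a, b))
        I += q * (np.log(q + eps) - log_chain)
    return I
```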
C. Genetic Algorithm

Genetic Algorithms encode a potential solution to a specific problem in a simple chromosome-like data structure and apply recombination operators to these structures so as to preserve critical information. A block diagram depicting the GA evolutionary cycle is shown in Fig. 2.

Fig. 2. GA evolutionary cycle (initiate and evaluate the population; select parents; modify them to produce offspring; evaluate the offspring; discard weak members).

In the present study, a GA has been used to find the sequence of frames that optimizes the goodness measure defined in equation (3). Here we represent the sequence of frames as bit strings in a simple chromosome-like data structure, and the bit string must have the property that even after crossover and mutation operations it represents a valid sequence of frame numbers. Once the current generation of population members has been chosen, the next step is to choose members of the next generation from the current one according to their fitness; the roulette wheel selection method has been used for this purpose. Reproduction of chromosomes is then done by the crossover and mutation operations. In crossover, two chromosomes are combined to generate two new chromosomes, whereas in mutation a single chromosome is modified to generate a new one. Good values for the crossover and mutation rates are usually around 0.7 and 0.01 respectively. Then, for the purpose of evaluation, each chromosome is first converted into a sequence of summary frame numbers. These numbers are plugged into the goodness measure of equation (3), and the returned value is taken as the fitness of the chromosome. These steps are repeated for a fixed number of iterations. The best chromosome thus found is converted to the sequence of frames which is presented to the user as the best found summary.
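A minimal GA loop in the spirit described: chromosomes are sorted frame-index tuples (rather than raw bit strings, for brevity), selection is roulette wheel on shifted fitness, and each child is repaired after crossover and mutation so it remains a valid frame sequence, as the text requires. The crossover and mutation rates of 0.7 and 0.01 follow the text; the population size and generation count are assumptions.

```python
import numpy as np

def ga_summarize(n_frames, k, fitness, pop=50, gens=100,
                 p_cross=0.7, p_mut=0.01, seed=0):
    """Search for the k-frame summary maximizing `fitness`
    (e.g. the goodness measure of equation (3))."""
    rng = np.random.default_rng(seed)

    def repair(c):
        c = np.unique(c)                              # drop duplicates, sort
        while len(c) < k:                             # refill to length k
            c = np.unique(np.append(c, rng.integers(n_frames)))
        return c[:k]

    popn = [np.sort(rng.choice(n_frames, k, replace=False)) for _ in range(pop)]
    best = max(popn, key=fitness)
    for _ in range(gens):
        f = np.array([fitness(c) for c in popn])
        w = f - f.min() + 1e-9                        # shift for roulette wheel
        parents = [popn[i] for i in rng.choice(pop, pop, p=w / w.sum())]
        nxt = []
        for a, b in zip(parents[::2], parents[1::2]):
            if rng.random() < p_cross:                # one-point crossover
                cut = rng.integers(1, k)
                a, b = (np.concatenate([a[:cut], b[cut:]]),
                        np.concatenate([b[:cut], a[cut:]]))
            for c in (a, b):
                if rng.random() < p_mut:              # point mutation
                    c = c.copy()
                    c[rng.integers(k)] = rng.integers(n_frames)
                nxt.append(repair(c))
        popn = nxt
        best = max(popn + [best], key=fitness)
    return best
```

With fitness bound to the goodness function above (given the video's cluster sequence and the P and Q matrices), the returned chromosome is the frame sequence presented to the user as the summary.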
IV. EXPERIMENTATION AND RESULTS

After the algorithm was designed and the complete system implemented, extensive testing was done to establish the effectiveness of our system. We have tested the system extensively over two different classes of videos, viz. home-shot party videos and soccer videos. We first tested our algorithm using only visual features, and then using aural features along with the visual features. It is observed that the results improve when aural features are considered along with the visual features. This improvement can be seen clearly in the subjective verification. The GUI interface showing the results of 20-frame summaries for home-shot party videos and for soccer videos, using the combined aural and visual features, is shown in Figs. 3 and 4 respectively.

Fig. 3. GUI interface showing summary results for a home-shot party video
Fig. 4. GUI interface showing summary results for a soccer video

Summary verification is an extremely subjective task, and the quality of a result will vary widely from subject to subject. We have used subjective precision and recall measures to assess the summary quality. To validate our results, we have performed ten-fold, five-fold and two-fold cross validation. The subjective Precision-Recall graphs of the ten-fold cross validation for both the home-party and soccer video data sets, for the two cases (with and without aural features), are shown in Figs. 5 and 6 respectively.

Fig. 5. Subjective assessment of summary results for home-party videos
Fig. 6. Subjective assessment of summary results for soccer videos

V. CONCLUSION

We have presented a video summarization system for summarizing videos known to belong to a class of videos. We have used semi-supervised learning to capture the high level semantics associated with a set of videos and to map them to low level features. Though semi-supervised learning has been used in the past for a number of applications such as face recognition, gesture recognition and medical systems, this work is the first attempt at using a semi-supervised learning algorithm for the purpose of video summarization.
ACKNOWLEDGMENT

This work has been supported by the grant for the DST project E-Video: Video Information Processing with Enhanced Functionality.
REFERENCES

[1] Y. Taniguchi, A. Akustu, Y. Tonomura and H. Hamada, "An intuitive and efficient access interface to real-time incoming video based on automatic indexing," in Proceedings of the Third ACM International Conference on Multimedia, 1995, pp. 25-33.
[2] M. Lee, W. Chen, C. Lin, C. Gu and T. Marloc, "A layered video object coding system using sprite and affine motion model," IEEE Transactions on Circuits and Systems for Video Technology, vol. 1, pp. 130-145, 1997.
[3] Y. Rui, T.S. Huang and S. Mehrotra, "Constructing table of content for videos," ACM Multimedia Systems Journal, Special Issue on Video Libraries, vol. 7, no. 5, pp. 359-368, 1999.
[4] R. Leinhart, S. Pfeiffer and W. Effelsberg, "Video abstracting," Communications of the ACM, pp. 55-62, December 1997.
[5] M. Smith and T. Kanade, "Video skimming and characterization through the combination of image and language understanding techniques," in Proceedings of Computer Vision and Pattern Recognition (CVPR), 1997.
[6] Y.F. Ma, L. Lu, H.J. Zhang and M.J. Li, "A user attention model for video summarization," in Proceedings of ACM Multimedia, pp. 533-542, 2002.
[7] H. Sundaram and S.F. Chang, "Constrained utility maximization for generating visual skims," IEEE Workshop on Content-Based Access of Image and Video Library, 2001.
[8] A. Divakaran, "Video summarization using motion and audio descriptors," Mitsubishi Electric Research Laboratories, 2003.
[9] V. Parshin and L. Chen, "Video summarization based on user-defined constraints and preferences," in Proceedings of RIAO, 2004.
[10] C.W. Ngo, Y.F. Ma and H.J. Zhang, "Video summarization and scene detection by graph modeling," IEEE Transactions on Circuits and Systems for Video Technology, 2005.
