
Accepted Manuscript

Efficient CNN based summarization of surveillance videos for resource-constrained devices

Khan Muhammad, Tanveer Hussain, Sung Wook Baik

PII: S0167-8655(18)30384-2
DOI: https://doi.org/10.1016/j.patrec.2018.08.003
Reference: PATREC 7267

To appear in: Pattern Recognition Letters

Received date: 5 April 2018
Revised date: 10 July 2018
Accepted date: 4 August 2018

Please cite this article as: Khan Muhammad, Tanveer Hussain, Sung Wook Baik, Efficient CNN based summarization of surveillance videos for resource-constrained devices, Pattern Recognition Letters (2018), doi: https://doi.org/10.1016/j.patrec.2018.08.003

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights
• We propose an energy-efficient CNN-based framework for summarization of surveillance videos
• Our system uses a novel shot segmentation scheme based on deep features
• Keyframes selection is based on image memorability, providing a diverse and interesting summary


Pattern Recognition Letters

journal homepage: www.elsevier.com

Efficient CNN based summarization of surveillance videos for resource-constrained devices

Khan Muhammad, Tanveer Hussain, Sung Wook Baik*

Intelligent Media Laboratory, Digital Contents Research Institute, Sejong University, Seoul-143-747, Republic of Korea

ABSTRACT

The widespread usage of surveillance cameras in smart cities has resulted in a gigantic volume of video data whose indexing, retrieval, and management is a challenging issue. Video summarization tends to detect important visual data from the surveillance stream and can help in efficient indexing and retrieval of required data from huge surveillance datasets. In this research article, we propose an efficient convolutional neural network based summarization method for surveillance videos of resource-constrained devices. Shot segmentation is considered a backbone of video summarization methods and it affects the overall quality of the generated summary. Thus, we propose an effective shot segmentation method using deep features. Furthermore, our framework maintains the interestingness of the generated summary using image memorability and entropy. Within each shot, the frame with the highest memorability and entropy score is considered as a keyframe. The proposed method is evaluated on two benchmark video datasets and the results are encouraging compared to state-of-the-art video summarization methods.

Keywords: Video Analysis, Video Summarization, Surveillance, Energy-Efficiency, Resource-Constrained Devices
1. Introduction

Wireless multimedia sensor networks have been extensively explored in smart cities for sensing, monitoring, event detection, and automatic reporting, which can be helpful for numerous applications such as disaster management [1, 2], e-health [3], surveillance [4, 5], ubiquitous communication [6, 7], and security systems [8-14]. Despite their maturity to some extent, their real implementation is a challenge due to the increased processing and transmission power needed for visual sensors to process and transmit the video data. This also affects the decision making process, which is based on observation of the surveillance environment. Recently, intelligent surveillance mechanisms have been presented that attempt to reduce processing and bandwidth consumption. For instance, Irfan et al. [15] used salient motion detection for image prioritization in multi-camera surveillance networks, with reduced processing time and reliable dissemination of important contents to the base station via a sink node. Bradai et al. [16] presented "EMCOS", a computationally friendly approach for streaming multimedia data in cognitive radio networks. Considering the increasing volume of video data, Almeida et al. [17] presented an online video summarization (VS) method in the compressed domain, which was later extended by Wang et al. [18] for video synopsis. Another VS method for online applications using the compressed domain is presented in [19]. Mundur et al. [20] used Delaunay triangulation (DT) based clustering for the selection of keyframes. Drew et al. [21] used both compression and clustering for efficient summarization of videos. Another automatic VS method is presented in [22], where the final number of clusters for summarization is determined using a dynamic calculation mechanism. Furini et al. [23] proposed an approach known as "still and moving" (STIMO), where storyboards can be generated on-the-fly. De Avila et al. [24] focused on static VS using color features and k-means clustering, as well as the nomination of a novel summary evaluation approach. Mahmoud et al. [25] enhanced the VS pipeline via a density-assisted spatial clustering approach. Xu et al. [26] used clustering along with semantic, emotional, and shoot-quality clues for summarization of user-generated videos.

Besides the compression domain and clustering, numerous action, activity, and event based summarization methods have also been proposed in recent years. For instance, Meghdadi et al. [27] explored surveillance videos through summarization of their shots. A surveillance VS method using abnormality detection is presented in [28]. Mademlis et al. [29] investigated an activity based VS approach. Other related works using event detection for summarization are explored in [30] and [31]. Continuing the usage of recognition, Wang et al. [32] investigated web videos for summarization using an event driven approach with tags and recognition of key shots. Thomas et al. [33] emphasized the detection of perceptually important events for VS.
Rabbouch et al. [34] used a cluster analysis based VS approach, with application to automatically counting and recognizing vehicles in a surveillance video stream. Zhang et al. [35] based their VS on spatio-temporal analysis for the detection of object motion, from which an attention curve is generated and keyframes are extracted. Besides clustering, graph based VS methods have also been exploited. For instance, Kuanar et al. [36] presented a VS approach with intelligent utilization of bipartite graph matching for modelling inter-view dependencies in multi-view videos and an optimum-path forest scheme for clustering. In continuation of this method, [37] used a graph-assisted hierarchical method for VS.

In an attempt to extend the usefulness of VS, Bagheri et al. [38] proposed a method for temporal mapping of surveillance videos for indexing and searching. Kannan et al. [39] enriched the summarization pipeline with personalization for movie data. Varini et al. [40] explored personalized VS for egocentric videos with consideration of user preferences. Chen et al. [41] focused on resource allocation for personalized VS. Hamza et al. [42] used both personalization and privacy preservation for medical video summarization. Sparse coding has also been used in the summarization pipeline. For instance, Mei et al. [43] used a minimum sparse reconstruction strategy for summarization. Sparse coding with context-awareness for surveillance videos is presented in [44].

Summarizing the current literature, it can be observed that certain methods are capable of generating good summaries, but their computationally expensive behaviour limits their usefulness for surveillance networks and devices with constrained resources. On the other hand, the efficient VS methods are not competitive enough to detect important frames from a real surveillance stream. Considering these challenges, in this paper we present an efficient VS method for devices with constrained resources. We attempt to contribute to the shot segmentation and keyframe selection stages of the video summarization pipeline. Shot segmentation is the prerequisite and most important step, as the generated summary is heavily dependent on it. Therefore, we use deep features extracted from convolutional neural networks (CNNs) for shot segmentation. To measure the importance of frames, we use image memorability and entropy, based on which keyframes are chosen from each video shot. Our scheme is efficient as well as intelligent enough to nominate important frames in a surveillance stream, thus making it a good fit for embedded processing and adaptation in visual surveillance networks.

The rest of the paper is structured as follows. Section 2 explains our proposed framework, with experimental results and discussion in Section 3. Section 4 concludes this work with recommendations for future research.

2. The proposed framework

Motivated by the strength of CNNs for various applications, this framework aims to provide an energy-efficient CNN based VS method for surveillance videos captured by resource-constrained devices. The framework is three-fold: shot segmentation using deep features, computation of image memorability and entropy for each frame of the shots, and keyframe selection from each shot for summary generation. An additional post-processing step using color histogram difference is finally used to discard duplicate frames. The framework with its main components is shown in Fig. 1.

[Fig. 1. The proposed framework]

2.1 Shot segmentation

Shot segmentation is a challenging problem for VS methods, as the diversity, coverage, and interestingness of the video summary depend on the segmented shots. Recently, numerous shot segmentation methods have been presented for different applications, including anchor detection in news videos, VS, and video indexing and retrieval. For instance, Ji et al. [45] used a four-threshold approach for shot segmentation, based on which the anchor person is detected in news videos. Priya et al. [46] used shot segmentation for indexing and retrieval of ecological videos. Li et al. [47] used sparse coding for shot segmentation, which in turn is used for summarization. The mentioned methods are domain specific and their performance on surveillance videos is limited. Considering this, we investigated deep features for shot segmentation to intelligently divide the video stream into meaningful shots. Our approach extracts deep features from two consecutive frames to determine whether the underlying frames belong to the same or different shots. Features are extracted from the fully connected layer (FC7) of a CNN model trained using the MobileNet architecture (version 2) on the ImageNet dataset. This model is originally trained for classification but has learned globally discriminative features, which can be used for other purposes. After our analysis, we found that the deep features learned at the higher fully connected layer (FC7) of this model are suitable for shot segmentation; thus, we use them in our framework.

After feature extraction, the extracted deep features of the two frames are compared using the Euclidean distance between them, whose formula is given in Eq. 1:

dist = \sqrt{\sum_{i=1}^{n} (X_i - X'_i)^2}        (1)

Herein n is the length of the compared feature vectors, X is the feature vector of the first frame, and X' is that of the next selected frame. The value returned by this function is normalized between 0 and 1. The optimal threshold selected for discriminating same and different shots is 0.7, which was decided after several experiments. Two frames having a Euclidean distance less than or equal to 0.7 are considered to be from the same shot; otherwise, the two underlying frames are considered to belong to different shots.
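For concreteness, the boundary test reduces to a feature extraction call followed by a thresholded distance. Below is a minimal Python sketch, assuming torchvision's ImageNet-pretrained MobileNetV2 (the weights="DEFAULT" API needs torchvision >= 0.13) with its globally pooled last-stage features standing in for the FC7 activations, and unit-length normalization as one way to keep the distance in a bounded range; the 0.7 threshold is the value reported above, while all function names are ours.

import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T

# torchvision >= 0.13 weight API; older versions use pretrained=True instead
model = models.mobilenet_v2(weights="DEFAULT").eval()
preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def deep_features(frame):
    """Pooled last-stage MobileNetV2 features for one RGB uint8 frame (H x W x 3)."""
    with torch.no_grad():
        x = preprocess(frame).unsqueeze(0)        # 1 x 3 x 224 x 224
        fmap = model.features(x)                  # 1 x 1280 x 7 x 7
        vec = fmap.mean(dim=[2, 3]).squeeze(0).numpy()   # global average pooling
    return vec / (np.linalg.norm(vec) + 1e-8)     # unit length: bounded distances

def same_shot(frame_a, frame_b, threshold=0.7):
    """Eq. 1 on consecutive frames; distance <= 0.7 means the same shot."""
    dist = np.linalg.norm(deep_features(frame_a) - deep_features(frame_b))
    return dist <= threshold

Consecutive frames are then scanned pairwise; a False return marks a shot boundary.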
2.2 Features extraction

It is well agreed that every individual knows how memorable an image or a scene is in his mind, depending upon the image or scene. Certain images get more attention from users than others. For instance, images containing people and events such as fights, romance, and actions are more memorable to users than images with normal, neutral events. Thus, frames having people in the center obtain higher memorability values than frames with people on the right or left. The same can be observed in videos: video frames containing people and salient objects attract the user's attention and have a higher probability of being selected as keyframes. Sample keyframes with their memorability scores are shown in Fig. 2. The frames having more objects, with objects in the center, received higher memorability scores than frames with either few objects or objects at the sides of the frame.

The image memorability model is trained on a large image memorability dataset [48], having 60,000 images from diverse resources, using CNNs. This dataset consists of images with memorability scores ranging from 0 to 1. Thus, the final trained classifier outputs a real value between 0 and 1.

[Fig. 2. Sample frames with memorability scores predicted by the trained image memorability model]

Selecting keyframes using only the memorability score may result in a summary that does not adequately represent the entire video. Memorability does not preserve the heterogeneity of keyframes, and thus we added an entropy measure to our framework to maintain the diversity of our summary. The entropy of an image reflects the information inside it, and its score is directly proportional to the amount of information. Thus, the features extraction phase includes memorability prediction and computation of an entropy score.
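As a concrete reading of this step, the following is a minimal sketch, assuming Shannon entropy over the 8-bit grayscale histogram as the entropy measure and a memorability value supplied by the trained model of [48]; the equal weighting of the two cues is our assumption, since the paper does not state how the scores are combined.

import numpy as np

def shannon_entropy(gray):
    """Entropy (bits) of an 8-bit grayscale image's intensity histogram."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                                  # 0 * log 0 treated as 0
    return float(-(p * np.log2(p)).sum())

def frame_score(memorability, gray, w=0.5):
    """Combine the CNN-predicted memorability (in [0, 1]) with normalized
    entropy; the equal weighting w is our assumption, not the paper's."""
    entropy = shannon_entropy(gray) / 8.0         # 8 bits is the 256-bin maximum
    return w * memorability + (1.0 - w) * entropy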
2.3 Keyframe extraction

Keyframe extraction is the final step of the VS pipeline, for which an attention curve is constructed using the computed features. In our case, we consider image memorability and entropy as the metrics to generate an attention curve for each shot of the input video. Within each shot, the frame with the maximum memorability and entropy score is considered the keyframe. Next, the keyframes from all shots are combined to constitute a video summary. Finally, a post-processing step is used to eliminate redundant frames. This step uses color histogram difference to discard frames of the same shots that were selected as keyframes. The advantage of this step is explained through an example in Section 3.
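A minimal sketch of this selection and post-processing step follows, assuming per-frame scores computed as in the previous sketch; the joint RGB histogram, the 8 bins per channel, and the 0.5 L1-difference cut-off are illustrative choices of ours, as the paper does not specify them.

import numpy as np

def color_histogram(frame, bins=8):
    """Normalized joint RGB histogram (8 bins per channel here)."""
    hist, _ = np.histogramdd(frame.reshape(-1, 3).astype(float),
                             bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def summarize(shots, scores, cutoff=0.5):
    """Highest-scoring frame per shot, then near-duplicate removal."""
    keyframes = [shot[int(np.argmax(s))] for shot, s in zip(shots, scores)]
    summary = []
    for kf in keyframes:
        h = color_histogram(kf)
        # keep only frames whose histogram differs enough from all kept ones
        if all(np.abs(h - color_histogram(k)).sum() > cutoff for k in summary):
            summary.append(kf)
    return summary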
3. Experimental results and discussion

The proposed method is tested on two different video datasets and the results are compared with state-of-the-art methods for performance evaluation. In the field of static video summarization, different methods evaluate their experimental results on different datasets. Many of them neither made their datasets publicly available nor publicized the implementation details for other researchers. Thus, comparison with a large number of VS techniques is nearly impossible. Considering this, we compared our scheme with five methods on two publicly available video datasets, each containing 50 videos. The reason is the diversity and the several different video categories of the two selected datasets. It is challenging and time consuming to find the ground truth summary for each video in a dataset and, due to this difficulty, benchmark VS datasets are comparatively scarce in the literature. The selected datasets contain five users' summaries for each video, making quantitative evaluation and comparison with the state-of-the-art easy and straightforward.

3.1 Results on Open Video project database

The open video (OV) dataset [49] consists of videos in MPEG-1 format with 30 fps and dimensions of 352 x 240 pixels. The videos have sound information and the frames are in RGB format. The dataset videos are distributed over several categories such as documentary, educational, ephemeral, historical, and lecture. The length of the videos varies from 1 to 4 minutes and the total duration of the whole dataset is approximately 75 minutes. Using this dataset, the proposed method is compared with several methods, including OV [49], DT [20], STIMO [23], VSCAN [50], and VSUMM [24], using the standard evaluation metrics of precision, recall, and F-measure [51, 52], whose details are given in Eq. 2-Eq. 4.

Precision: It corresponds to the accuracy of a method, making use of the falsely extracted keyframes, as shown in Eq. 2:

Precision = N_matched / N_extracted        (2)

Herein "N_matched" is the number of keyframes matched between the ground truth summary and the summary of our proposed method, and "N_extracted" represents the number of extracted keyframes in a summary.

Recall: It corresponds to the possibility of extraction over all ground truth keyframes:

Recall = N_matched / N_groundtruth        (3)

Herein "N_groundtruth" represents the number of keyframes in the ground truth summary. The F-measure is calculated using Eq. 4:

F = 2 * (Precision * Recall) / (Precision + Recall)        (4)
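A small sketch of Eq. 2-Eq. 4 follows; it assumes the matching of extracted keyframes to ground-truth keyframes has already been performed (e.g., by visual comparison as in the VSUMM evaluation [24]), so only the counts are needed, and the example numbers are illustrative.

def precision_recall_f(n_matched, n_extracted, n_groundtruth):
    """Eq. 2-Eq. 4 from keyframe match counts."""
    precision = n_matched / n_extracted
    recall = n_matched / n_groundtruth
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Example: 9 of 12 extracted keyframes match a 10-keyframe ground truth.
p, r, f = precision_recall_f(9, 12, 10)           # 0.75, 0.90, ~0.82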


A set of representative videos is shown in Table 1 along with their F-measure scores. From Table 1, it can be observed that the F-measure of the proposed system is much higher than that of every single existing technique on the same video. The last row of Table 1 shows the average F-measure score for each method under observation, confirming the superiority of our scheme in generating relevant video summaries.

3.2 Results on YouTube database

These 50 videos were downloaded from YouTube [53]; they belong to different genres and range from 1 to 10 minutes. The results of our method on this dataset are compared with 5 user-generated summaries, VSUMM [24], and the method of Fei et al. [54]. The main difference between the proposed method and [54] is the way shots are segmented. They used perceptual hashing for shot segmentation, whereas we employed deep features, which are much more powerful than traditional methods. The details of shot segmentation are given in Section 2.1. For comparison, a representative video from each of several video genres is selected and the results are compared with the other methods as well as with the 5 user-generated summaries, as shown in Table 2. The best, average, and worst F-measure scores are computed by comparison with the keyframes generated by the different subjects. The summary generated by our method scored best compared to the other methods, showing that the generated summary is representative enough to satisfy users' needs.

Figure 3 and Fig. 4 show the visual comparison of the summaries generated by our method, [54], VSUMM [24], and 5 different user-generated summaries. From the whole dataset, only representative videos are chosen and their summaries are shown. In Fig. 3, it can be observed that the summary generated by our method consists of salient frames and is close to the users' summaries. Also, our summary contains the visually important frames that are marked by users in their generated summaries. Figure 4 shows the summary of video #88, in which our method extracted comparatively more frames than some user-generated summaries, but these additional frames are also salient and memorable.
To select the best set of representative frames, a post-processing step is used to discard any nearly duplicate frames from the generated summary. The advantage of this step is evident from Fig. 5. Finally, it is important to report the running time of a system to highlight its applicability. To this end, our method can process 18 fps, making it a suitable summarization approach for visual surveillance.

Table 1: Sample results for selected videos using F-measure score

Video No                  OV [49]   DT [20]   STIMO [23]   VSUMM [24]   VSCAN [50]   Proposed Method
V1                        0.66      0.24      0.60         0.58         0.58         0.78
V3                        0.74      0.66      0.46         0.73         0.77         0.88
V5                        0.46      0.31      0.54         0.75         0.87         0.72
V49                       0.70      0.67      0.65         0.84         0.85         0.80
V50                       0.77      0.77      0.89         0.71         0.81         0.80
Average F-measure score   0.66      0.53      0.62         0.72         0.77         0.79

Table 2: Mean F-measure of each category sample video of the proposed method with VSUMM, [54], and user summaries

Category   Video #   Users (Worst)   Users (Average)   Users (Best)   VSUMM [24]   [54]   Proposed Method
Cartoons   V11       0.67            0.74              0.77           0.68         0.82   0.82
Cartoons   V12       0.65            0.71              0.76           0.65         0.67   0.73
News       V88       0.68            0.76              0.84           0.67         0.75   0.78
Home       V108      0.57            0.67              0.79           0.52         0.73   0.73
Average F-measure    0.64            0.72              0.79           0.63         0.74   0.76

[Fig. 3. Video summary of our method, users, VSUMM [24], and [54] for video "v11"]

[Fig. 4. Video summary of our method, users, VSUMM [24], and [54] for video "v88"]


[Fig. 5. Effect of post-processing on video summary. (a) Original summary generated by our method. (b) Summary after removing redundancy through histogram difference]

4 Conclusion and future work

Convolutional neural networks have shown state-of-the-art performance in addressing different problems of video surveillance such as event detection and scene recognition. Recently, they have been applied to video summarization; however, their computational complexity limits their suitability for surveillance networks with limited resources. With this motivation, we studied light-weight CNNs and propose an efficient VS method for surveillance videos in resource-constrained scenarios. Our method is three-fold: shot segmentation using deep features, memorability and entropy score prediction, and summary generation. Shot segmentation is the most critical step of video summarization; therefore, we performed it through deep features to intelligently divide the surveillance video into meaningful shots. Next, the memorability of each frame within a shot is predicted using a fine-tuned image memorability prediction model, followed by an entropy measure. Finally, the frame with the highest memorability and entropy score within each shot is picked to constitute the final summary. The obtained results on two benchmark datasets show the significance of this method for adaptability in resource-constrained surveillance networks for summarization.

The current method can summarize a video stream at 18 fps. Further research is needed to improve this processing rate and combine it with spectrum sensing technologies for smarter surveillance. We also plan to apply and extend this work to other resource-constrained environments such as wireless sensor networks [55, 56] and the internet of things [57, 58].

Acknowledgment

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2016R1A2B4011712).

References

[1] K. Muhammad, J. Ahmad, Z. Lv, P. Bellavista, P. Yang, and S. W. Baik, "Efficient Deep CNN-Based Fire Detection and Localization in Video Surveillance Applications," IEEE Transactions on Systems, Man, and Cybernetics: Systems, pp. 1-16, 2018.
[2] K. Muhammad, J. Ahmad, I. Mehmood, S. Rho, and S. W. Baik, "Convolutional Neural Networks Based Fire Detection in Surveillance Videos," IEEE Access, vol. 6, pp. 18174-18183, 2018.
[3] K. Wang, Y. Shao, L. Shu, C. Zhu, and Y. Zhang, "Mobile big data fault-tolerant processing for ehealth networks," IEEE Network, vol. 30, pp. 36-42, 2016.
[4] X. Chang, Z. Ma, Y. Yang, Z. Zeng, and A. G. Hauptmann, "Bi-level semantic representation analysis for multimedia event detection," IEEE Transactions on Cybernetics, vol. 47, pp. 1180-1197, 2017.
[5] X. Chang, Y.-L. Yu, Y. Yang, and E. P. Xing, "Semantic pooling for complex event analysis in untrimmed videos," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 1617-1632, 2017.
[6] G. Han, L. Liu, W. Zhang, and S. Chan, "A hierarchical jammed-area mapping service for ubiquitous communication in smart communities," IEEE Communications Magazine, vol. 56, pp. 92-98, 2018.
[7] Y. Wu, F. Hu, G. Min, and A. Y. Zomaya, Big Data and Computational Intelligence in Networking: CRC Press, 2017.
[8] K. Muhammad, J. Ahmad, and S. W. Baik, "Early fire detection using convolutional neural networks during surveillance for effective disaster management," Neurocomputing, vol. 288, pp. 30-42, 2018.
[9] K. Muhammad, R. Hamza, J. Ahmad, J. Lloret, H. H. G. Wang, and S. W. Baik, "Secure Surveillance Framework for IoT Systems Using Probabilistic Image Encryption," IEEE Transactions on Industrial Informatics, 2018.
[10] A. B. Mabrouk and E. Zagrouba, "Abnormal behavior recognition for intelligent video surveillance systems: A review," Expert Systems with Applications, 2017.
[11] X. Chang and Y. Yang, "Semisupervised feature analysis by mining correlations among multiple tasks," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, pp. 2294-2305, 2017.
[12] X. Chang, Z. Ma, M. Lin, Y. Yang, and A. G. Hauptmann, "Feature interaction augmented sparse learning for fast kinect motion detection," IEEE Transactions on Image Processing, vol. 26, pp. 3911-3920, 2017.
[13] K. Muhammad, M. Sajjad, and S. W. Baik, "Dual-Level Security based Cyclic18 Steganographic Method and its Application for Secure Transmission of Keyframes during Wireless Capsule Endoscopy," Journal of Medical Systems, vol. 40, p. 114, 2016.
[14] M. Sajjad, S. Khan, T. Hussain, K. Muhammad, A. K. Sangaiah, A. Castiglione, et al., "CNN-based anti-spoofing two-tier multi-factor authentication system," Pattern Recognition Letters, 2018.
[15] I. Mehmood, M. Sajjad, W. Ejaz, and S. W. Baik, "Saliency-directed prioritization of visual data in wireless surveillance networks," Information Fusion, vol. 24, pp. 16-30, 2015.


[16] A. Bradai, K. Singh, A. Rachedi, and T. Ahmed, "EMCOS: Energy-efficient mechanism for multimedia streaming over cognitive radio sensor networks," Pervasive and Mobile Computing, vol. 22, pp. 16-32, 2015.
[17] J. Almeida, N. J. Leite, and R. d. S. Torres, "Online video summarization on compressed domain," Journal of Visual Communication and Image Representation, vol. 24, pp. 729-738, 2013.
[18] S.-z. Wang, Z.-y. Wang, and R.-m. Hu, "Surveillance video synopsis in the compressed domain for fast video browsing," Journal of Visual Communication and Image Representation, vol. 24, pp. 1431-1442, 2013.
[19] J. Almeida, N. J. Leite, and R. d. S. Torres, "Vison: Video summarization for online applications," Pattern Recognition Letters, vol. 33, pp. 397-409, 2012.
[20] P. Mundur, Y. Rao, and Y. Yesha, "Keyframe-based video summarization using Delaunay clustering," International Journal on Digital Libraries, vol. 6, pp. 219-232, 2006.
[21] M. S. Drew and J. Au, "Clustering of compressed illumination-invariant chromaticity signatures for efficient video summarization," Image and Vision Computing, vol. 21, pp. 705-716, 2003.
[22] D. P. Papadopoulos, V. S. Kalogeiton, S. A. Chatzichristofis, and N. Papamarkos, "Automatic summarization and annotation of videos with lack of metadata information," Expert Systems with Applications, vol. 40, pp. 5765-5778, 2013.
[23] M. Furini, F. Geraci, M. Montangero, and M. Pellegrini, "STIMO: STIll and MOving video storyboard for the web scenario," Multimedia Tools and Applications, vol. 46, pp. 47-69, 2010.
[24] S. E. F. De Avila, A. P. B. Lopes, A. da Luz, and A. de Albuquerque Araújo, "VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method," Pattern Recognition Letters, vol. 32, pp. 56-68, 2011.
[25] K. M. Mahmoud, M. A. Ismail, and N. M. Ghanem, "Vscan: an enhanced video summarization using density-based spatial clustering," in International conference on image analysis and processing, 2013, pp. 733-742.
[26] B. Xu, X. Wang, and Y.-G. Jiang, "Fast Summarization of User-Generated Videos: Exploiting Semantic, Emotional, and Quality Clues," IEEE MultiMedia, vol. 23, pp. 23-33, 2016.
[27] A. H. Meghdadi and P. Irani, "Interactive exploration of surveillance video through action shot summarization and trajectory visualization," IEEE Transactions on Visualization and Computer Graphics, vol. 19, pp. 2119-2128, 2013.
[28] W. Lin, Y. Zhang, J. Lu, B. Zhou, J. Wang, and Y. Zhou, "Summarizing surveillance videos with local-patch-learning-based abnormality detection, blob sequence optimization, and type-based synopsis," Neurocomputing, vol. 155, pp. 84-98, 2015.
[29] I. Mademlis, A. Tefas, and I. Pitas, "A salient dictionary learning framework for activity video summarization via key-frame extraction," Information Sciences, vol. 432, pp. 319-331, 2018.
[30] X. Song, L. Sun, J. Lei, D. Tao, G. Yuan, and M. Song, "Event-based large scale surveillance video summarization," Neurocomputing, vol. 187, pp. 66-74, 2016.
[31] S. S. Thomas, S. Gupta, and V. K. Subramanian, "Event Detection on Roads Using Perceptual Video Summarization," IEEE Transactions on Intelligent Transportation Systems, 2017.
[32] M. Wang, R. Hong, G. Li, Z.-J. Zha, S. Yan, and T.-S. Chua, "Event driven web video summarization by tag localization and key-shot identification," IEEE Transactions on Multimedia, vol. 14, pp. 975-985, 2012.
[33] S. S. Thomas, S. Gupta, and V. K. Subramanian, "Perceptual Video Summarization—A New Framework for Video Summarization," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, pp. 1790-1802, 2017.
[34] H. Rabbouch, F. Saâdaoui, and R. Mraihi, "Unsupervised video summarization using cluster analysis for automatic vehicles counting and recognizing," Neurocomputing, vol. 260, pp. 157-173, 2017.
[35] Y. Zhang, R. Tao, and Y. Wang, "Motion-state-adaptive video summarization via spatiotemporal analysis," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, pp. 1340-1352, 2017.
[36] S. K. Kuanar, K. B. Ranga, and A. S. Chowdhury, "Multi-view video summarization using bipartite matching constrained optimum-path forest clustering," IEEE Transactions on Multimedia, vol. 17, pp. 1166-1173, 2015.
[37] L. dos Santos Belo, C. A. Caetano Jr, Z. K. G. do Patrocínio Jr, and S. J. F. Guimarães, "Summarizing video sequence using a graph-based hierarchical approach," Neurocomputing, vol. 173, pp. 1001-1016, 2016.
[38] S. Bagheri, J. Y. Zheng, and S. Sinha, "Temporal mapping of surveillance video for indexing and summarization," Computer Vision and Image Understanding, vol. 144, pp. 237-257, 2016.
[39] R. Kannan, G. Ghinea, and S. Swaminathan, "What do you wish to see? A summarization system for movies based on user preferences," Information Processing & Management, vol. 51, pp. 286-305, 2015.
[40] P. Varini, G. Serra, and R. Cucchiara, "Personalized Egocentric Video Summarization of Cultural Tour on User Preferences Input," IEEE Transactions on Multimedia, vol. 19, pp. 2832-2845, 2017.
[41] F. Chen, C. De Vleeschouwer, and A. Cavallaro, "Resource allocation for personalized video summarization," IEEE Transactions on Multimedia, vol. 16, pp. 455-469, 2014.
[42] R. Hamza, K. Muhammad, Z. Lv, and F. Titouna, "Secure video summarization framework for personalized wireless capsule endoscopy," Pervasive and Mobile Computing, vol. 41, pp. 436-450, 2017.
[43] S. Mei, G. Guan, Z. Wang, S. Wan, M. He, and D. D. Feng, "Video summarization via minimum sparse reconstruction," Pattern Recognition, vol. 48, pp. 522-533, 2015.
[44] S. Zhang, Y. Zhu, and A. K. Roy-Chowdhury, "Context-Aware Surveillance Video Summarization," IEEE Transactions on Image Processing, vol. 25, pp. 5469-5478, 2016.
[45] P. Ji, L. Cao, X. Zhang, L. Zhang, and W. Wu, "News videos anchor person detection by shot clustering," Neurocomputing, vol. 123, pp. 86-99, 2014.
[46] G. L. Priya and S. Domnic, "Shot based keyframe extraction for ecological video indexing and retrieval," Ecological Informatics, vol. 23, pp. 107-117, 2014.
[47] J. Li, T. Yao, Q. Ling, and T. Mei, "Detecting shot boundary with sparse coding for video summarization," Neurocomputing, vol. 266, pp. 66-78, 2017.
[48] A. Khosla, A. S. Raju, A. Torralba, and A. Oliva, "Understanding and predicting image memorability at a large scale," in Computer Vision (ICCV), 2015 IEEE International Conference on, 2015, pp. 2390-2398.


[49] D. DeMenthon, V. Kobla, and D. Doermann, "Video summarization by curve simplification," in Proceedings of the sixth ACM international conference on Multimedia, 1998, pp. 211-218.
[50] K. M. Mahmoud, M. A. Ismail, and N. M. Ghanem, "VSCAN: An Enhanced Video Summarization Using Density-Based Spatial Clustering," Berlin, Heidelberg, 2013, pp. 733-742.
[51] K. Muhammad, J. Ahmad, M. Sajjad, and S. W. Baik, "Visual saliency models for summarization of diagnostic hysteroscopy videos in healthcare systems," SpringerPlus, vol. 5, p. 1495, 2016.
[52] K. Muhammad, M. Sajjad, M. Y. Lee, and S. W. Baik, "Efficient visual attention driven framework for key frames extraction from hysteroscopy videos," Biomedical Signal Processing and Control, vol. 33, pp. 161-168, 2017.
[53] J. D. Mitchell-Jackson, "Energy needs in an internet economy: A closer look at data centers," Master of Science Thesis, Energy and Resources Group, University of California at Berkeley, 2001.
[54] M. Fei, W. Jiang, and W. Mao, "Memorable and rich video summarization," Journal of Visual Communication and Image Representation, vol. 42, pp. 207-217, 2017.
[55] G. Han, L. Liu, S. Chan, R. Yu, and Y. Yang, "HySense: A hybrid mobile crowdsensing framework for sensing opportunities compensation under dynamic coverage constraint," IEEE Communications Magazine, vol. 55, pp. 93-99, 2017.
[56] Y. Wu, G. Min, K. Li, and B. Javadi, "Modeling and analysis of communication networks in multicluster systems under spatio-temporal bursty traffic," IEEE Transactions on Parallel and Distributed Systems, vol. 23, pp. 902-912, 2012.
[57] G. Han, L. Zhou, H. Wang, W. Zhang, and S. Chan, "A source location protection protocol based on dynamic routing in WSNs for the Social Internet of Things," Future Generation Computer Systems, 2017.
[58] Y. Ma, Y. Wu, J. Ge, and J. Li, "An Architecture for Accountable Anonymous Access in the Internet-of-Things Network," IEEE Access, 2018.

