Muhammad 2018
PII: S0167-8655(18)30384-2
DOI: https://doi.org/10.1016/j.patrec.2018.08.003
Reference: PATREC 7267
Please cite this article as: Khan Muhammad , Tanveer Hussain , Sung Wook Baik , Efficient CNN
based summarization of surveillance videos for resource-constrained devices, Pattern Recognition
Letters (2018), doi: https://doi.org/10.1016/j.patrec.2018.08.003
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
Highlights
We propose an energy-efficient CNN-based framework for summarization of surveillance videos
Our system uses a novel shot segmentation scheme based on deep features
Keyframe selection is based on image memorability, providing a diverse and interesting summary
Efficient CNN based summarization of surveillance videos for resource-constrained devices

Khan Muhammad, Tanveer Hussain, Sung Wook Baik*
Intelligent Media Laboratory, Digital Contents Research Institute, Sejong University, Seoul-143-747, Republic of Korea
ABSTRACT
The widespread usage of surveillance cameras in smart cities has resulted in a gigantic volume of video data whose indexing, retrieval, and management is a challenging issue. Video summarization aims to detect important visual data from the surveillance stream and can help in efficient indexing and retrieval of required data from huge surveillance datasets. In this research article, we propose an efficient convolutional neural network (CNN) based summarization method for surveillance videos of resource-constrained devices. Shot segmentation is considered the backbone of video summarization methods and it affects the overall quality of the generated summary. Thus, we propose an effective shot segmentation method using deep features. Furthermore, our framework maintains the interestingness of the generated summary using image memorability and entropy. Within each shot, the frame with the highest memorability and entropy score is considered a keyframe. The proposed method is evaluated on two benchmark video datasets and the results are encouraging compared to state-of-the-art video summarization methods.
applications such as disaster management [1, 2], e-health [3], surveillance [4, 5], ubiquitous communication [6, 7], and security systems [8-14]. Despite their maturity up to some extent, their real implementation is a challenge due to the increased processing and transmission power needed for visual sensors to process and transmit the video data. This also affects the decision-making process, which is based on observation of the surveillance environment. Recently, intelligent surveillance mechanisms have been presented, attempting to reduce the processing and bandwidth consumption. For instance, Irfan et al. [15] used salient motion detection for image prioritization in multi-camera surveillance networks with reduced processing time and reliable dissemination of important contents to the base station via a sink node. Bradai et al. [16] presented "EMCOS", which is a computationally friendly approach for streaming multimedia data in cognitive radio networks. Considering the increasing volume of video data, Almeida et al. [17] presented an online video summarization (VS) method in the compressed domain, which was later extended by Wang et al. [18] for video synopsis. Another VS method for online applications using the compression domain is presented in [19]. Mundur et al. [20] used Delaunay triangulation for keyframe-based summarization. Furini et al. [23] proposed an approach known as "still and moving" (STIMO) where storyboards can be generated on-the-fly. De Avila et al. [24] focused on static VS using color features and k-means clustering as well as nomination of a novel summary evaluation approach. Mahmoud et al. [25] enhanced the VS pipeline via a density-assisted spatial clustering approach. Xu et al. [26] used clustering along with semantic, emotional, and shoot-quality clues for summarization of user-generated videos.

Besides compression domain and clustering, numerous actions, activities, and events based summarization methods have also been proposed in recent years. For instance, Meghdadi et al. [27] explored surveillance videos using summarization of their shots. A surveillance VS method using abnormality detection is presented in [28]. Mademlis et al. [29] investigated an activity-based VS approach. Other related works using event detection for summarization are explored in [30] and [31]. Continuing the usage of recognition, Wang et al. [32] investigated web videos for summarization using an event-driven approach with tags and recognition of key shots. Thomas et al. [33] emphasized detection of perceptually important events for VS. Rabbouch et al.
[34] used a cluster analysis based VS approach with application to automatically counting and recognizing vehicles in surveillance video streams. Zhang et al. [35] based their VS on spatio-temporal analysis for detection of object motion, from which an attention curve is generated and keyframes are extracted. Besides clustering, graph-based VS methods have also been exploited. For instance, Kuanar et al. [36] presented a VS approach with intelligent utilization of bipartite graph matching for modelling inter-view dependencies in multi-view videos and an optimum-path forest scheme for clustering. In continuation with this method, [37] used a graph-assisted hierarchical method for VS.

In an attempt to extend the usefulness of VS, Bagheri et al. [38] proposed a method for temporal mapping of surveillance videos for indexing and searching. Kannan et al. [39] enriched the summarization pipeline with personalization for movies data. Varini et al. [40] explored personalized VS for egocentric videos with consideration of user preferences. Chen et al. [41] focused on resource allocation for personalized VS. Hamza et al. [42] used both personalization and privacy preservation for medical video summarization. Sparse coding has also been used in the summarization pipeline. For instance, Mei et al. [43] used a minimum sparse reconstruction strategy for summarization. Sparse coding with context-awareness for surveillance videos is presented by [44].

The authors of [45] used a four-threshold approach for shot segmentation, based on which an anchor person is detected in news videos. Priya et al. [46] used shot segmentation for indexing and retrieval of ecological videos. Li et al. [47] used sparse coding for shot segmentation, which in turn is used for summarization. The mentioned methods are domain specific and their performance for surveillance videos is limited. Considering this, we investigated deep features for shot segmentation to intelligently divide the video stream into meaningful shots. Our approach extracts deep features from two consecutive frames to determine whether the underlying frames belong to the same or different shots. Features are extracted from the fully connected layer (FC7) of a CNN model trained using the MobileNet architecture (version 2) on the ImageNet dataset. This model is originally trained for classification but has learned globally discriminative features, which can be used for other purposes. After our analysis, we found that the deep features learned at the higher fully connected layer (FC7) of this model are suitable for shot segmentation; thus, we used it in our framework.

After features extraction, the extracted deep features are compared using the Euclidean distance between them, whose formula is given in Equation 1:

ED(A, B) = sqrt( Σ_{i=1}^{n} (A_i - B_i)^2 )    (1)

where A and B are the deep feature vectors of two consecutive frames and n is the feature dimensionality.
recommendation for future research.

2. The proposed framework

Motivated from the strength of CNNs for various applications, we propose a CNN-based framework for summarization of surveillance videos.

Sample keyframes with their memorability scores are shown in Fig. 2. The frames having more objects in the center received higher memorability scores compared to frames with either few objects or objects at the sides of the frame.

Fig. 2. Sample frames with memorability scores predicted by the trained image memorability model

Selecting keyframes using only the memorability score may result in a summary that does not adequately represent the entire video. Memorability does not preserve the heterogeneity of keyframes, and thus we added an entropy measure to our framework to maintain the diversity of our summary. The entropy of an image shows the amount of information inside it, and its score is directly proportional to that information. Thus, the features extraction phase includes memorability prediction and computing the entropy score.

2.3 Keyframe extraction

Keyframes extraction is the final step of the VS pipeline, for which an attention curve is constructed using the computed features. In our case, we consider image memorability and entropy as the metrics to generate an attention curve for each shot and eliminate the redundant frames. This step uses the color histogram difference to discard frames of the same shots that are selected as keyframes. The advantage of this step is explained through an example in Section 3.

3. Experimental results and discussion

The proposed method is tested on two different video datasets and the results are compared with state-of-the-art methods using precision, recall, and F-measure.

Precision: It corresponds to the possibility of correct extraction over all extracted keyframes.

Precision = N_matched / N_extracted    (2)

Herein "Nmatched" represents the number of keyframes matched between the ground truth summary and the summary generated by our method, and "Nextracted" represents the number of extracted keyframes in a summary.

Recall: It corresponds to the possibility of extraction over all ground truth keyframes.

Recall = N_matched / N_groundtruth    (3)

Herein "Ngroundtruth" represents the number of keyframes in the ground truth summary. The F-measure is calculated using Eq. 4:

F-measure = 2 * (Precision * Recall) / (Precision + Recall)    (4)

A set of representative videos is shown in Table 1 along with their F-measure scores. From Table 1, it can be observed that the F-measure of the proposed system is much higher than that of every single existing technique on the same video. The last row in Table 1 shows the average F-measure for each method under observation and shows the superiority of our scheme.

The second dataset contains videos belonging to different genres, ranging from 1 to 10 minutes. The results of our method on this dataset are compared with 5 user-generated summaries, VSUMM [24], and the method of Fei et al. [54]. The main difference between the proposed method and [54] is the way shots are segmented. They have used perceptual hashing for shot segmentation, whereas we have employed deep features, which are much more powerful than traditional methods. The details of shot segmentation are already given in Section 2.1.
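Equations 2-4 translate directly into code. The sketch below is illustrative only; the `match` predicate, which decides whether an extracted keyframe corresponds to a ground-truth keyframe, is an assumption, since the matching criterion is not detailed in this excerpt.

```python
def summary_metrics(extracted, ground_truth, match):
    """Precision, recall, and F-measure of a generated summary (Eqs. 2-4).

    extracted:    keyframes selected by the summarizer.
    ground_truth: keyframes annotated by users.
    match:        predicate deciding whether two keyframes correspond
                  (hypothetical; evaluations typically use visual similarity).
    """
    # N_matched: ground-truth keyframes with a counterpart in the summary.
    n_matched = sum(1 for g in ground_truth if any(match(g, e) for e in extracted))
    precision = n_matched / len(extracted) if extracted else 0.0
    recall = n_matched / len(ground_truth) if ground_truth else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```

For example, with frame indices as keyframes and a tolerance-based match, `summary_metrics([10, 50, 90], [10, 52, 200], lambda a, b: abs(a - b) <= 2)` matches two of three ground-truth keyframes, giving precision, recall, and F-measure of 2/3 each.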
Table 2: Mean F-measure of each category sample video of the proposed method with VSUMM, [54], and user summaries

Category            Video #   Users (Worst / Average / Best)   VSUMM [24]   [54]   Proposed Method
Cartoons            V11       0.67 / 0.74 / 0.77               0.68         0.82   0.82
Cartoons            V12       0.65 / 0.71 / 0.76               0.65         0.67   0.73
News                V88       0.68 / 0.76 / 0.84               0.67         0.75   0.78
Home                V108      0.57 / 0.67 / 0.79               0.52         0.73   0.73
Average F-measure             0.64 / 0.72 / 0.79               0.63         0.74   0.76
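The keyframe selection and post-processing of Section 2.3 can be sketched as below. This is a minimal illustration, not the authors' implementation: the memorability scores are assumed to come from an external memorability model, the simple additive combination of memorability and entropy is an assumption, and the histogram-difference threshold for dropping near-duplicate keyframes is left to the caller.

```python
import numpy as np

def entropy_score(gray: np.ndarray, bins: int = 256) -> float:
    # Shannon entropy of the intensity histogram: H = -sum(p * log2(p)).
    hist, _ = np.histogram(gray, bins=bins, range=(0, bins))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def pick_keyframe(frames, memorability_scores) -> int:
    """Index of the frame with the highest combined memorability + entropy
    score within one shot. Memorability scores are assumed to come from an
    external model; the additive combination is an illustrative choice."""
    combined = [m + entropy_score(f) for f, m in zip(frames, memorability_scores)]
    return int(np.argmax(combined))

def hist_difference(a: np.ndarray, b: np.ndarray, bins: int = 256) -> float:
    # Normalized L1 histogram difference, used to discard a keyframe that is
    # too similar to one already kept (the post-processing step of Fig. 5).
    ha, _ = np.histogram(a, bins=bins, range=(0, bins))
    hb, _ = np.histogram(b, bins=bins, range=(0, bins))
    return float(np.abs(ha - hb).sum()) / a.size
```

A keyframe whose `hist_difference` to every already-selected keyframe exceeds a chosen threshold would be kept; otherwise it is treated as redundant.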
Fig. 3. Video summary of our method, users, VSUMM [24], and [54] for video “v11”
Fig. 4. Video summary of our method, users, VSUMM [24], and [54] for video “v88”
Fig. 5. Effect of post-processing on the video summary: (a) original summary generated by our method; (b) summary after removing redundancy through histogram difference

entropy score prediction, and summary generation. Shot segmentation is the most critical step of video summarization; therefore, we performed it through deep features to intelligently divide the surveillance video into meaningful shots. Next, the memorability of each frame within the shot is predicted using a fine-tuned image memorability prediction model, followed by the entropy measure. Finally, the frame with the highest memorability and entropy score within each shot is picked to constitute the final summary. The obtained results based on two benchmark datasets show the significance of this method for adaptability in resource-constrained surveillance networks for summarization.

The current method can summarize a video stream at 18 fps. Further research is needed to improve this processing rate and to combine it with spectrum sensing technologies for smarter surveillance. We also plan to apply and extend this work to other resource-constrained environments such as wireless sensor networks [55, 56] and the internet of things [57, 58].

Acknowledgment

References

"Detection in Surveillance Videos," IEEE Access, vol. 6, pp. 18174-18183, 2018.
[3] K. Wang, Y. Shao, L. Shu, C. Zhu, and Y. Zhang, "Mobile big data fault-tolerant processing for ehealth networks," IEEE Network, vol. 30, pp. 36-42, 2016.
[4] X. Chang, Z. Ma, Y. Yang, Z. Zeng, and A. G. Hauptmann, "Bi-level semantic representation analysis for multimedia event detection," IEEE Transactions on Cybernetics, vol. 47, pp. 1180-1197, 2017.
[5] X. Chang, Y.-L. Yu, Y. Yang, and E. P. Xing, "Semantic pooling for complex event analysis in untrimmed videos," IEEE Transactions on Pattern Analysis and Machine Intelligence.
"Framework for IoT systems using Probabilistic Image Encryption," IEEE Transactions on Industrial Informatics, 2018.
[10] A. B. Mabrouk and E. Zagrouba, "Abnormal behavior recognition for intelligent video surveillance systems: A review," Expert Systems with Applications, 2017.
[11] X. Chang and Y. Yang, "Semisupervised feature analysis by mining correlations among multiple tasks," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, pp. 2294-2305, 2017.
[12] X. Chang, Z. Ma, M. Lin, Y. Yang, and A. G. Hauptmann, "Feature interaction augmented sparse learning for fast kinect motion detection," IEEE Transactions on Image Processing, vol. 26, pp. 3911-3920, 2017.
[13] K. Muhammad, M. Sajjad, and S. W. Baik, "Dual-Level Security based Cyclic18 Steganographic Method and its Application for Secure Transmission of Keyframes during Wireless Capsule Endoscopy," Journal of Medical Systems, vol. 40, p. 114, 2016.
[14] M. Sajjad, S. Khan, T. Hussain, K. Muhammad, A. K. Sangaiah, A. Castiglione, et al., "CNN-based anti-spoofing two-tier multi-factor authentication system," Pattern Recognition Letters, 2018.
[15] I. Mehmood, M. Sajjad, W. Ejaz, and S. W. Baik, "Saliency-directed prioritization of visual data in wireless surveillance networks," Information Fusion, vol. 24, pp. 16-30, 2015.
[16] A. Bradai, K. Singh, A. Rachedi, and T. Ahmed, "EMCOS: Energy-efficient mechanism for multimedia streaming over cognitive radio sensor networks," Pervasive and Mobile Computing, vol. 22, pp. 16-32, 2015.
[17] J. Almeida, N. J. Leite, and R. d. S. Torres, "Online video summarization on compressed domain," Journal of Visual Communication and Image Representation, vol. 24, pp. 729-738, 2013.
[18] S.-z. Wang, Z.-y. Wang, and R.-m. Hu, "Surveillance video synopsis in the compressed domain for fast video browsing," Journal of Visual Communication and Image Representation, vol. 24, pp. 1431-1442, 2013.
[19] J. Almeida, N. J. Leite, and R. d. S. Torres, "Vison: Video summarization for online applications," Pattern Recognition Letters, vol. 33, pp. 397-409, 2012.
[20] P. Mundur, Y. Rao, and Y. Yesha, "Keyframe-based video summarization using Delaunay clustering," International Journal on Digital Libraries, vol. 6, pp. 219-232, 2006.
[21] M. S. Drew and J. Au, "Clustering of compressed illumination-invariant chromaticity signatures for efficient video summarization," Image and Vision Computing, vol. 21, pp. 705-716, 2003.
[22] D. P. Papadopoulos, V. S. Kalogeiton, S. A. Chatzichristofis, and N. Papamarkos, "Automatic summarization and annotation of videos with lack of metadata information," Expert Systems with Applications, vol. 40, pp. 5765-5778, 2013.
[23] M. Furini, F. Geraci, M. Montangero, and M. Pellegrini, "STIMO: STIll and MOving video storyboard for the web scenario," Multimedia Tools and Applications, vol. 46, pp. 47-69, 2010.
[24] S. E. F. De Avila, A. P. B. Lopes, A. da Luz, and A. de Albuquerque Araújo, "VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method," Pattern Recognition Letters, vol. 32, pp. 56-68, 2011.
[25] K. M. Mahmoud, M. A. Ismail, and N. M. Ghanem, "Vscan: an enhanced video summarization using density-based spatial clustering," in International conference on image analysis and processing, 2013, pp. 733-742.
[26] B. Xu, X. Wang, and Y.-G. Jiang, "Fast Summarization of User-Generated Videos: Exploiting Semantic, Emotional, and Quality Clues," IEEE MultiMedia.
[32] M. Wang, R. Hong, G. Li, Z.-J. Zha, S. Yan, and T.-S. Chua, "Event driven web video summarization by tag localization and key-shot identification," IEEE Transactions on Multimedia, vol. 14, pp. 975-985, 2012.
[33] S. S. Thomas, S. Gupta, and V. K. Subramanian, "Perceptual Video Summarization—A New Framework for Video Summarization," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, pp. 1790-1802, 2017.
[34] H. Rabbouch, F. Saâdaoui, and R. Mraihi, "Unsupervised video summarization using cluster analysis for automatic vehicles counting and recognizing," Neurocomputing, vol. 260, pp. 157-173, 2017.
[35] Y. Zhang, R. Tao, and Y. Wang, "Motion-state-adaptive video summarization via spatiotemporal analysis," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, pp. 1340-1352, 2017.
[36] S. K. Kuanar, K. B. Ranga, and A. S. Chowdhury, "Multi-view video summarization using bipartite matching constrained optimum-path forest clustering," IEEE Transactions on Multimedia, vol. 17, pp. 1166-1173, 2015.
[37] L. dos Santos Belo, C. A. Caetano Jr, Z. K. G. do Patrocínio Jr, and S. J. F. Guimarães, "Summarizing video sequence using a graph-based hierarchical approach," Neurocomputing, vol. 173, pp. 1001-1016, 2016.
[38] S. Bagheri, J. Y. Zheng, and S. Sinha, "Temporal mapping of surveillance video for indexing and summarization," Computer Vision and Image Understanding, vol. 144, pp. 237-257, 2016.
[39] R. Kannan, G. Ghinea, and S. Swaminathan, "What do you wish to see? A summarization system for movies based on user preferences," Information Processing & Management, vol. 51, pp. 286-305, 2015.
[40] P. Varini, G. Serra, and R. Cucchiara, "Personalized Egocentric Video Summarization of Cultural Tour on User Preferences Input," IEEE Transactions on Multimedia, vol. 19, pp. 2832-2845, 2017.
[41] F. Chen, C. De Vleeschouwer, and A. Cavallaro, "Resource allocation for personalized video summarization," IEEE Transactions on Multimedia, vol. 16, pp. 455-469, 2014.
[42] R. Hamza, K. Muhammad, Z. Lv, and F. Titouna,
"Efficient visual attention driven framework for key frames extraction from hysteroscopy videos," Biomedical Signal Processing and Control, vol. 33, pp. 161-168, 2017.
[53] J. D. Mitchell-Jackson, "Energy needs in an internet economy: A closer look at data centers," M.S. thesis, Energy and Resources Group, University of California at Berkeley, 2001.
[54] M. Fei, W. Jiang, and W. Mao, "Memorable and rich video summarization," Journal of Visual Communication and Image Representation, vol. 42, pp. 207-217, 2017.
[55] G. Han, L. Liu, S. Chan, R. Yu, and Y. Yang,
"HySense: A hybrid mobile crowdsensing framework
for sensing opportunities compensation under dynamic
coverage constraint," IEEE Communications Magazine,
vol. 55, pp. 93-99, 2017.
[56] Y. Wu, G. Min, K. Li, and B. Javadi, "Modeling and
analysis of communication networks in multicluster
systems under spatio-temporal bursty traffic," IEEE
Transactions on Parallel and Distributed Systems, vol.
23, pp. 902-912, 2012.
[57] G. Han, L. Zhou, H. Wang, W. Zhang, and S. Chan, "A