Violence Recognition and Detection in Surveillance Videos
Piotr Bilinski, Francois Bremond
Abstract

In this paper, we focus on the important topic of violence recognition and detection in surveillance videos. Our goal is to determine whether violence occurs in a video (recognition) and when it happens (detection). Firstly, we propose an extension of the Improved Fisher Vectors (IFV) for videos, which allows a video to be represented using both local features and their spatio-temporal positions. Then, we study the popular sliding window approach for violence detection, and we re-formulate the Improved Fisher Vectors and use the summed area table data structure to speed up the approach. We present an extensive evaluation, comparison and analysis of the proposed improvements on 4 state-of-the-art datasets. We show that the proposed improvements make violence recognition more accurate (as compared to the standard IFV, the IFV with a spatio-temporal grid, and other state-of-the-art methods) and make violence detection significantly faster.

1. Introduction

Video surveillance cameras are part of our lives. They are used almost everywhere, e.g. on streets, in subways, at train and bus stations, at airports, and in sport stadiums. Today's increase in threats to security in cities and towns around the world makes the use of video cameras to monitor people necessary. Attacks on people, fights, and vandalism are just some of the cases where detection systems, and violence detection systems in particular, are needed.

In this paper, we focus on the important topic of violence recognition and detection in surveillance videos. Our goal is to determine whether violence occurs in a video (recognition) and when it happens (detection).

Over the last years, several violence recognition and detection techniques have been proposed. [6] have used motion trajectory information and orientation information of a person's limbs for person-on-person fight detection. One of the main drawbacks of this approach is that it requires precise segmentation, which is very difficult to obtain in real-world videos. Instead, [7, 9, 21, 23] have focused on local features and the bag-of-features approach; the main difference between these techniques lies in the type of features used. [23] have applied the STIP and SIFT features, and [7] have used the STIP and MoSIFT features. [9] have proposed the Violent Flows (VIF) descriptor, encoding how flow-vector magnitudes change over time. [21] have proposed a video descriptor based on the substantial derivative. Despite recent improvements in violence recognition and detection, effective solutions for real-world situations are still unavailable.

The Improved Fisher Vectors (IFV) [24] is a bag-of-features-like video encoding strategy which has been shown to outperform the standard bag-of-features. It is a video (and image) descriptor obtained by pooling local features into a global representation. It describes local features by their deviation from a "universal" generative Gaussian Mixture Model. The IFV has been widely applied to recognition tasks in videos [1, 2, 11, 27, 28]. One of the main drawbacks of the IFV is that it simplifies the structure of a video by assuming conditional independence across the spatial and temporal domains; it computes global statistics of local features only, ignoring the spatio-temporal positions of features.

Clearly, spatial information may contain useful cues. A common way to use spatio-temporal information with the IFV is to use either spatio-temporal grids [16] or multi-scale pyramids [17]; however, these methods are still limited in terms of a detailed description, providing only a coarse representation. There are several other state-of-the-art methods [13, 14, 19, 25], but as they were proposed for images (for image categorization and object recognition), they cannot be directly applied to videos; moreover, [13, 14] achieve results similar to the spatial grids/pyramids, and [19] is parameter sensitive and requires additional parameter learning.
Abnormal behavior detection: There are several methods for abnormal behavior and anomaly detection [4, 18, 20, 22]. However, abnormalities do not represent a compact and well-defined concept. Abnormality detection is a different research topic, with different constraints and assumptions, and therefore we do not focus on these techniques.

As opposed to the existing violence recognition and detection methods (which focus mainly on new descriptors), we focus on the video representation model, for two reasons: to make it more accurate for violence recognition, and to make it faster for violence detection. Firstly, we propose an extension of the IFV for videos (Sec. 2.2), which allows the spatio-temporal positions of features to be used with the IFV. The proposed extension boosts the IFV and achieves better or similar accuracy (while keeping the representation more compact) as compared to the spatio-temporal grids. Then, we study and evaluate the popular sliding window approach [12] for violence detection, and we re-formulate the IFV and use the summed area table data structure to speed up the sliding window method (Sec. 2.3). Then, we present an extensive evaluation, comparison and analysis of the proposed improvements on 4 state-of-the-art datasets (Sec. 3 and Sec. 4). Finally, we conclude in Sec. 5.
2. Boosting the Improved Fisher Vectors (IFV)

2.1. State-of-the-Art: Improved Fisher Vectors

This section provides a brief description of the Improved Fisher Vectors, introduced in [24]. The mathematical notations and formulas provided here are in accordance with [24], and we refer to it for more details.

Let $X = \{x_t, t = 1 \dots T\}$ be a set of $T$ local features extracted from a video, where each local feature is of dimension $D$, $x_t \in \mathbb{R}^D$. Let $\lambda = \{w_i, \mu_i, \Sigma_i, i = 1 \dots K\}$ be the parameters of a Gaussian Mixture Model (GMM) $u_\lambda(x) = \sum_{i=1}^{K} w_i u_i(x)$ fitting the distribution of local features, where $w_i \in \mathbb{R}$, $\mu_i \in \mathbb{R}^D$ and $\Sigma_i \in \mathbb{R}^{D \times D}$ are respectively the mixture weight, mean vector and covariance matrix of the $i$-th Gaussian $u_i$. We assume that the covariance matrices are diagonal and denote by $\sigma_i^2$ the variance vector, i.e. $\Sigma_i = \mathrm{diag}(\sigma_i^2)$, $\sigma_i^2 \in \mathbb{R}^D$.

Moreover, let $\gamma_t(i)$ be the soft assignment of a descriptor $x_t$ to a Gaussian $i$:

$$\gamma_t(i) = \frac{w_i u_i(x_t)}{\sum_{j=1}^{K} w_j u_j(x_t)}, \tag{1}$$

and let $G^X_{\mu,i}$ (resp. $G^X_{\sigma,i}$) be the gradient w.r.t. the mean $\mu_i$ (resp. standard deviation $\sigma_i$) of a Gaussian $i$:

$$G^X_{\mu,i} = \frac{1}{T\sqrt{w_i}} \sum_{t=1}^{T} \gamma_t(i) \left( \frac{x_t - \mu_i}{\sigma_i} \right), \tag{2}$$

$$G^X_{\sigma,i} = \frac{1}{T\sqrt{2w_i}} \sum_{t=1}^{T} \gamma_t(i) \left[ \frac{(x_t - \mu_i)^2}{\sigma_i^2} - 1 \right], \tag{3}$$

where the division between vectors is a term-by-term operation. Then, the gradient vector $G^X_\lambda$ is the concatenation of all the $K$ gradient vectors $G^X_{\mu,i} \in \mathbb{R}^D$ and all the $K$ gradient vectors $G^X_{\sigma,i} \in \mathbb{R}^D$, $i = 1 \dots K$:

$$G^X_\lambda = [\,G^X_{\mu,1}, G^X_{\sigma,1}, \dots, G^X_{\mu,K}, G^X_{\sigma,K}\,]^{\prime}. \tag{4}$$

The IFV representation $\Phi^X_\lambda \in \mathbb{R}^{2DK}$ is the gradient vector $G^X_\lambda$ normalized by the power normalization $f(z) = \mathrm{sign}(z)\sqrt{|z|}$ (applied component-wise) and then by the $\ell_2$ norm:

$$\Phi^X_\lambda = \frac{f(G^X_\lambda)}{\lVert f(G^X_\lambda) \rVert_2}. \tag{5}$$
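To make Eqs. (1)-(5) concrete, the following is a minimal NumPy sketch of the encoding described above. It is an illustration under our own naming conventions (the function `fisher_vector` and its arguments `weights`, `means`, `variances` are assumptions, not code from the paper), and it assumes the diagonal GMM has already been fitted on training features:

```python
import numpy as np

def fisher_vector(X, weights, means, variances):
    """Minimal IFV encoding (Eqs. 1-5). X: (T, D) local features;
    the GMM has K diagonal Gaussians: weights (K,), means (K, D),
    variances (K, D) holding the diagonal of each covariance."""
    T, D = X.shape
    diff = X[:, None, :] - means[None, :, :]                 # (T, K, D)

    # Log-likelihood of each feature under each diagonal Gaussian.
    log_prob = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2.0 * np.pi * variances), axis=1))

    # Eq. (1): soft assignments gamma_t(i), computed in the log domain.
    log_gamma = np.log(weights) + log_prob                   # (T, K)
    log_gamma -= log_gamma.max(axis=1, keepdims=True)        # stability
    gamma = np.exp(log_gamma)
    gamma /= gamma.sum(axis=1, keepdims=True)

    sigma = np.sqrt(variances)
    # Eq. (2): gradient w.r.t. the means.
    G_mu = np.einsum('tk,tkd->kd', gamma, diff / sigma)
    G_mu /= T * np.sqrt(weights)[:, None]
    # Eq. (3): gradient w.r.t. the standard deviations.
    G_sigma = np.einsum('tk,tkd->kd', gamma, diff ** 2 / variances - 1.0)
    G_sigma /= T * np.sqrt(2.0 * weights)[:, None]

    # Eq. (4): interleaved concatenation [G_mu_1, G_sigma_1, ...].
    G = np.concatenate([G_mu, G_sigma], axis=1).ravel()      # (2*D*K,)

    # Eq. (5): power normalization, then L2 normalization.
    G = np.sign(G) * np.sqrt(np.abs(G))
    return G / (np.linalg.norm(G) + 1e-12)
```

The log-domain computation of the soft assignments in Eq. (1) is used only for numerical stability; it does not change the result of the encoding.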
2.2. Boosting the IFV with spatio-temporal information

The Improved Fisher Vectors encoding simplifies the structure of a video, assuming conditional independence across the spatial and temporal domains (see Sec. 2.1). It computes global statistics of local features only, ignoring the spatio-temporal positions of features. Thus, we propose an extension of the Improved Fisher Vectors which incorporates the spatio-temporal positions of features into the video model.

Firstly, we represent the positions of local features in a video-normalized manner. In this paper, we focus on local trajectories only; however, the following representation can also be applied to spatio-temporal interest points [8, 15, 29] (assuming that $p_t = (a_{t,1}, b_{t,1}, c_{t,1})$ is the spatio-temporal position of a point and $n_t = 1$).

Let $P = \{p_t, t = 1 \dots T\}$ be a set of $T$ trajectories extracted from a video sequence, where $p_t = ((a_{t,1}, b_{t,1}, c_{t,1}), \dots, (a_{t,n_t}, b_{t,n_t}, c_{t,n_t}))$ is a sample trajectory: a feature point detected at a spatial position $(a_{t,1}, b_{t,1})$ in a frame $c_{t,1}$ is tracked over $n_t \ge 1$ subsequent frames until a spatial position $(a_{t,n_t}, b_{t,n_t})$ in a frame $c_{t,n_t}$. We define the video-normalized position $\hat{p}_t$ of the center of a trajectory $p_t$ as:

$$\hat{p}_t = \left[ \frac{1}{v_w n_t} \sum_{i=1}^{n_t} a_{t,i}, \; \frac{1}{v_h n_t} \sum_{i=1}^{n_t} b_{t,i}, \; \frac{1}{v_l n_t} \sum_{i=1}^{n_t} c_{t,i} \right]^{\prime}, \tag{6}$$

where $v_w$ is the video width (in pixels), $v_h$ is the video height (in pixels), and $v_l$ is the video length (in frames). We normalize the position of the center of a trajectory so that the video size does not significantly change the magnitude of the feature position vector.

Once the positions of local features are represented in a video-normalized manner, we also consider using the unity-based normalization to reduce the influence of motionless regions at the boundaries of a video, so that large motionless regions do not significantly change the magnitude of the feature position vector. Let $\hat{p}_{t,i}$ be the $i$-th dimension of the vector $\hat{p}_t$, and let $\min(\hat{p}_{:,i})$ (resp. $\max(\hat{p}_{:,i})$) be the minimum (resp. maximum) value of the $i$-th dimension among all the video-normalized position vectors extracted from the training videos. When the condition $\forall i: \min(\hat{p}_{:,i}) \neq \max(\hat{p}_{:,i})$ holds, we can apply the unity-based normalization to calculate the vector $\tilde{p}_t$, whose $i$-th dimension is:

$$\tilde{p}_{t,i} = \frac{\hat{p}_{t,i} - \min(\hat{p}_{:,i})}{\max(\hat{p}_{:,i}) - \min(\hat{p}_{:,i})}. \tag{7}$$
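As an illustration, Eqs. (6) and (7) translate directly into a few lines of NumPy. This is a sketch under our own conventions (the helper names and the trajectory format, an (n_t, 3) array of (a, b, c) = (x, y, frame) points, are assumptions, not code from the paper):

```python
import numpy as np

def video_normalized_center(traj, vw, vh, vl):
    """Eq. (6): mean of the trajectory points, divided per dimension by
    the video width vw (pixels), height vh (pixels) and length vl
    (frames). traj is an (n_t, 3) array of (x, y, frame) points."""
    return traj.mean(axis=0) / np.array([vw, vh, vl], dtype=float)

def unity_normalize(p_hat, p_min, p_max):
    """Eq. (7): unity-based (min-max) normalization. p_min and p_max are
    the per-dimension extrema of the video-normalized positions over the
    *training* videos, assumed to satisfy p_min != p_max."""
    return (p_hat - p_min) / (p_max - p_min)
```

Note that, as in the text, the extrema are estimated on the training videos only and then reused unchanged at test time.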
Then, we incorporate the normalized positions of local features into the Improved Fisher Vectors model, so that videos are represented using both local descriptors and their spatio-temporal positions.

Let $Y = \{y_t = [\tilde{p}_t, x_t], t = 1 \dots T\}$ be a set of local features, where $x_t \in \mathbb{R}^D$ is a local feature descriptor and $\tilde{p}_t \in \mathbb{R}^E$ is its corresponding normalized position (typically $E = 3$), calculated as above. Let $\tilde{\lambda} = \{\tilde{w}_i, \tilde{\mu}_i, \tilde{\Sigma}_i, i = 1 \dots K\}$ be the parameters of a GMM $u_{\tilde{\lambda}}(y) = \sum_{i=1}^{K} \tilde{w}_i u_i(y)$ fitting the distribution of these local features, where $\tilde{w}_i \in \mathbb{R}$, $\tilde{\mu}_i \in \mathbb{R}^{D+E}$ and $\tilde{\Sigma}_i \in \mathbb{R}^{(D+E) \times (D+E)}$ are respectively the mixture weight, mean vector and covariance matrix of the $i$-th Gaussian. As before, we assume that the covariance matrices are diagonal and denote by $\tilde{\sigma}_i^2$ the variance vector, i.e. $\tilde{\Sigma}_i = \mathrm{diag}(\tilde{\sigma}_i^2)$, $\tilde{\sigma}_i^2 \in \mathbb{R}^{D+E}$. We calculate $G^Y_{\tilde{\mu},i}$ (Eq. 2) and $G^Y_{\tilde{\sigma},i}$ (Eq. 3) for all $K$ Gaussian components, and concatenate all the gradient vectors into a vector $G^Y_{\tilde{\lambda}}$. Finally, the new Improved Fisher Vectors representation is the gradient vector $G^Y_{\tilde{\lambda}}$ normalized by the power normalization and then the $\ell_2$ norm:

$$\Phi^Y_{\tilde{\lambda}} = \frac{f(G^Y_{\tilde{\lambda}})}{\lVert f(G^Y_{\tilde{\lambda}}) \rVert_2}, \quad G^Y_{\tilde{\lambda}} = [\,G^Y_{\tilde{\mu},1}, G^Y_{\tilde{\sigma},1}, \dots, G^Y_{\tilde{\mu},K}, G^Y_{\tilde{\sigma},K}\,]^{\prime}. \tag{8}$$
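The extension therefore needs no new encoding machinery: the normalized positions are simply appended to the descriptors before encoding. A minimal sketch, reusing the hypothetical `fisher_vector` helper from above (the argument names are ours; the GMM parameters are assumed to be fitted on such position-augmented (D+E)-dimensional training features):

```python
import numpy as np

def extended_fisher_vector(x_descs, p_tilde, weights, means, variances):
    """Extended IFV of Sec. 2.2 (Eq. 8). x_descs: (T, D) descriptors;
    p_tilde: (T, E) normalized positions from Eqs. (6)-(7), here E = 3.
    Returns a vector in R^{2(D+E)K}."""
    Y = np.hstack([p_tilde, x_descs])   # y_t = [p_tilde_t, x_t], (T, D+E)
    return fisher_vector(Y, weights, means, variances)
```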
2.3. Fast IFV-based Sliding Window

Our goal is to determine whether violence occurs in a video and when it happens; therefore, we search for a range of frames which contains violence. We base our approach on the temporal sliding window [12], which evaluates video sub-sequences at varying locations and scales.

Let $v_l$ be the video length (in frames), $s > 0$ be the window step size (in frames), and $w = \{is\}_{i=1 \dots m}$ be the temporal window sizes (scales) for the sliding window algorithm. Moreover, let $v = ns$ be an approximated video length (where $ns \ge v_l > (n-1)s$ and $n \ge m \ge 1$). A visualization of a sample video and sample sliding windows is presented in Fig. 1. Note that the IFV are calculated for features from the same temporal segments multiple times, i.e. $m(n - m + 1)$ times for $n$ segments (e.g. in Fig. 1, $m = 4$ and $n = 8$ give $4 \cdot (8 - 4 + 1) = 20$ calculations for 8 segments). Therefore, to speed up the detection framework, we re-formulate the IFV and use the summed area table data structure, so that the IFV are calculated for features from the temporal segments only once.

Figure 1. Temporal sliding window: a sample video is divided into n ≥ 8 segments. We use m = 4 window scales. Note that the IFV are calculated for features from the same segments multiple times (20 times for 8 segments).

Let $X = \{x_t, t = 1 \dots T\}$ be a set of $T$ local features extracted from a video. Let $X' = \{X_j, j = 1 \dots N\}$ be a partition of the set $X$ into $N$ subsets $X_j = \{x_{j,k}\}_{k=1}^{|X_j|}$, such that $|X_j|$ is the cardinality of the set $X_j$, $X = \bigcup_{j=1}^{N} X_j$, and $X_j \cap X_k = \emptyset$ for all $j \neq k$, $j,k = 1 \dots N$; and let $\phi(j,k) \mapsto t$ be the mapping function such that $x_{j,k} = x_t$.

We re-write Eq. (2):

$$G^X_{\mu,i} = \frac{1}{T\sqrt{w_i}} \sum_{t=1}^{T} \gamma_t(i)\left(\frac{x_t - \mu_i}{\sigma_i}\right) = \frac{1}{T\sqrt{w_i}} \sum_{j=1}^{N} \sum_{k=1}^{|X_j|} \gamma_{\phi(j,k)}(i)\left(\frac{x_{\phi(j,k)} - \mu_i}{\sigma_i}\right) = \frac{1}{T} \sum_{j=1}^{N} G^{X_j}_{\mu,i}\,|X_j| = \frac{1}{T} \sum_{j=1}^{N} H^{X_j}_{\mu,i}, \tag{9}$$

$$H^{X_j}_{\mu,i} = G^{X_j}_{\mu,i}\,|X_j| = \frac{1}{\sqrt{w_i}} \sum_{k=1}^{|X_j|} \gamma_{\phi(j,k)}(i)\left(\frac{x_{\phi(j,k)} - \mu_i}{\sigma_i}\right).$$

Similarly, we re-write Eq. (3):

$$G^X_{\sigma,i} = \frac{1}{T\sqrt{2w_i}} \sum_{t=1}^{T} \gamma_t(i)\left[\frac{(x_t - \mu_i)^2}{\sigma_i^2} - 1\right] = \frac{1}{T\sqrt{2w_i}} \sum_{j=1}^{N} \sum_{k=1}^{|X_j|} \gamma_{\phi(j,k)}(i)\left[\frac{(x_{\phi(j,k)} - \mu_i)^2}{\sigma_i^2} - 1\right] = \frac{1}{T} \sum_{j=1}^{N} G^{X_j}_{\sigma,i}\,|X_j| = \frac{1}{T} \sum_{j=1}^{N} H^{X_j}_{\sigma,i}, \tag{10}$$

$$H^{X_j}_{\sigma,i} = G^{X_j}_{\sigma,i}\,|X_j| = \frac{1}{\sqrt{2w_i}} \sum_{k=1}^{|X_j|} \gamma_{\phi(j,k)}(i)\left[\frac{(x_{\phi(j,k)} - \mu_i)^2}{\sigma_i^2} - 1\right].$$

Then, we define the gradient vector $H^{X_j}_\lambda$ as the concatenation of all the $K$ gradient vectors $H^{X_j}_{\mu,i}$ and all the $K$ gradient vectors $H^{X_j}_{\sigma,i}$, $i = 1 \dots K$:

$$H^{X_j}_\lambda = [\,H^{X_j}_{\mu,1}, H^{X_j}_{\sigma,1}, \dots, H^{X_j}_{\mu,K}, H^{X_j}_{\sigma,K}\,]^{\prime}. \tag{11}$$
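The practical consequence of Eqs. (9)-(11) is that each temporal segment can be encoded once into its unnormalized vector $H^{X_j}_\lambda$, after which the gradient vector of any window is a difference of two prefix sums divided by the number of features inside the window. The following is a hedged sketch of this summed-area-table variant (the helper name and data layout are ours; the rows of `H` can be obtained by running the sums of Eqs. (2)-(3) per segment without the 1/T factor):

```python
import numpy as np

def fast_window_fvs(H, counts, m):
    """Summed-area-table sliding window (Sec. 2.3, Eqs. 9-11).

    H:      (N, 2*D*K) array; row j is the unnormalized per-segment
            gradient vector H^{X_j} of Eq. (11).
    counts: (N,) number of features |X_j| in each temporal segment.
    m:      number of window scales; a window of scale k spans k
            consecutive segments (each segment is s frames long).

    Returns {(start, size_in_segments): IFV} for every window.
    """
    N = H.shape[0]
    # 1-D summed area tables (prefix sums) over segments.
    S = np.vstack([np.zeros((1, H.shape[1])), np.cumsum(H, axis=0)])
    C = np.concatenate([[0], np.cumsum(counts)])

    fvs = {}
    for k in range(1, m + 1):               # window scale, in segments
        for j in range(N - k + 1):          # window start segment
            T_win = C[j + k] - C[j]         # features inside the window
            if T_win == 0:
                continue                    # no features: skip window
            G = (S[j + k] - S[j]) / T_win   # Eqs. (9)-(10) for the window
            G = np.sign(G) * np.sqrt(np.abs(G))            # power norm.
            fvs[(j, k)] = G / (np.linalg.norm(G) + 1e-12)  # L2 norm
    return fvs
```

Only the cheap power and L2 normalizations are repeated per window; the expensive soft assignments and gradient sums are computed once per segment, which is what makes the detection framework significantly faster than re-encoding every window from the raw features.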
Table 2. Comparison with the state-of-the-art on the Violent-Flows (left table), Hockey Fight (middle), and Movies (right) datasets.
Figure 3. ROC curves: our approach (on the left) vs. the state-of-the-art (on the right) on the Violent-Flows 21 dataset.

Approach   LTP    HOG    HOF    HNF    VIF    Ours
AUC        79.9   61.8   57.6   59.9   85.0   87.0

Table 3. AUC metric on the Violent-Flows 21 dataset [9].

Process                      Processing Time (fps)
Feature Extraction (IDT)      5.7
Sliding Window                9.28
Ours: Fast Sliding Window    99.21

Table 4. Average processing time on the Violent-Flows 21 dataset using a single Intel(R) Xeon(R) CPU E5-1630 v3 @ 3.70GHz.

Our approach achieves better or similar accuracy (keeping the representation more compact) as compared to the IFV with spatio-temporal grids. Moreover, our approach significantly outperforms the existing techniques on all three violence recognition datasets.

4.4. Results: Violence Detection

We evaluate our Fast Sliding Window-based approach on the Violent-Flows 21 dataset.

Firstly, we evaluate the accuracy of the sliding window / Fast Sliding Window approach (both techniques achieve the same results, as the re-formulation in Sec. 2.3 is exact).

5. Conclusions

We have proposed an extension of the Improved Fisher Vectors (IFV) for violence recognition in videos, which allows a video to be represented using both local features and their spatio-temporal positions. The proposed extension has been shown to boost the IFV, achieving better or similar accuracy (while keeping the representation more compact) as compared to the IFV with a spatio-temporal grid. Moreover, our approach has been shown to significantly outperform the existing techniques on three violence recognition datasets. Then, we have studied the popular sliding window approach for violence detection. We have re-formulated the IFV and used the summed area table data structure to significantly speed up the violence detection framework. The evaluations have been performed on 4 state-of-the-art datasets.

Acknowledgements

The research leading to these results has received funding from the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme FP7/2007-2013/ under REA grant agreement no. 324359. However, the views and opinions expressed herein do not necessarily reflect those of the financing institutions.
References

[1] P. Bilinski and F. Bremond. Video Covariance Matrix Logarithm for Human Action Recognition in Videos. In International Joint Conference on Artificial Intelligence (IJCAI), 2015.
[2] P. Bilinski, M. Koperski, S. Bak, and F. Bremond. Representing Visual Appearance by Video Brownian Covariance Descriptor for Human Action Recognition. In IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), 2014.
[3] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[4] X. Cui, Q. Liu, M. Gao, and D. N. Metaxas. Abnormal Detection Using Interaction Energy Potentials. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[5] O. Deniz, I. Serrano, G. Bueno, and T.-K. Kim. Fast Violence Detection in Video. In International Conference on Computer Vision Theory and Applications (VISAPP), 2014.
[6] A. Datta, M. Shah, and N. D. V. Lobo. Person-on-Person Violence Detection in Video Data. In International Conference on Pattern Recognition (ICPR), 2002.
[7] F. D. de Souza, G. C. Chavez, E. A. do Valle, and A. de A. Araujo. Violence Detection in Video Using Spatio-Temporal Features. In SIBGRAPI Conference on Graphics, Patterns and Images, 2010.
[8] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior Recognition via Sparse Spatio-Temporal Features. In IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.
[9] T. Hassner, Y. Itcher, and O. Kliper-Gross. Violent Flows: Real-Time Detection of Violent Crowd Behavior. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2012.
[10] C.-W. Hsu, C.-C. Chang, and C.-J. Lin. A Practical Guide to Support Vector Classification. Technical report, Department of Computer Science, National Taiwan University, 2003.
[11] V. Kantorov and I. Laptev. Efficient Feature Extraction, Encoding and Classification for Action Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[12] A. Klaser, M. Marszalek, C. Schmid, and A. Zisserman. Human Focused Action Localization in Video. In Trends and Topics in Computer Vision, pages 219-233. Springer, 2010.
[13] J. Krapac, J. Verbeek, and F. Jurie. Modeling Spatial Layout with Fisher Vectors for Image Categorization. In IEEE International Conference on Computer Vision (ICCV), 2011.
[14] J. Krapac, J. Verbeek, and F. Jurie. Spatial Fisher Vectors for Image Categorization. Research Report RR-7680, INRIA, 2011.
[15] I. Laptev. On Space-Time Interest Points. International Journal of Computer Vision (IJCV), 64(2-3):107-123, 2005.
[16] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning Realistic Human Actions from Movies. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[17] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
[18] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos. Anomaly Detection in Crowded Scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[19] S. McCann and D. G. Lowe. Spatially Local Coding for Object Recognition. In Asian Conference on Computer Vision (ACCV), 2012.
[20] R. Mehran, A. Oyama, and M. Shah. Abnormal Crowd Behavior Detection Using Social Force Model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[21] S. Mohammadi, H. Kiani, A. Perina, and V. Murino. Violence Detection in Crowded Scenes Using Substantial Derivative. In IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), 2015.
[22] H. Mousavi, S. Mohammadi, A. Perina, R. Chellali, and V. Murino. Analyzing Tracklets for the Detection of Abnormal Crowd Behavior. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2015.
[23] E. B. Nievas, O. D. Suarez, G. B. Garcia, and R. Sukthankar. Violence Detection in Video Using Computer Vision Techniques. In International Conference on Computer Analysis of Images and Patterns (CAIP), 2011.
[24] F. Perronnin, J. Sanchez, and T. Mensink. Improving the Fisher Kernel for Large-Scale Image Classification. In European Conference on Computer Vision (ECCV), 2010.
[25] J. Sanchez, F. Perronnin, and T. De Campos. Modeling the Spatial Layout of Images Beyond Spatial Pyramids. Pattern Recognition Letters, 33(16):2216-2223, 2012.
[26] O. Tuzel, F. Porikli, and P. Meer. Region Covariance: A Fast Descriptor for Detection and Classification. In European Conference on Computer Vision (ECCV), 2006.
[27] H. Wang and C. Schmid. Action Recognition with Improved Trajectories. In IEEE International Conference on Computer Vision (ICCV), 2013.
[28] L. Wang, Y. Qiao, and X. Tang. Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[29] G. Willems, T. Tuytelaars, and L. Van Gool. An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector. In European Conference on Computer Vision (ECCV), 2008.
[30] L. Yeffet and L. Wolf. Local Trinary Patterns for Human Action Recognition. In IEEE International Conference on Computer Vision (ICCV), 2009.