
Human Violence Recognition and Detection in Surveillance Videos
Piotr Bilinski, Francois Bremond

To cite this version:

Piotr Bilinski, Francois Bremond. Human Violence Recognition and Detection in Surveillance Videos. AVSS, Aug 2016, Colorado, United States. hal-01849284

HAL Id: hal-01849284
https://inria.hal.science/hal-01849284
Submitted on 25 Jul 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Human Violence Recognition and Detection in Surveillance Videos

Piotr Bilinski and Francois Bremond


INRIA Sophia Antipolis, STARS team
2004 Route des Lucioles, BP93, 06902 Sophia Antipolis, France
{Piotr.Bilinski,Francois.Bremond}@inria.fr

Abstract

In this paper, we focus on the important topic of violence recognition and detection in surveillance videos. Our goal is to determine whether violence occurs in a video (recognition) and when it happens (detection). First, we propose an extension of the Improved Fisher Vectors (IFV) for videos, which allows a video to be represented using both local features and their spatio-temporal positions. Then, we study the popular sliding window approach for violence detection, and we re-formulate the Improved Fisher Vectors and use the summed area table data structure to speed up the approach. We present an extensive evaluation, comparison and analysis of the proposed improvements on 4 state-of-the-art datasets. We show that the proposed improvements make violence recognition more accurate (as compared to the standard IFV, the IFV with a spatio-temporal grid, and other state-of-the-art methods) and make violence detection significantly faster.

1. Introduction

Video surveillance cameras are part of our lives. They are used almost everywhere, e.g. on streets, in subways, at train and bus stations, airports, and sports stadiums. Today's increase in threats to security in cities and towns around the world makes the use of video cameras to monitor people necessary. Attacks on humans, fights, and vandalism are just some of the cases where detection systems, particularly violence detection systems, are needed.

In this paper, we focus on the important topic of violence recognition and detection in surveillance videos. Our goal is to determine whether violence occurs in a video (recognition) and when it happens (detection).

Over the last years, several violence recognition and detection techniques have been proposed. [6] used motion trajectory information and orientation information of a person's limbs for person-on-person fight detection. One of the main drawbacks of this approach is that it requires precise segmentation, which is very difficult to obtain in real-world videos. Instead, [7, 9, 21, 23] have focused on local features and the bag-of-features approach; the main difference between these techniques lies in the type of features used. [23] applied STIP and SIFT features, and [7] used STIP and MoSIFT features. [9] proposed the Violent Flows (ViF) descriptor, encoding how flow-vector magnitudes change over time. [21] proposed a video descriptor based on the substantial derivative. Despite recent improvements in violence recognition and detection, effective solutions for real-world situations are still unavailable.

The Improved Fisher Vectors (IFV) [24] is a bag-of-features-like video encoding strategy which has been shown to outperform the standard bag-of-features. It is a video (and image) descriptor obtained by pooling local features into a global representation. It describes local features by their deviation from a "universal" generative Gaussian Mixture Model. The IFV has been widely applied for recognition tasks in videos [1, 2, 11, 27, 28]. One of the main drawbacks of the IFV is that it simplifies the structure of a video by assuming conditional independence across the spatial and temporal domains; it computes global statistics of local features only, ignoring the spatio-temporal positions of features.

Clearly, spatio-temporal position information can be useful. A common way to use it with the IFV is to apply either spatio-temporal grids [16] or multi-scale pyramids [17]; however, these methods are still limited in terms of detailed description, providing only a coarse representation. There are several other state-of-the-art methods [13, 14, 19, 25], but as they were proposed for images (for image categorization and object recognition), they cannot be directly applied to videos; moreover, [13, 14] achieve results similar to the spatial grids/pyramids, and [19] is parameter sensitive and requires additional parameter learning.

As opposed to the existing violence recognition and detection methods (which focus mainly on new descriptors), we focus on the video representation model, for two reasons: to make it more accurate for violence recognition, and to make it faster for violence detection. First, we propose an extension of the IFV for videos (Sec. 2.2), which allows spatio-temporal positions of features to be used with the IFV. The proposed extension boosts the IFV and achieves better or similar accuracy (while keeping the representation more compact) as compared to spatio-temporal grids. Then, we study and evaluate the popular sliding window approach [12] for violence detection. We re-formulate the IFV and use the summed area table data structure to speed up the sliding window method (Sec. 2.3). Then, we present an extensive evaluation, comparison and analysis of the proposed improvements on 4 state-of-the-art datasets (Sec. 3 and Sec. 4). Finally, we conclude in Sec. 5.

Abnormal behavior detection: There are several methods for abnormal behavior and anomaly detection [4, 18, 20, 22]. However, abnormalities do not represent a compact and well-defined concept. Abnormality detection is a different research topic, with different constraints and assumptions, and therefore we do not focus on these techniques.

2. Boosting the Improved Fisher Vectors (IFV)

2.1. State-of-the-Art: Improved Fisher Vectors

This section provides a brief description of the Improved Fisher Vectors, introduced in [24]. The mathematical notations and formulas provided here are in accordance with [24], and we refer to it for more details.

Let X = {x_t, t = 1...T} be a set of T local features extracted from a video, where each local feature is of dimension D, x_t ∈ R^D. Let λ = {w_i, μ_i, Σ_i, i = 1...K} be the parameters of a Gaussian Mixture Model (GMM) u_λ(x) = Σ_{i=1}^{K} w_i u_i(x) fitting the distribution of local features, where w_i ∈ R, μ_i ∈ R^D and Σ_i ∈ R^{D×D} are respectively the mixture weight, mean vector and covariance matrix of the i-th Gaussian u_i. We assume that the covariance matrices are diagonal and we denote by σ_i² the variance vector, i.e. Σ_i = diag(σ_i²), σ_i² ∈ R^D.

Moreover, let γ_t(i) be the soft assignment of a descriptor x_t to a Gaussian i:

    \gamma_t(i) = \frac{w_i\, u_i(x_t)}{\sum_{j=1}^{K} w_j\, u_j(x_t)},    (1)

and let G^X_{μ,i} (resp. G^X_{σ,i}) be the gradient w.r.t. the mean μ_i (resp. standard deviation σ_i) of a Gaussian i:

    G^X_{\mu,i} = \frac{1}{T\sqrt{w_i}} \sum_{t=1}^{T} \gamma_t(i) \left( \frac{x_t - \mu_i}{\sigma_i} \right),    (2)

    G^X_{\sigma,i} = \frac{1}{T\sqrt{2 w_i}} \sum_{t=1}^{T} \gamma_t(i) \left[ \frac{(x_t - \mu_i)^2}{\sigma_i^2} - 1 \right],    (3)

where the division between vectors is a term-by-term operation. Then, the gradient vector G^X_λ is the concatenation of all the K gradient vectors G^X_{μ,i} ∈ R^D and all the K gradient vectors G^X_{σ,i} ∈ R^D, i = 1...K:

    G^X_\lambda = [\, G^X_{\mu,1}, G^X_{\sigma,1}, \ldots, G^X_{\mu,K}, G^X_{\sigma,K} \,]'.    (4)

The IFV (Improved Fisher Vectors) representation Φ^X_λ ∈ R^{2DK} is the gradient vector G^X_λ normalized by the power normalization f(z) = sign(z)√|z| (applied element-wise) and then the L2 norm:

    \Phi^X_\lambda = \frac{f(G^X_\lambda)}{\lVert f(G^X_\lambda) \rVert_2}.    (5)
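To make Eqs. (1)-(5) concrete, the following is a minimal NumPy sketch of the IFV encoding; it is our illustration, not the authors' code. It assumes a fitted diagonal-covariance GMM given as arrays `w` (K,), `mu` (K, D) and `var` (K, D); the function name is ours.

```python
import numpy as np

def ifv_encode(X, w, mu, var):
    """Improved Fisher Vector of local features X (T, D) under a
    diagonal-covariance GMM (Eqs. 1-5). Returns a 2*D*K vector."""
    T, D = X.shape
    diff = X[:, None, :] - mu[None, :, :]                    # (T, K, D)

    # Eq. (1): soft assignments gamma (T, K), computed from Gaussian
    # log-likelihoods for numerical stability.
    log_p = (-0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var), axis=2)
             + np.log(w))
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)

    sigma = np.sqrt(var)
    # Eq. (2): gradient blocks w.r.t. the means, one D-dim block per Gaussian.
    G_mu = np.einsum('tk,tkd->kd', gamma, diff / sigma) / (T * np.sqrt(w)[:, None])
    # Eq. (3): gradient blocks w.r.t. the standard deviations.
    G_sigma = np.einsum('tk,tkd->kd', gamma, diff**2 / var - 1.0) \
              / (T * np.sqrt(2 * w)[:, None])

    # Eq. (4): interleaved concatenation [G_mu_1, G_sigma_1, ..., G_mu_K, G_sigma_K].
    G = np.stack([G_mu, G_sigma], axis=1).reshape(-1)

    # Eq. (5): power normalization sign(z)*sqrt(|z|), then L2 normalization.
    G = np.sign(G) * np.sqrt(np.abs(G))
    return G / (np.linalg.norm(G) + 1e-12)
```

With this encoding, each video yields a single 2DK-dimensional vector per descriptor channel, regardless of the number of local features T.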
2.2. Boosting the IFV with Spatio-Temporal Information

The Improved Fisher Vectors encoding simplifies the structure of a video by assuming conditional independence across the spatial and temporal domains (see Sec. 2.1). It computes global statistics of local features only, ignoring the spatio-temporal positions of features. Thus, we propose an extension of the Improved Fisher Vectors which incorporates the spatio-temporal positions of features into the video model.

Firstly, we represent the positions of local features in a video-normalized manner. In this paper, we focus on local trajectories only; however, the following representation can also be applied to spatio-temporal interest points [8, 15, 29] (with the assumptions that p_t = (a_{t,1}, b_{t,1}, c_{t,1}) is the spatio-temporal position of a point and n_t = 1).

Let P = {p_t, t = 1...T} be a set of T trajectories extracted from a video sequence, where p_t = ((a_{t,1}, b_{t,1}, c_{t,1}), ..., (a_{t,n_t}, b_{t,n_t}, c_{t,n_t})) is a sample trajectory in which a feature point detected at a spatial position (a_{t,1}, b_{t,1}) in frame c_{t,1} is tracked over n_t ≥ 1 subsequent frames until a spatial position (a_{t,n_t}, b_{t,n_t}) in frame c_{t,n_t}. We define the video-normalized position p̂_t of the center of a trajectory p_t as:

    \hat{p}_t = \left[\, \frac{1}{v_w n_t} \sum_{i=1}^{n_t} a_{t,i}, \;\; \frac{1}{v_h n_t} \sum_{i=1}^{n_t} b_{t,i}, \;\; \frac{1}{v_l n_t} \sum_{i=1}^{n_t} c_{t,i} \,\right]',    (6)

where v_w is the video width (in pixels), v_h is the video height (in pixels), and v_l is the video length (in frames). We normalize the position of the center of a trajectory so that the video size does not significantly change the magnitude of the feature position vector.

Once the positions of local features are represented in a video-normalized manner, we also consider using unity-based normalization to reduce the influence of motionless regions at the boundaries of a video, so that large motionless regions do not significantly change the magnitude of the feature position vector. Let p̂_{t,i} be the i-th dimension of the vector p̂_t, and min(p̂_{:,i}) (resp. max(p̂_{:,i})) be the minimum (resp. maximum) value of the i-th dimension among all the video-normalized position vectors extracted from the training videos. When the condition ∀i : min(p̂_{:,i}) ≠ max(p̂_{:,i}) holds, we can apply unity-based normalization to calculate the vector p̃_t. The i-th dimension of the vector p̃_t is:

    \tilde{p}_{t,i} = \frac{\hat{p}_{t,i} - \min(\hat{p}_{:,i})}{\max(\hat{p}_{:,i}) - \min(\hat{p}_{:,i})}.    (7)

Then, we incorporate the normalized positions of local features into the Improved Fisher Vectors model, so that videos are represented using both local descriptors and their spatio-temporal positions.

Let Y = {y_t = [p̃_t, x_t], t = 1...T} be a set of local features, where x_t ∈ R^D is a local feature descriptor and p̃_t ∈ R^E is its corresponding normalized position (typically E = 3), calculated as above. Let λ̃ = {w̃_i, μ̃_i, Σ̃_i, i = 1...K} be the parameters of a GMM u_λ̃(y) = Σ_{i=1}^{K} w̃_i u_i(y) fitting the distribution of these local features, where w̃_i ∈ R, μ̃_i ∈ R^{D+E} and Σ̃_i ∈ R^{(D+E)×(D+E)} are respectively the mixture weight, mean vector and covariance matrix of the i-th Gaussian. As before, we assume that the covariance matrices are diagonal and we denote by σ̃_i² the variance vector, i.e. Σ̃_i = diag(σ̃_i²), σ̃_i² ∈ R^{D+E}. We calculate G^Y_{μ̃,i} (Eq. 2) and G^Y_{σ̃,i} (Eq. 3) for all K Gaussian components, and concatenate all the gradient vectors into a vector G^Y_λ̃ = [ G^Y_{μ̃,1}, G^Y_{σ̃,1}, ..., G^Y_{μ̃,K}, G^Y_{σ̃,K} ]′. Finally, the new Improved Fisher Vectors representation is the gradient vector G^Y_λ̃ normalized by the power normalization and then the L2 norm:

    \Phi^Y_{\tilde{\lambda}} = \frac{f(G^Y_{\tilde{\lambda}})}{\lVert f(G^Y_{\tilde{\lambda}}) \rVert_2}.    (8)
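As an illustration of Eqs. (6)-(7), here is a small NumPy sketch (ours, not the authors' code; the function and variable names are illustrative): it maps raw trajectories to video-normalized centers, applies unity-based normalization with training-set minima and maxima, and prepends the result to each descriptor.

```python
import numpy as np

def normalized_positions(trajs, vw, vh, vl):
    """Eq. (6): video-normalized center of each trajectory.

    trajs: list of (n_t, 3) arrays of (a, b, c) = (x, y, frame) points.
    Returns (T, 3) centers scaled by video width vw, height vh, length vl.
    """
    centers = np.stack([t.mean(axis=0) for t in trajs])      # (T, 3)
    return centers / np.array([vw, vh, vl], dtype=float)

def unity_normalize(p_hat, p_min, p_max):
    """Eq. (7): unity-based normalization with minima/maxima taken over
    the training videos. Assumes p_min != p_max in every dimension, as
    required in the text."""
    return (p_hat - p_min) / (p_max - p_min)

def augment_features(X, p_tilde):
    """Sec. 2.2: build y_t = [p_tilde_t, x_t] for every local feature."""
    return np.concatenate([p_tilde, X], axis=1)
```

The GMM of Sec. 2.2 is then fitted on these (D+E)-dimensional vectors, and the encoding of Sec. 2.1 is reused unchanged.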
2.3. Fast IFV-based Sliding Window

Our goal is to determine whether violence occurs in a video and when it happens; therefore, we search for a range of frames which contains violence. We base our approach on the temporal sliding window [12], which evaluates video sub-sequences at varying locations and scales.

Let v_l be the video length (in frames), s > 0 be the window step size (in frames), and w = {is}_{i=1...m} be the temporal window sizes (scales) for the sliding window algorithm. Moreover, let v = ns be an approximated video length (where ns ≥ v_l > (n − 1)s and n ≥ m ≥ 1). A visualization of a sample video and sample sliding windows is presented in Fig. 1. Note that the IFV are calculated for features from the same temporal segments multiple times, i.e. m(n − m + 1) times for m segments (e.g. in Fig. 1, with m = 4 and n = 8, this is 4 · (8 − 4 + 1) = 20 times). Therefore, to speed up the detection framework, we re-formulate the IFV and use the summed area table data structure, so that the IFV statistics are calculated for features from the temporal segments only once.

Figure 1. Temporal sliding window: a sample video is divided into n ≥ 8 segments. We use m = 4 window scales. Note that the IFV are calculated for features from the same segments multiple times (20 times for 8 segments).

Let X = {x_t, t = 1...T} be a set of T local features extracted from a video. Let X′ = {X_j, j = 1...N} be a partition of the set X into N subsets X_j = {x_{j,k}}_{k=1}^{|X_j|} such that |X_j| is the cardinality of the set X_j, X = ∪_{j=1}^{N} X_j, X_j ∩ X_k = ∅ for all j ≠ k, and φ(j,k) → t is the mapping function such that x_{j,k} = x_t.

We re-write Eq. (2):

    G^X_{\mu,i} = \frac{1}{T\sqrt{w_i}} \sum_{t=1}^{T} \gamma_t(i) \left( \frac{x_t - \mu_i}{\sigma_i} \right)
                = \frac{1}{T\sqrt{w_i}} \sum_{j=1}^{N} \sum_{k=1}^{|X_j|} \gamma_{\phi(j,k)}(i) \left( \frac{x_{\phi(j,k)} - \mu_i}{\sigma_i} \right)
                = \frac{1}{T} \sum_{j=1}^{N} G^{X_j}_{\mu,i} |X_j|
                = \frac{1}{T} \sum_{j=1}^{N} H^{X_j}_{\mu,i},    (9)

    H^{X_j}_{\mu,i} = G^{X_j}_{\mu,i} |X_j| = \frac{1}{\sqrt{w_i}} \sum_{k=1}^{|X_j|} \gamma_{\phi(j,k)}(i) \left( \frac{x_{\phi(j,k)} - \mu_i}{\sigma_i} \right).

Similarly, we re-write Eq. (3):

    G^X_{\sigma,i} = \frac{1}{T\sqrt{2 w_i}} \sum_{t=1}^{T} \gamma_t(i) \left[ \frac{(x_t - \mu_i)^2}{\sigma_i^2} - 1 \right]
                   = \frac{1}{T\sqrt{2 w_i}} \sum_{j=1}^{N} \sum_{k=1}^{|X_j|} \gamma_{\phi(j,k)}(i) \left[ \frac{(x_{\phi(j,k)} - \mu_i)^2}{\sigma_i^2} - 1 \right]
                   = \frac{1}{T} \sum_{j=1}^{N} G^{X_j}_{\sigma,i} |X_j|
                   = \frac{1}{T} \sum_{j=1}^{N} H^{X_j}_{\sigma,i},    (10)

    H^{X_j}_{\sigma,i} = G^{X_j}_{\sigma,i} |X_j| = \frac{1}{\sqrt{2 w_i}} \sum_{k=1}^{|X_j|} \gamma_{\phi(j,k)}(i) \left[ \frac{(x_{\phi(j,k)} - \mu_i)^2}{\sigma_i^2} - 1 \right].

Then, let us define the gradient vector H^{X_j}_λ as the concatenation of all the K gradient vectors H^{X_j}_{μ,i} and all the K gradient vectors H^{X_j}_{σ,i}, i = 1...K:

    H^{X_j}_\lambda = [\, H^{X_j}_{\mu,1}, H^{X_j}_{\sigma,1}, \ldots, H^{X_j}_{\mu,K}, H^{X_j}_{\sigma,K} \,]'.    (11)

The Improved Fisher Vectors representation Φ^X̃_λ of the local features X̃ = ∪_{j=M}^{N} X_j, where 1 < M ≤ N, can then be calculated using:

    G^{\bigcup_{j=M}^{N} X_j}_{\lambda} = \frac{ H^{\bigcup_{j=1}^{N} X_j}_{\lambda} \;-\; H^{\bigcup_{j=1}^{M-1} X_j}_{\lambda} }{ \sum_{j=1}^{N} |X_j| \;-\; \sum_{j=1}^{M-1} |X_j| },    (12)

where H^{∪_j X_j}_λ = Σ_j H^{X_j}_λ (the per-segment statistics are additive), and then applying the power normalization and the L2 norm to the obtained gradient vector. The obtained representation is exactly the same as if we had used Eqs. (2)-(5). However, in contrast to the original IFV, the above equations can be used directly with data structures such as summed area tables (integral images) and k-d trees.

For the task of violence localization, we use the above formulation of the IFV (Eqs. (9)-(12)) and directly apply the summed area table (integral images [26]). The two main advantages of this solution are:

(1) It speeds up the calculations, as every feature is assigned to each Gaussian exactly once. For example, we detected 25k features in an 84-frame-long video; with m = 4 and s = 5, every feature was assigned to each Gaussian 4-10 times, which is as if 224k features had been assigned to each Gaussian. In our algorithm, each feature is assigned to each Gaussian exactly once, i.e. nearly 9 times fewer calculations.

(2) It reduces memory usage, especially when a video contains a lot of motion and dense features are extracted [27]. For example, we extracted ~130k features in a 35-second-long video (~3.7k features per second on average). With Improved Dense Trajectories [27] (each trajectory is represented using 426 floats), this means ~1.6M floats to store per second (segment), which is 29 times more than the IFV representation with 128 Gaussians calculated for this segment.
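The following NumPy sketch (our reconstruction under the stated equations, not the authors' implementation; all names are ours) shows the idea behind Eqs. (9)-(12): the unnormalized statistics H are computed once per temporal segment, a cumulative (summed area) table is built over segments, and the IFV of any window of segments is then obtained by a constant number of subtractions.

```python
import numpy as np

def segment_H(Xj, w, mu, var):
    """Unnormalized per-segment statistics H^{Xj} (Eqs. 9-11): the
    per-Gaussian gradient sums over one segment's features, without
    the global 1/T factor."""
    diff = Xj[:, None, :] - mu[None, :, :]                   # (|Xj|, K, D)
    log_p = (-0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var), axis=2)
             + np.log(w))
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)                # Eq. (1)

    sigma = np.sqrt(var)
    H_mu = np.einsum('tk,tkd->kd', gamma, diff / sigma) / np.sqrt(w)[:, None]
    H_sig = np.einsum('tk,tkd->kd', gamma, diff**2 / var - 1.0) \
            / np.sqrt(2 * w)[:, None]
    return np.stack([H_mu, H_sig], axis=1).reshape(-1)       # Eq. (11) layout

def window_ifv(H, sizes, a, b):
    """IFV of temporal segments a..b (inclusive, 0-based) from the
    precomputed tables: Eq. (12) plus power and L2 normalization.
    H: (N, 2*D*K) stacked segment_H outputs; sizes: (N,) feature counts."""
    cumH = np.vstack([np.zeros((1, H.shape[1])), np.cumsum(H, axis=0)])
    cumN = np.concatenate([[0], np.cumsum(sizes)])
    G = (cumH[b + 1] - cumH[a]) / (cumN[b + 1] - cumN[a])    # Eq. (12)
    G = np.sign(G) * np.sqrt(np.abs(G))                      # power norm.
    return G / (np.linalg.norm(G) + 1e-12)                   # L2 norm.
```

Each window then costs a constant number of vector operations instead of re-encoding every feature it covers, which is consistent with the more than 10x detection speed-up reported in Table 4.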

3. Experimental Setup: Approach Overview

Firstly, we extract local spatio-temporal features from videos using the Improved Dense Trajectories (IDT) [27]: we apply dense sampling and track the extracted interest points using a dense optical flow field. Then, we extract local spatio-temporal video volumes around the detected trajectories, and we represent each trajectory using the Histogram of Oriented Gradients (HOG), capturing appearance information, and the Trajectory Shape (TS), Histogram of Optical Flow (HOF) and Motion Boundary Histogram (with MBH-x and MBH-y components) descriptors, capturing motion information. The extracted IDT features provide a good coverage of a video and ensure the extraction of meaningful information. As a result, they have been shown to achieve excellent results for various recognition tasks in videos, and they have been widely used in the literature [1, 27].

To represent a video, we calculate a separate video representation for each descriptor independently (i.e. HOG, etc.) using the IFV / the proposed Spatio-Temporal IFV (Sec. 2.2), and we concatenate the obtained representations using late fusion (i.e. per video, we concatenate the IFV-based video representation from HOG with the video representation from HOF, etc.).

For violence recognition, we use a linear Support Vector Machine (SVM) [3] classifier, which has been shown to achieve excellent results with high-dimensional data such as Fisher Vectors; typically, if the number of features is large, there is no need to map the data to a higher-dimensional space [10]. Moreover, linear SVMs have been shown to be efficient in both the training and prediction steps.

For violence detection, we use the Fast Sliding Window-based framework explained in Sec. 2.3.
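As an illustration of this pipeline, here is a hedged scikit-learn sketch (our reconstruction; `ifv_encode` refers to the Sec. 2.1 sketch above, and the use of `LinearSVC` rather than the LIBSVM binding cited in [3] is our assumption): one IFV per descriptor channel, late fusion by concatenation, and a linear SVM.

```python
import numpy as np
from sklearn.svm import LinearSVC

CHANNELS = ['HOG', 'TS', 'HOF', 'MBHx', 'MBHy']  # IDT descriptor channels

def video_representation(features_by_channel, gmms):
    """Late fusion: encode each descriptor channel with its own IFV
    (or Spatio-Temporal IFV) and concatenate the results per video.

    features_by_channel: dict channel -> (T, D_c) array for one video.
    gmms: dict channel -> (w, mu, var) GMM parameters for that channel.
    """
    parts = [ifv_encode(features_by_channel[c], *gmms[c]) for c in CHANNELS]
    return np.concatenate(parts)

# Illustrative training step, one fused vector per video
# (labels: 1 = violent, 0 = non-violent):
#   X_train = np.stack([video_representation(f, gmms) for f in train_videos])
#   clf = LinearSVC(C=1.0).fit(X_train, y_train)
```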
4. Experiments

4.1. Datasets

We use 4 benchmark datasets for evaluation, and we follow the recommended evaluation protocols provided by the authors of the datasets. We use the Violent-Flows dataset [9], the Hockey Fight dataset [23] and the Movies dataset [23] for the violence recognition task. We use the Violent-Flows 21 dataset [9] for the violence detection task. Sample video frames from the datasets are presented in Fig. 2.

Figure 2. Sample video frames from the Violent-Flows (first row), Hockey Fight (second row), Movies (third row), and Violent-Flows 21 (fourth row) datasets.

The Violent-Flows (Crowd Violence \ Non-violence) dataset [9] contains 246 videos with real-world footage of crowd violence. The videos are collected from YouTube and contain a variety of scenes, e.g. streets, football stadiums, volleyball and ice hockey arenas, and schools.
The dataset is divided into 5 folds and we follow the recommended 5-fold cross-validation to report the performance.

The Hockey Fight dataset [23] contains 1000 real-world videos: 500 violent scenes (between two or more participants) and 500 non-violent scenes. The videos are divided into 5 folds, where each fold contains 50% violent and 50% non-violent videos, and we follow the recommended 5-fold cross-validation to report the performance.

The Movies dataset [23] contains 200 video clips: 100 videos with a person-on-person fight (collected from action movies) and 100 videos with non-fight scenarios (collected from various action recognition datasets). This dataset contains a wider variety of scenes than the Hockey Fight dataset, and the scenes are captured at different resolutions [23]. The videos are divided into 5 folds and we follow the recommended 5-fold cross-validation to report the performance. Although this dataset does not contain surveillance videos, it has been widely used in the past for the violence recognition task.

The main differences between the above datasets are: various scenarios and scenes, variations within the violence/fight and non-violence/non-fight classes, the number of training and testing samples, pose and camera viewpoint variations, motion blur, background clutter, occlusions, and illumination conditions.

The Violent-Flows 21 dataset (Crowd Violence \ Non-violence 21 Database) [9] contains 21 videos with real-world footage of crowd violence. The videos are collected from YouTube, they are of spatial size 320x240 pixels, and they begin with non-violent behavior which turns violent mid-way through the video. Training is performed using 227 out of the 246 videos from the Violent-Flows dataset; 19 videos are removed as they are included in the detection set. The original annotations are not available. Therefore, as proposed in the original paper [9], we manually mark the frame in each video where the transition from non-violent to violent behavior happens.¹

¹ Differences can exist between our annotations and those of [9].

4.2. Implementation Details

We use a GMM with K = 128 and K = 256 to compute the IFV / Spatio-Temporal IFV, and we set the number of Gaussians using 5-fold cross-validation. To increase clustering precision, we initialize the GMM 10 times and we keep the codebook with the lowest error. To limit the complexity, we cluster a subset of 100,000 randomly selected training features. To report recognition results, we use the Mean Class Accuracy (MCA) metric. For violence detection, we use six temporal windows of length {5i}_{i=1}^{6} frames (i.e. 5, 10, ..., 30 frames) and a window stride equal to 1 frame. To report detection results, we use the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) metrics.
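A minimal sketch of this codebook-training step, assuming scikit-learn's GMM implementation (our illustration; the authors may have used a different library, and we read "lowest error" as keeping the highest-likelihood run, which is what `n_init` does):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_codebook(train_features, K=128, n_init=10, n_sample=100_000, seed=0):
    """Fit the diagonal-covariance GMM used by the IFV encodings (Sec. 4.2):
    sample 100k training features to limit complexity, run 10
    initializations, and keep the best model."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(train_features),
                     size=min(n_sample, len(train_features)), replace=False)
    gmm = GaussianMixture(n_components=K, covariance_type='diag',
                          n_init=n_init, random_state=seed)
    gmm.fit(train_features[idx])
    # Parameters in the form expected by the IFV sketches above.
    return gmm.weights_, gmm.means_, gmm.covariances_
```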
4.3. Results: Violence Recognition

For violence recognition, we evaluate the standard IFV approach (the baseline) and our IFV with spatio-temporal information (STIFV, Sec. 2.2). Moreover, we evaluate the IFV with 11 different spatio-temporal grids (1x1x2, 1x2x1, 2x1x1, 1x1x3, 1x3x1, 3x1x1, 2x2x2, 2x2x3, 2x2x1, 2x1x2, and 1x2x2). The evaluations are performed on 3 datasets: the Violent-Flows, Hockey Fight and Movies datasets. The results are presented in Table 1.

Approach       Size  Violent-F.  Hockey F.  Movies
Baseline         1     93.5        93.2      97
Ours: STIFV     ~1     96.4        93.4      99
IFV 1x1x2        2     94.0        93.3      98.0
IFV 1x2x1        2     94.3        93.6      97.5
IFV 2x1x1        2     94.3        93.2      97.5
IFV 1x1x3        3     93.5        93.1      98.5
IFV 1x3x1        3     94.3        93.2      97.0
IFV 3x1x1        3     93.5        93.2      97.5
IFV 2x2x2        8     93.5        93.4      97.0
IFV 2x2x3       12     93.1        93.4      97.0
IFV 2x2x1        4     93.9        93.8      97.5
IFV 2x1x2        4     93.5        92.9      98.0
IFV 1x2x2        4     93.9        93.5      97.5

Table 1. Evaluation results: the baseline (IFV with a 1x1x1 grid), our IFV with spatio-temporal information (STIFV), and the IFV with various spatio-temporal grids on the Violent-Flows, Hockey Fight, and Movies datasets. The second column gives the size of the video representation relative to the size of the baseline representation.

In all cases, our STIFV approach outperforms the IFV method, and achieves better or similar performance as compared to the IFV with a spatio-temporal grid. Note that finding an appropriate size of the spatio-temporal grid is time consuming (there are 3 additional parameters to learn). Moreover, a spatio-temporal grid-based representation requires significantly more memory (up to 12 times more in our experiments, see Table 1).

Then, we compare our approach with the state-of-the-art. The comparison on the Violent-Flows, Hockey Fight, and Movies datasets is presented in Table 2. Note that our approach significantly outperforms the remaining techniques, achieving up to 11% better results (on the Violent-Flows dataset).

In summary, for violence recognition, the proposed improvement (IFV with spatio-temporal information) boosts the state-of-the-art IFV, and achieves better or similar accuracy (keeping the representation more compact) as compared to the IFV with spatio-temporal grids. Moreover, our approach significantly outperforms the existing techniques on all three violence recognition datasets.
Violent-Flows Dataset
Approach                   Acc. (%)
HNF [16]                    56.5
HOG [16]                    57.4
HOF [16]                    58.3
LTP [30]                    71.5
Jerk [6]                    74.2
Interaction Force [20]      74.5
ViF [9]                     81.3
HOT [22]                    82.3
F^L|F^Cv [21]               85.4
Our Approach                96.4

Hockey Fight Dataset
Approach                   Acc. (%)
LTP [30]                    71.9
ViF [9]                     82.9
STIP-HOF + HIK [23]         88.6
Extreme Accelerations [5]   90.1
MoSIFT + HIK [23]           90.9
BoW-MoSIFT [5]              91.2
STIP-HOG + HIK [23]         91.7
Our Approach                93.7

Movies Dataset
Approach                   Acc. (%)
STIP-HOG + HIK [23]         49
STIP-HOF + HIK [23]         59
BoW-MoSIFT [5]              86.5
MoSIFT + HIK [23]           89.5
ViF [9]                     91.3
Jerk [6]                    95.0
Interaction Force [20]      95.5
F^L|F^Cv [21]               96.9
Extreme Accelerations [5]   98.9
Our Approach                99.5

Table 2. Comparison with the state-of-the-art on the Violent-Flows (top), Hockey Fight (middle), and Movies (bottom) datasets.

4.4. Results: Violence Detection

We evaluate our Fast Sliding Window-based approach on the Violent-Flows 21 dataset.

Firstly, we evaluate the accuracy of the sliding window / Fast Sliding Window approach (both techniques achieve the same results). The results and the comparison with the state-of-the-art are presented in Figure 3 (using ROC curves) and in Table 3 (using the AUC metric).

Figure 3. ROC curves: our approach (on the left) vs. the state-of-the-art (on the right) on the Violent-Flows 21 dataset.

Approach  LTP   HOG   HOF   HNF   ViF   Ours
AUC       79.9  61.8  57.6  59.9  85.0  87.0

Table 3. AUC metric on the Violent-Flows 21 dataset [9].

Then, we evaluate the speed of the Improved Dense Trajectories (IDT), and we compare the speed of the standard sliding window approach with that of our Fast Sliding Window technique (Sec. 2.3). The results are presented in Table 4. We observe that the proposed Fast Sliding Window technique is more than 10 times faster than the standard sliding window approach.

Process                     Processing Time (fps)
Feature Extraction (IDT)     5.7
Sliding Window               9.28
Ours: Fast Sliding Window   99.21

Table 4. Average processing time on the Violent-Flows 21 dataset using a single Intel(R) Xeon(R) CPU E5-1630 v3 @ 3.70GHz.

5. Conclusions

We have proposed an extension of the Improved Fisher Vectors (IFV) for violence recognition in videos, which allows a video to be represented using both local features and their spatio-temporal positions. The proposed extension has been shown to boost the IFV, achieving better or similar accuracy (while keeping the representation more compact) as compared to the IFV with a spatio-temporal grid. Moreover, our approach has been shown to significantly outperform the existing techniques on three violence recognition datasets. Then, we have studied the popular sliding window approach for violence detection. We have re-formulated the IFV and have used the summed area table data structure to significantly speed up the violence detection framework. The evaluations have been performed on 4 state-of-the-art datasets.

Acknowledgements

The research leading to these results has received funding from the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme FP7/2007-2013/ under REA grant agreement no [324359]. However, the views and opinions expressed herein do not necessarily reflect those of the financing institutions.
References

[1] P. Bilinski and F. Bremond. Video Covariance Matrix Logarithm for Human Action Recognition in Videos. In International Joint Conference on Artificial Intelligence (IJCAI), 2015.
[2] P. Bilinski, M. Koperski, S. Bak, and F. Bremond. Representing Visual Appearance by Video Brownian Covariance Descriptor for Human Action Recognition. In IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), 2014.
[3] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[4] X. Cui, Q. Liu, M. Gao, and D. N. Metaxas. Abnormal Detection Using Interaction Energy Potentials. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[5] O. Deniz, I. Serrano, G. Bueno, and T.-K. Kim. Fast Violence Detection in Video. In International Conference on Computer Vision Theory and Applications (VISAPP), 2014.
[6] A. Datta, M. Shah, and N. D. V. Lobo. Person-on-Person Violence Detection in Video Data. In International Conference on Pattern Recognition (ICPR), 2002.
[7] F. D. de Souza, G. C. Chavez, E. A. do Valle, and A. de A. Araujo. Violence Detection in Video Using Spatio-Temporal Features. In SIBGRAPI Conference on Graphics, Patterns and Images, 2010.
[8] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior Recognition via Sparse Spatio-Temporal Features. In IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.
[9] T. Hassner, Y. Itcher, and O. Kliper-Gross. Violent Flows: Real-Time Detection of Violent Crowd Behavior. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2012.
[10] C.-W. Hsu, C.-C. Chang, and C.-J. Lin. A Practical Guide to Support Vector Classification. Technical report, Department of Computer Science, National Taiwan University, 2003.
[11] V. Kantorov and I. Laptev. Efficient feature extraction, encoding and classification for action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[12] A. Klaser, M. Marszalek, C. Schmid, and A. Zisserman. Human Focused Action Localization in Video. In Trends and Topics in Computer Vision, pages 219-233. Springer, 2010.
[13] J. Krapac, J. Verbeek, and F. Jurie. Modeling Spatial Layout with Fisher Vectors for Image Categorization. In IEEE International Conference on Computer Vision (ICCV), 2011.
[14] J. Krapac, J. Verbeek, and F. Jurie. Spatial Fisher Vectors for Image Categorization. Research Report RR-7680, INRIA, 2011.
[15] I. Laptev. On Space-Time Interest Points. International Journal of Computer Vision (IJCV), 64(2-3):107-123, 2005.
[16] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[17] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
[18] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos. Anomaly Detection in Crowded Scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[19] S. McCann and D. G. Lowe. Spatially Local Coding for Object Recognition. In Asian Conference on Computer Vision (ACCV), 2012.
[20] R. Mehran, A. Oyama, and M. Shah. Abnormal Crowd Behavior Detection using Social Force Model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[21] S. Mohammadi, H. Kiani, A. Perina, and V. Murino. Violence detection in crowded scenes using substantial derivative. In IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), 2015.
[22] H. Mousavi, S. Mohammadi, A. Perina, R. Chellali, and V. Murino. Analyzing Tracklets for the Detection of Abnormal Crowd Behavior. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2015.
[23] E. B. Nievas, O. D. Suarez, G. B. Garcia, and R. Sukthankar. Violence Detection in Video Using Computer Vision Techniques. In International Conference on Computer Analysis of Images and Patterns (CAIP), 2011.
[24] F. Perronnin, J. Sanchez, and T. Mensink. Improving the Fisher Kernel for Large-Scale Image Classification. In European Conference on Computer Vision (ECCV), 2010.
[25] J. Sanchez, F. Perronnin, and T. De Campos. Modeling the Spatial Layout of Images Beyond Spatial Pyramids. Pattern Recognition Letters, 33(16):2216-2223, 2012.
[26] O. Tuzel, F. Porikli, and P. Meer. Region Covariance: A Fast Descriptor for Detection And Classification. In European Conference on Computer Vision (ECCV), 2006.
[27] H. Wang and C. Schmid. Action Recognition with Improved Trajectories. In IEEE International Conference on Computer Vision (ICCV), 2013.
[28] L. Wang, Y. Qiao, and X. Tang. Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[29] G. Willems, T. Tuytelaars, and L. Van Gool. An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector. In European Conference on Computer Vision (ECCV), 2008.
[30] L. Yeffet and L. Wolf. Local Trinary Patterns for human action recognition. In IEEE International Conference on Computer Vision (ICCV), 2009.
