Temporal Anomaly Forged Scene Detection
by Referring Video Discontinuity Features
Govindraj Chittapur1(B) , S. Murali2 , and Basavaraj S. Anami3
1 Basaveshwar Engineering College, Bagalkot 587 102, India
gbchittapur@gmail.com
2 Maharaja Institute of Technology Mysore, Srirangapatna 571 477, India
murali@mitmysore.in
3 KLE Institute of Technology, Hubli 580 027, India
Abstract. Today we live in a techno-social world in which news and social media surround us, and published videos and images are widely treated as trustworthy sources. With high-end media editors easily available, however, believing and trusting published media has become a challenge. CCTV was introduced to capture live video and images for security purposes, yet owing to storage limitations, managing CCTV footage and identifying anomalous forged scenes in huge volumes of footage remains an open problem; it is the focus of our research. In this paper, we propose forged anomalous scene detection by referring to discontinuity features, using open-source deep learning together with recurrent and transfer learning approaches.
We propose to learn a classification model that identifies frames that are anomalous in nature. The input video is then separated into frames, and each frame is checked for a forged anomaly scene. The frames classified as forged anomaly scenes are arranged temporally to maintain the continuity of the video, ensuring that the rendered scene preserves its temporal continuity. We propose to detect doctoring in the video by treating a sudden change in the scene as case-specific doctoring. We propose to learn the flow vectors for both continuity and discontinuity. For a test video, we generate the flow vectors using the Lucas-Kanade technique and then match their patterns against the trained model. The decision obtained on the flow vectors is then superimposed on the input video.
Keywords: Anomaly-forged-scene · Video-discontinuity-features · CCTV
footage · Transfer learning · Temporal continuity · Lucas-Kanade-technique
1 Introduction
Our inclination to form assumptions about what we see is shaped by our preferences. This applies not only to the first-hand assimilation of real-world events but also to the second-hand absorption of their pictorial representations. Of all the kinds of such representations that have been developed, none evokes more belief than video or has a greater substantial impact on our perception. Viewing a recording of any
incident can act as a proxy for attending the real event itself, and this influence on the
interpretation makes video documentation an exceptionally reliable source of proof.
Videos have robust evidentiary value, especially from a forensic point of view. Digital recordings, for example, can provide some of the most inculpatory facts in a court of law. Videos nowadays are often sufficient to provide an eyewitness account [1], replacing other kinds of forensic evidence such as circumstantial evidence or DNA and palm prints. As it is usually difficult to distrust the proof of one's own eyes,
the testimony of a video often becomes indisputable. But to be able to use any form of
evidence it is necessary to be mindful of its limitations, including its vulnerability to
deliberate and conscious semantic manipulations.
Digital video technology has evolved rapidly in recent years. Thanks to technological advancements, it is now relatively simple to record large volumes of video. A vast amount of digital information is available, including news, movies, sports, and
documentaries, among other things. The requirement for digital multimedia interpreta-
tion and retrieval has grown in importance as the volume of multimedia data has grown
rapidly, as has the desire for quick access to pertinent data. The first step in analyzing
video footage for indexing, browsing, and searching is to break it down into shots. A
shot is a series of visual frames taken by a single camera in a continuous sequence. Shot
transitions can be classified into two types: sudden changes (cuts) and gradual transitions.
Wipes, fades, and dissolves are common gradual alterations that are more difficult to
detect than sudden changes. Furthermore, storing large amounts of video data is difficult.
Because end users want to receive all essential parts of data, it is critical to swiftly get and
browse large volumes of data. Furthermore, due to its economic viability, particularly
for video streaming applications, techniques for automatic video content summarizing
have attracted a lot of attention. A brief video summary should, intuitively, highlight the
video content and have little redundancy while maintaining balanced coverage of the original video. A video summary, on the other hand, should not be confused with a video trailer, in which certain information is purposely suppressed in order to increase the allure of a video.
A raw audiovisual stream is an unstructured data stream made up of a collection of
pictures. A video shot consists of multiple frames, and keyframes can represent its visual
content. A video scene can be described by sets of keyframes derived from the video. In general, every discontinuity detection system operates in two phases. The first is the scoring stage, in which a score is assigned to each pair of consecutive picture frames in a digital video, signifying their similarity or dissimilarity. The second phase is the decision process: all previously measured scores are evaluated, and a cut is declared wherever a score is deemed high. First, because even small threshold exceedances produce a hit, phase one must spread the scores widely so as to maximize the mean difference between the "cut" and "no-cut" scores. Second, the threshold must be carefully selected; typically, useful values are obtained using statistical methods.
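To make the two-phase scheme concrete, the following minimal Python sketch scores each consecutive frame pair with a Bhattacharyya histogram distance and thresholds the scores in the decision phase. The scoring function and the threshold value are illustrative assumptions, not taken from any particular system discussed here.

```python
import cv2
import numpy as np

def histogram_score(frame_a, frame_b, bins=64):
    """Phase one: dissimilarity score for one pair of consecutive frames."""
    h_a = cv2.calcHist([frame_a], [0], None, [bins], [0, 256])
    h_b = cv2.calcHist([frame_b], [0], None, [bins], [0, 256])
    cv2.normalize(h_a, h_a)
    cv2.normalize(h_b, h_b)
    return cv2.compareHist(h_a, h_b, cv2.HISTCMP_BHATTACHARYYA)

def detect_cuts(video_path, threshold=0.4):
    """Phase two: declare a cut wherever the score exceeds the threshold."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    cuts, idx = [], 0
    while ok:
        ok, frame = cap.read()
        if not ok:
            break
        idx += 1
        gray_prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        gray_cur = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if histogram_score(gray_prev, gray_cur) > threshold:
            cuts.append(idx)
        prev = frame
    cap.release()
    return cuts
```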
A video is hierarchically structured, so it can be processed and distributed in small components (shots), indexed by consecutive keyframes, and reassembled at the receiving end. Therefore, only the relevant shots need to be resent when transmission errors occur. Additionally, using keyframes turns complex video retrieval tasks into simple image comparison exercises between the corresponding keyframes.
For a user query, the server only has to compare keyframes and issue an I/O file operation to retrieve the relevant video segment for transmission to the client. As a result, video discontinuity detection can effectively improve bandwidth utilization, reduce the amount of data-stream manipulation, and save computation time and I/O access. An anomaly in a video can be defined as a sudden change in the scene. For example, in CCTV footage of a car park, a lorry arriving to park is an anomaly. In a night surveillance video, when no one is around, someone suddenly entering the scene is considered an anomaly.
2 Literature Survey
In the literature, we can find many works in the area of video discontinuity detection.
This field can also be called video segmentation, as it essentially produces different segments of the same video.
Most established video segmentation algorithms compare frame differences. Differences
in pixel values and histograms in the uncompressed domain, or DCT coefficients, macro-
block types, and motion vectors in the compressed domain, are examples of contrasted
differences. Zhang et al. compare the DCT coefficients of comparable blocks of adjacent
video frames using a pair-wise comparison technique. A block is marked as changed if the disparity exceeds a defined threshold T1. A transition between two consecutive frames is declared if the number of changed blocks is greater than another threshold T2. To detect
scene changes, Meng et al. use the variance of DC coefficients in I and P frames, as well
as motion vectors. Calculating the variance of the DC time sequence for I and P frames
and detecting parabolic patterns in this curve are used to detect gradual transitions. The
ratios of intra-coded to forward-predicted MBs and of backward-predicted to forward-predicted MBs in the current frame are used to detect cuts. The main drawbacks of the approaches described above are that ad-hoc threshold selection is difficult to adapt to different types of videos, and that camera motion and substantial object movement frequently reduce detection accuracy.
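A hedged sketch of the pair-wise block comparison idea follows; for simplicity it operates on raw pixel blocks rather than on DCT coefficients of the compressed stream, and the block size and the thresholds T1 and T2 are illustrative placeholders.

```python
import numpy as np

def changed_block_cut(frame_a, frame_b, block=16, t1=30.0, t2_fraction=0.3):
    """A block is 'changed' if its mean absolute difference exceeds T1;
    a cut is declared if the fraction of changed blocks exceeds T2."""
    h, w = frame_a.shape[:2]
    changed = total = 0
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            a = frame_a[y:y + block, x:x + block].astype(np.float32)
            b = frame_b[y:y + block, x:x + block].astype(np.float32)
            total += 1
            if np.abs(a - b).mean() > t1:
                changed += 1
    return changed / total > t2_fraction
```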
Yeo and Liu proposed detecting scene changes on the DC images of the encoded data. The DC sequence is first reconstructed using an approximation method before being used for scene-change detection. They discussed sequential pixel differences and statistical color comparisons. The difference between consecutive pixels is sensitive to camera and object motion. DC sequences, on the other hand, are less sensitive to camera and object movements since they are smoothed versions of the full images.
Color statistical comparisons are less susceptible to motion, but they are more expensive
to compute.
A two-pass technique was proposed by Zhang et al. Using the pairwise DCT coef-
ficient comparison of I frames, they first find the areas of probable transitions, camera
actions, and object movements. The subsequent pass’s purpose is to fine-tune and vali-
date the break points discovered by the first pass. The specific cut locations are found by counting the number of motion vectors (MVs), M, in the designated zones, where M denotes the number of MVs in P frames and the smaller of the numbers of forward and backward non-zero MVs in B frames. M < T (where T is a threshold close to zero) is an effective indicator of a cut in both B and P frames. The DCT variations of I frames are used to find gradual transitions using an adaptation of the twin comparison approach.
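The cut indicator itself reduces to comparing a motion-vector count against a near-zero threshold. A small illustrative sketch, assuming the motion vectors have already been parsed from the compressed stream (the threshold value is a placeholder):

```python
def is_cut_candidate(motion_vectors, t=2):
    """M < T (T close to zero) flags a cut candidate, where M counts the
    non-zero motion vectors of the frame (for P frames, or the smaller of
    the forward/backward counts for B frames, computed upstream)."""
    m = sum(1 for mv in motion_vectors if mv != (0, 0))
    return m < t
```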
Feng et al. presented an approach based on macroblock types and bitrate information. It is a fast and inexpensive solution, even if it is limited to cut detection. A significant difference in bitrate between two consecutive I or P frames implies a cut. The number of backward-predicted macroblocks is used to detect cuts on B pictures, similarly to Meng et al. (1995). The system can pinpoint the precise cut locations. It works in a hierarchical manner, first detecting a suspected cut across two I frames, then between the GOP's P frames, and finally inspecting the B frames.
Boccignone et al. suggested a novel method for splitting a video into shots based on a foveated representation of the frames. Their shot-change detection approach estimates a consistency measure of the fixation sequences generated by an ideal observer looking at the video at each time interval. Rather than a series of specialized techniques, their approach tries to detect both abrupt and gradual transitions between frames using a single technique. However, it only detects shot-cut and dissolve transitions, and it detects shot boundaries in the uncompressed domain. Algorithms in the uncompressed domain are computationally expensive, so the computation time is longer than in the compressed domain.
Model-based approaches are another type of shot transition detection tool. Based
on statistical sequential analysis and operating on compressed multimedia bitstreams,
Lelescu and Schonfeld present a novel one-pass, real-time technique for scene change detection. They model video frames as a stochastic process, with changes in the process's properties reflecting scene changes.
Bescos et al. describe a different detection model that works for both abrupt and gradual transitions. They map the inter-frame similarity space onto a decision space that is best suited to reaching a sequence-independent threshold. Unsupervised
and supervised classification techniques are the third category of shot transition detection
techniques.
Gao and Tang examine a video shot boundary detection approach that uses the frame's histogram-based metrics (HDM) and spatial difference metrics (SDM) as features. The shot transition detection problem is solved by dividing the feature space into two classes: "scene change" and "no scene change". The "scene change" class is further divided into two types: abrupt and gradual transitions. Gunsel et al. treat video shot detection as a two-class clustering problem, with "scene change" and "no scene change" as the two classes. They propose applying the K-means clustering algorithm to the colour histogram similarity measure between successive frames to classify the frames, as sketched below.
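A minimal sketch of this unsupervised two-class idea, assuming the per-pair colour histogram similarity scores have already been computed; taking the cluster with the lower mean similarity as "scene change" is an assumption of this sketch, not a detail given by the survey.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_cut_frames(hist_similarities):
    """Cluster per-pair histogram similarity scores into two classes and
    return the frame-pair indices assigned to the 'scene change' cluster."""
    x = np.asarray(hist_similarities, dtype=np.float64).reshape(-1, 1)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(x)
    # The cluster with the lower mean similarity is taken as "scene change".
    change_label = int(np.argmin(km.cluster_centers_.ravel()))
    return [i for i, lbl in enumerate(km.labels_) if lbl == change_label]
```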
A supervised classification technique for video shot segmentation is presented by Qi et al. Frame differences and inter-frame features, such as the likelihood of camera calibration at consecutive frames and the likelihood that the current frame is a gray frame, are employed as frame features. Three types of classifiers, the k-nearest neighbour classifier, the Naïve Bayes probabilistic classifier, and the support vector machine, are used to classify the frames into "non-cut frames" and "cut frames", respectively. A second-level binary classifier is then applied to the "non-cut frames" to separate "gradual transition frames" from "shot frames" (see the sketch below).
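A hedged sketch of this supervised setup using scikit-learn: the three classifier families named above are trained on per-frame feature vectors labelled cut (1) versus non-cut (0). Feature extraction and the second-level gradual-transition classifier are omitted, and all hyperparameters are illustrative.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def train_cut_classifiers(features, labels):
    """Fit kNN, Naive Bayes, and SVM classifiers on labelled frame features
    (labels: 1 = 'cut frame', 0 = 'non-cut frame')."""
    models = {
        "knn": KNeighborsClassifier(n_neighbors=5),
        "naive_bayes": GaussianNB(),
        "svm": SVC(kernel="rbf"),
    }
    for model in models.values():
        model.fit(features, labels)
    return models
```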
This concludes our brief overview of anomaly scene detection methods. Most of the surveyed work focuses on traditional approaches and on relatively few features of video segments. We focus on single-scene video anomaly detection since it has the most immediate use in real-world applications (e.g., surveillance cameras monitoring one location for extended periods), and it is also the most common use case in video anomaly detection. For such applications, it is more time-efficient to have a computer perform this task than a person, since nothing of interest happens for long periods of time. In fact, this is the driving force behind modern intelligent video surveillance systems. Computer vision analytics not only increases the efficiency of video monitoring but also reduces the burden of live monitoring on humans.
3 Data Set Design Issues
Several datasets for single-scene video anomaly detection are now available, including SULFA [1], REWIND [2], SYSUOBJFORGE [3], VTD [4], UCSD Ped1 & Ped2 [5], CUHK Avenue [4], Subway [6], UMN [7], and Street Scene [8]. We chose the SULFA, REWIND, VTD, and SYSUOBJFORGE datasets for the work proposed in this paper. We chose these datasets because they have been widely used [1–4], providing us with benchmarks for the accuracy reported in recent literature.
Video anomaly detection is still a long way from being fully realized, owing to the difficulty of modelling anomalous events and the sparsity of their occurrences in datasets. Furthermore, generative techniques for video anomaly detection have received little attention in the past. In this research, we examine a variational autoencoder technique for forged-scene anomaly identification (Table 1).
Table 1. Overview of forensic forgery dataset

| Name of investigated dataset | Videos tested (original) | Videos tested (doctored) | Tested video frame resolution | Forgery approach used for creating dataset |
|---|---|---|---|---|
| SULFA [31] | 10 | 30 | 320 × 240 | Copy-move |
| REWIND [33] | 10 | 10 | 320 × 240 | Copy-move |
| VTD [32] | 26 | 30 | 1280 × 720 | Copy-move/splicing/swapping |
| SYSU-OBJ-FORGE [34] | 100 | 100 | 1280 × 720 | Copy-create |
| GRIP [30] | 10 | 10 | 320 × 240 | Copy-create |
The most significant aspects for video summarization are the temporal relationships between the frames of the video, which we propose to use to solve the problem of video discontinuity detection. Figure 1 depicts the overall model of the proposed system. For each scene, we extract optical flow features, which are then described using a single vector. These feature vectors are fed into a training module, which learns the hypothesis. The hypothesis is then tested on pairs of test frames: for each test frame pair, we extract the same feature and pass it to the model, which determines whether the frame is a crucial frame or not. This process is performed for all frames, resulting in faster and more accurate video anomaly detection.
Fig. 1. Schematic view of proposed forged anomaly detection using Video Discontinuity feature
set.
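As a sketch of the optical flow feature extraction in Fig. 1, the following Python/OpenCV snippet tracks corner features with pyramidal Lucas-Kanade and summarizes the displacement field of one frame pair as a single vector. The exact descriptor used in our pipeline is a design choice; the mean/standard-deviation summary below is one plausible assumption.

```python
import cv2
import numpy as np

def flow_feature(prev_gray, cur_gray, max_corners=200):
    """Track good features with pyramidal Lucas-Kanade and summarize the
    displacement field as [mean |d|, std |d|, mean angle, std angle]."""
    pts = cv2.goodFeaturesToTrack(prev_gray, max_corners, 0.01, 7)
    if pts is None:
        return np.zeros(4, dtype=np.float32)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, pts, None)
    good = status.ravel() == 1
    d = (nxt[good] - pts[good]).reshape(-1, 2)
    if len(d) == 0:
        return np.zeros(4, dtype=np.float32)
    mag = np.linalg.norm(d, axis=1)
    ang = np.arctan2(d[:, 1], d[:, 0])
    return np.array([mag.mean(), mag.std(), ang.mean(), ang.std()],
                    dtype=np.float32)
```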
4 Implementation and Result
We took the training set and separated the training video frames into temporal sequences
before we started creating and training any models. Using the sliding window technique,
these temporal sequences have a size of ten. We next scaled each frame to 256 × 256
pixels to ensure that all input frames had the same resolution. By dividing each pixel by
256, the pixel values were scaled to between 0 and 1. Because of the large number of model parameters, we used data augmentation in the temporal dimension: concatenating frames with varied skipping strides produced additional training sequences, as sketched below.
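The preprocessing just described can be sketched as follows; the helper names are ours, and the stride set is an illustrative assumption.

```python
import cv2
import numpy as np

def preprocess(frame):
    """Resize to 256 x 256 and scale pixel values to [0, 1] by dividing by 256."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, (256, 256))
    return resized.astype(np.float32) / 256.0

def make_sequences(frames, window=10, strides=(1, 2, 3)):
    """Build sliding-window sequences of length 10; larger skipping strides
    augment the training set in the temporal dimension."""
    sequences = []
    for stride in strides:
        span = (window - 1) * stride + 1  # frames covered by one window
        for start in range(len(frames) - span + 1):
            sequences.append(np.stack(frames[start:start + span:stride]))
    return np.asarray(sequences, dtype=np.float32)
```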
We begin by introducing the encoder and decoder. The encoder receives a chronologically ordered sequence of frames as input and is split into two parts: a spatial encoder and a temporal encoder, with the spatial encoder's output serving as the temporal encoder's input for motion encoding. The decoder's function is to mirror the encoder and reconstruct the video sequence. The architecture of our LSTM baseline is shown in Fig. 2.
Fig. 2. Base-line architecture diagram for LSTM.
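A minimal Keras sketch of such a spatial-temporal autoencoder (not the exact network of Fig. 2): TimeDistributed convolutions form the spatial encoder, ConvLSTM layers the temporal encoder, and transposed convolutions mirror the spatial encoder to reconstruct the sequence. Layer sizes are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_autoencoder(window=10, size=256):
    inp = layers.Input(shape=(window, size, size, 1))
    # Spatial encoder, applied frame by frame
    x = layers.TimeDistributed(
        layers.Conv2D(64, 11, strides=4, padding="same", activation="relu"))(inp)
    x = layers.TimeDistributed(
        layers.Conv2D(32, 5, strides=2, padding="same", activation="relu"))(x)
    # Temporal encoder for motion
    x = layers.ConvLSTM2D(32, 3, padding="same", return_sequences=True)(x)
    x = layers.ConvLSTM2D(16, 3, padding="same", return_sequences=True)(x)
    x = layers.ConvLSTM2D(32, 3, padding="same", return_sequences=True)(x)
    # Decoder mirrors the spatial encoder to reconstruct the input sequence
    x = layers.TimeDistributed(
        layers.Conv2DTranspose(64, 5, strides=2, padding="same",
                               activation="relu"))(x)
    out = layers.TimeDistributed(
        layers.Conv2DTranspose(1, 11, strides=4, padding="same",
                               activation="sigmoid"))(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model
```

At test time, a high reconstruction error on a sequence indicates frames the model has not seen in normal training data, i.e., candidate anomalies.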
This work uses TensorFlow, a widely used open-source deep-learning library (Clark 2018). For several years, TensorFlow has been the most ubiquitous deep-learning library (Hale 2018). TensorFlow not only supports high-performance computing through its cloud service but also has an active community maintaining and updating the library. TensorFlow Hub is a library of reusable modules for machine learning. A reusable machine learning module is a self-contained piece of a TensorFlow graph, together with its weights and assets, that can be reused across a variety of tasks in a process known as transfer learning. Compared with the standard training cycle of a neural network, a module reused through transfer learning is typically trained on a much smaller dataset, and transfer learning brings further benefits, for example better generalization and faster training. It usually takes hundreds of GPU hours or more to create a new image-recognition module from scratch; applying transfer learning to a pre-trained module significantly reduces the required size of the training dataset, which makes it suitable for classification problems with relatively small datasets. Figure 3 illustrates how transfer learning replaces the original layer and generates a new layer for classifying new labels. Figures 4 and 5 show the anomaly scenes extracted by the proposed algorithm using transfer learning on representative videos from the VTD and GRIP datasets. The plot in Fig. 6 shows the anomaly-scene video discontinuity points extracted using optical flow vectors under the transfer learning approach.
Fig. 3. In transfer learning, the original layer is replaced with a layer whose weights and biases are fine-tuned to distinguish images with new labels.
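A hedged sketch of this transfer-learning setup using TensorFlow Hub: a pre-trained image feature extractor is frozen and only a new classification head is trained on the new labels (here two classes, forged-anomaly frame vs. normal frame). The module URL is one publicly available option, not necessarily the one used in our experiments.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Frozen pre-trained feature extractor; only the new head is trained.
feature_extractor = hub.KerasLayer(
    "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/5",
    input_shape=(224, 224, 3),
    trainable=False)

model = tf.keras.Sequential([
    feature_extractor,
    tf.keras.layers.Dense(2, activation="softmax"),  # new label layer
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_frames, train_labels, epochs=5)  # a small dataset suffices
```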
Fig. 4. Resultant anomaly scene extracted from representative trained video from VTD [35]
dataset
Fig. 5. Resultant anomaly scene extracted from representative trained video from GRIP [35]
dataset
Fig. 6. Plot of the video-discontinuity-based anomaly scenes extracted using the proposed algorithm
5 Conclusion
The need for video-discontinuity-based identification has grown as a result of recent advances in the field of video analytics. Many strategies for detecting video discontinuity
have been investigated and classified. It has been found that not all discontinuity detection
algorithms are appropriate for every case. Some methodologies (low level feature based)
are useful for real applications because they are computationally simple and quick,
whereas others (high level feature based, User attention model based) are particularly
well suited for applications which require reliable and accurate data regardless of time
required to develop the summary. Each technique has advantages and disadvantages,
but it is clear that a technique that is independent of the application is required. Second,
there is a lack of standard evaluation approaches: earlier, user-provided judgments were used to assess automatically generated summaries; later, shot-reconstruction degree and accuracy measures were proposed and implemented.
References
1. Bescos, J., Cisneros, G., Martinez, J.M.: A unified model for techniques on video-shot
transition detection. IEEE Trans. Multimed. 7(2), 293–307 (2005)
2. Boccignone, G., Chianese, A., Moscato, V., Picariello, A.: Foveated shot detection for video
segmentation. IEEE Trans. Circ. Syst. Video Technol. 15(3), 365–377 (2005)
3. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recognition. Kluwer
Academic Publishers, Boston (1998)
4. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001)
5. Feng, J., Lo, K.-T., Mehrpour, H.: Scene change detection algorithm for MPEG video
sequence. In: Proceedings of the IEEE International Conference on Image Processing,
Lausanne, Switzerland, pp. 821–824 (1996)
6. Chittapur, G.B., Murali, S., Prabhakara, H.S., Anami, B.S.: Exposing digital forgery in video
by mean frame comparison techniques. In: Sridhar, V., Sheshadri, H., Padma, M. (eds.) Emerg-
ing Research in Electronics, Computer Science and Technology. LNEE, vol. 248, pp. 557–562.
Springer, New Delhi (2014). https://doi.org/10.1007/978-81-322-1157-0_57
7. Chittapur, G., Murali, S., Anami, B.S.: Forensic approach for region of copy-create video
forgery by applying frame similarity approach. In: 6th International Virtual Congress (IVC-
2019) (2019). www.isca.net.co 5th to 10th August 2019. Souvenir of IVC-2019 with ISBN
978-93-86675-55-2
8. Friedman, J.: Another approach to polychotomous classification. Technical report. Depart-
ment of Statistics, Stanford University (1996)
9. Gao, X., Tang, X.: Unsupervised video-shot segmentation and model-free anchorperson detec-
tion for news video story parsing. IEEE Trans. Circ. Syst. Video Technol. 12(9), 765–776
(2002)
10. Chittapur, G., Murali, S., Anami, B.: Tempo temporal forgery video detection using machine
learning approach. J. Inf. Assur. Secur. (JIAS) 15(4), 144–152 (2020). ISSN: 1554-1010
11. Chittapur, G., Murali, S., Anami, B.S.: Forensic approach for object elimination and frame
replication detection using noise based Gaussian classifier. Int. J. Comput. Eng. Res. Trends
(IJCERT) 7(03), 1–5 (2020). ISSN: 2349-7084
12. Knerr, S., Personnaz, L., Dreyfus, G.: Single-layer learning revisited: a stepwise procedure for building and training a neural network. In: Fogelman Soulié, F., Hérault, J. (eds.) Neurocomputing: Algorithms, Architectures and Applications, pp. 41–50. Springer, Heidelberg (1990). https://doi.org/10.1007/978-3-642-76153-9_5
13. Lelescu, D., Schonfeld, D.: Statistical sequential analysis for real-time video scene change
detection on compressed multimedia bitstream. IEEE Trans. Multimed. 5(1), 106–117 (2003)
14. Lo, C.-C., Wang, S.-J.: Video segmentation using a histogram-based fuzzy C-means clustering
algorithm. In: IEEE International Conference on Fuzzy Systems, pp. 920–923 (2001)
15. Chittapur, G., Murali, S., Anami, B.S.: Copy create video forgery detection techniques using
frame correlation difference by referring SVM classifier. Int. J. Comput. Eng. Res. Trends
(IJCERT) 6(12), 4–8 (2019). ISSN: 2349-7084
16. Chittapur, G., Murali, S., Anami, B.S.: Forensic approach for region of copy-create video
forgery by applying frame similarity approach. Res. J. Comput. Inf. Technol. Sci. (RJCITS)
7(2), 12–17 (2019). ISSN: 2320-6527
17. Qi, Y., Hauptmann, A., Liu, T.: Supervised classification for video shot segmentation. In:
ICME 2003, pp. II-689–II-692 (2003)
18. Chittapur, G., Murali, S., Anami, B.S.: Video forgery detection using motion extractor by
referring block matching algorithm. Int. J. Sci. Technol. Res. (IJSTR) 8(10), 3240–3243
(2019). ISSN: 2277-8616
19. Chittapur, G., Murali, S., Anami, B.S.: Digital doctoring detection techniques. Int. J. Adv.
Technol. Eng. Res. (IJATER) 4(3), 13–17 (2014). ISSN No: 2250-3536
20. TREC 2001. Videos in the 2001 TREC video retrieval test collection (2001)
21. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995). https://doi.org/10.1007/978-1-4757-2440-0
22. Weston, J., Watkins, C.: Multiclass support vector machines. Technical report. CSD-TR-98-
04, University of London, UK (1998)
23. Yeo, B., Liu, B.: Rapid scene analysis on compressed video. IEEE Trans. Circ. Syst. Video
Technol. 5(6), 533–544 (1995)
24. Zhang, H.J., Low, C.Y., Gong, Y.H., Smoliar, S.W.: Video parsing using compressed data.
In: Proceedings of the SPIE Conference on Image and Video Processing II, San Jose, CA,
pp. 142–149 (1994)
25. Zhang, H.J., Low, C.Y., Smoliar, S.W.: Video parsing and browsing using compressed data.
Multimed. Tools Appl. 1(1), 89–111 (1995)
26. Zhang, H.J., Kankanhalli, A., Smoliar, S.W.: Automatic partitioning of full-motion video.
Multimed. Syst. 1(1), 10–28 (1993)
27. http://www.grip.unina.it/download/prog/ForgedVideosDataset/Splicing/
28. Qadir, G., Yahaya, S., Ho, A.T.: Surrey university library for forensic analysis (sulfa) of video
content (2012)
29. https://sites.google.com/site/rewindpolimi/downloads/datasets/vid
30. Marra, F., Gragnaniello, D., Verdoliva, L., Poggi, G.: A full-image full-resolution end-to-
end-trainable CNN framework for image forgery detection. IEEE Access 8, 133488–133502
(2020). https://doi.org/10.1109/ACCESS.2020.3009877
31. Murali, S., Anami, B.S., Chittapur, G.B.: Detection of copy-create image forgery using lumi-
nance level techniques. In: 2011 Third National Conference on Computer Vision, Pattern
Recognition, Image Processing and Graphics, pp. 215–218 (2011). https://doi.org/10.1109/
NCVPRIPG.2011.53
32. Murali, S., Anami, B.S., Chittapur, G.B.: Detection of digital photo image forgery. In: 2012 IEEE International Conference on Advanced Communication Control and Computing Technologies (ICACCCT), pp. 120–124 (2012). https://doi.org/10.1109/ICACCCT.2012.6320754
33. Murali, S., Chittapur, G.B., Prabhakara, H.S.: Detection of digital photo image forgery using
copy-create techniques. In: Mohan, S., Suresh Kumar, S. (eds.) ICSIP 2012. LNEE, vol. 221,
pp. 281–290. Springer, India (2013). https://doi.org/10.1007/978-81-322-0997-3_26
34. Chen, S., Tan, S., Li, B., Huang, J.: Automatic detection of object-based forgery in advanced
video. IEEE Trans. Circ. Syst. Video Technol. 26(11), 2138–2151 (2016). https://doi.org/10.
1109/TCSVT.2015.2473436