Fall Detection
Abstract—Automatic fall detection is a vital technology for ensuring the health and safety of people. Home-based camera systems for fall detection often put people's privacy at risk. Thermal cameras can partially or fully obfuscate facial features, thus preserving a person's privacy. Another challenge is that falls occur far less often than normal activities of daily living; because of this class imbalance, learning algorithms are hard to train. To handle these problems, we formulate fall detection as anomaly detection within an adversarial framework using thermal imaging. We present a novel adversarial network that comprises two-channel 3D convolutional autoencoders, which reconstruct the thermal data and the optical flow input sequences, respectively. We introduce a technique to track the region of interest, a region-based difference constraint, and a joint discriminator to compute the reconstruction error. A larger reconstruction error indicates the occurrence of a fall. Experiments on a publicly available thermal fall dataset show superior results compared to the standard baseline.

Index Terms—Fall detection, adversarial learning, thermal imaging

I. INTRODUCTION

Automatic detection of falls is important due to the possibility of severe injury, the high cost to the health system, and the psychological effect of a fall. However, due to their rarity of occurrence, traditional supervised machine learning classifiers are ill-posed for this problem [1]. There are also challenges in collecting realistic fall data, as doing so can put people's lives in danger [1]. Therefore, in many realistic settings, there may be few or no fall samples available during training. Due to these skewed data situations, we cast fall detection as an anomaly detection problem [2]. In this setting, a classifier is trained only on normal activities; during testing, both normal and fall samples are presented to the classifier.

Another challenge in video-based fall detection is preserving the person's privacy, which traditional RGB cameras cannot provide [3]. Thus, detecting falls in videos without explicitly knowing a person's identity is important for the usability of such systems in the real world. Thermal imaging can partially or fully obfuscate a person's identity and has been used in other fall detection applications [2]–[4].

Most recent works have focused on reconstruction-based networks for fall detection using autoencoders [2] and adversarial learning [5]. The adversarial learning framework presents a unique opportunity to train a network that not only mimics the normal activities through a generator but also discriminates them from abnormal events through a discriminator ([5], [6]). For video-based anomaly detection, the generator is normally some variant of autoencoder and the discriminator is a feed-forward neural network, both of which are trained in an adversarial manner. The most successful previous works have mostly focused on learning spatio-temporal features by using 3D Convolutional Autoencoders (3DCAE) and 3D Convolutional Neural Networks (3DCNN) [5].

The performance of video-based fall detection may be marred by differences in background. This may become more prominent with thermal cameras, where the intensities may change due to differences in heat (e.g., when a person enters the scene). Therefore, it is important to focus on the region around the person. The relative motion of the person and objects around them can also provide useful information to detect falls. Region- and motion-based methods ([7], [8]) have shown superior performance in action recognition tasks. Therefore, we hypothesize that learning spatio-temporal features utilizing region and motion awareness in video sequences would improve the detection of falls when trained in an adversarial manner. To this end, we propose a motion and region aware adversarial framework that consists of two separate channels optimized jointly. The first channel takes as input a thermal video sequence (with the extracted region of interest) and the second channel takes the corresponding optical flow. The outputs from both channels are combined to give a discriminative score for adversarial learning. We assume that joint training of the thermal and optical flow channels can facilitate the learning of both motion- and region-based discriminatory features.

II. RELATED WORK

There is scarce literature on detecting falls in videos in an adversarial manner with thermal cameras. We now review studies that closely match our work.

Fall Detection: With progress in economical camera sensors, there are several works [9]–[11] that use RGB cameras for data capture. One major limitation of RGB sensors is the lack of privacy, as the identity of the subject is not preserved. To overcome this limitation, Vadivelu et al. [3] presented one of the first fall detection works on thermal data. Further, Nogas et al. [4] proposed using thermal cameras and a recurrent Convolutional AutoEncoder (CAE) for fall detection. Motivated by these works, we also use the thermal camera modality in our experiments. The reader is also pointed to recent surveys [1], [12] on fall detection for more insight into the different techniques proposed in the literature. Most recent works on fall detection using thermal and depth cameras formulate it as anomaly detection.
Anomaly Detection: Given the rare nature of fall events, we follow works in abnormal event detection that are conceptually similar to ours. Many recent anomaly detection methods ([13], [14]) are based on the one-class classification paradigm, in which the distribution of normal events is learned using autoencoders and deviations from the learned distribution are detected as anomalies at test time. Hasan et al. [13] learnt the normal motion patterns in videos using hand-crafted features and a CAE. Ravanbakhsh et al. [14] proposed a video-to-flow and vice-versa generation adversarial approach for abnormal event detection. Khan et al. [15] proposed the use of a 3DCAE for abnormal event detection applied to fall detection. Sabokrou et al. [16] proposed an end-to-end adversarial network consisting of a generator that reconstructs the input with added noise and a discriminator that distinguishes the reconstructed output from the actual input. Further, Khan et al. [5] extended the work of Sabokrou et al. [16] from single images to sequences of images for fall detection using a spatio-temporal adversarial learning framework.

As a fall is a spatio-temporal change in a subject's pose, a limitation of that work is that motion information is not explicitly added to the network. We build upon the adversarial learning work of Khan et al. [5] and propose a two-channel network, with one channel explicitly learning the motion in the form of optical flow while the other takes raw video frames as input. Our proposed approach can also handle situations where a person is not present in a frame, which may reduce the false positive rate.

III. METHODS

Our proposed adversarial framework consists of two channels. The input to the first channel is a window of thermal frames and the input to the second channel is a window of optical flow frames. Each channel consists of (i) a 3DCAE to reconstruct the input window and (ii) a 3DCNN to discriminate it from the original window of frames, where both channels are joined by a single neuron as a joint discriminator (Fig. 1). We train this framework using only Activities of Daily Living (ADL) from thermal frames. We perform person tracking and extract the Region of Interest (ROI) from the thermal and optical flow frames for motion- and region-based reconstruction.

Fig. 1. The proposed adversarial network: the top channel takes a window of thermal frames and the bottom channel takes a window of optical flow frames as input.

A. Adversarial Framework

1) 3DCAE-3DCNN: Our 3DCAE architecture is similar to that of Khan et al. [5]. We extend their network by adding a channel that takes optical flow as input (see Table I). We use 3D filters of 3 × 3 with a temporal depth of 5 in all layers of the 3DCAE, the same as Khan et al. [5]. The operations are the same in the flow 3DCAE except for the second deconvolution layer, which uses filters of 2 × 2 with a temporal depth of 4, to reconstruct the temporal depth of odd length.

                              TABLE I
                   CONFIGURATION OF THE 3DCAE

           |  Thermal 3DCAE               |  Flow 3DCAE
  Input    |  (8, 64, 64, 1)              |  (7, 64, 64, 1)
  Encoder  |  3D Conv - (8, 64, 64, 16)   |  3D Conv - (7, 64, 64, 16)
           |  3D Conv - (8, 32, 32, 8)    |  3D Conv - (7, 32, 32, 8)
           |  3D Conv - (4, 16, 16, 8)    |  3D Conv - (4, 16, 16, 8)
           |  3D Conv - (2, 8, 8, 8)      |  3D Conv - (2, 8, 8, 8)
  Decoder  |  3D Deconv - (4, 16, 16, 8)  |  3D Deconv - (4, 16, 16, 8)
           |  3D Deconv - (8, 32, 32, 8)  |  3D Deconv - (7, 32, 32, 8)
           |  3D Deconv - (8, 64, 64, 16) |  3D Deconv - (7, 64, 64, 16)
           |  3D Deconv - (8, 64, 64, 1)  |  3D Deconv - (7, 64, 64, 1)

The architecture of the 3DCNN is the same as the encoder of the 3DCAE, followed by a single neuron with a sigmoid function that outputs the probability of a sequence of frames being original or reconstructed. Batch normalization is used in all layers of the 3D discriminator except for the input layer. LeakyReLU activation is used in all hidden layers, with the negative slope coefficient set to 0.2 ([5], [17]).
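To make the configuration in Table I concrete, the following minimal Keras sketch builds the thermal-channel 3DCAE with matching layer output shapes; the padding choice and the hidden/output activations are our assumptions, as they are not stated here.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_thermal_3dcae():
    # Output shapes follow the "Thermal 3DCAE" column of Table I: (T, H, W, C).
    # Kernels are 3x3 spatially with a temporal depth of 5, as stated above.
    k = (5, 3, 3)
    return models.Sequential([
        layers.InputLayer(input_shape=(8, 64, 64, 1)),
        # Encoder
        layers.Conv3D(16, k, strides=(1, 1, 1), padding='same', activation='relu'),   # (8, 64, 64, 16)
        layers.Conv3D(8, k, strides=(1, 2, 2), padding='same', activation='relu'),    # (8, 32, 32, 8)
        layers.Conv3D(8, k, strides=(2, 2, 2), padding='same', activation='relu'),    # (4, 16, 16, 8)
        layers.Conv3D(8, k, strides=(2, 2, 2), padding='same', activation='relu'),    # (2, 8, 8, 8)
        # Decoder
        layers.Conv3DTranspose(8, k, strides=(2, 2, 2), padding='same', activation='relu'),    # (4, 16, 16, 8)
        layers.Conv3DTranspose(8, k, strides=(2, 2, 2), padding='same', activation='relu'),    # (8, 32, 32, 8)
        layers.Conv3DTranspose(16, k, strides=(1, 2, 2), padding='same', activation='relu'),   # (8, 64, 64, 16)
        layers.Conv3DTranspose(1, k, strides=(1, 1, 1), padding='same', activation='sigmoid'), # (8, 64, 64, 1)
    ])

The flow channel is analogous, with a temporal length of 7 and the modified second deconvolution layer described above.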
2) Adversarial Learning: In this section, we first explain the general adversarial training described in the work of Khan et al. [5]. This model consists of a 3DCAE (represented as R) that takes an input sequence I of window size T and reconstructs it; the output sequence, named O, is then fed to fool a 3DCNN (represented as D). R and D are trained using the standard GAN loss:

    L_{R+D} = E_{I∼p}[log D(I)] + E_{O∼p}[log(1 − D(O))]    (1)

We combine the adversarial loss with the Mean Squared Error (MSE) loss, which is used only for R and defined as

    L_R = E[(I − O)^2]    (2)

The total loss function to minimize for R is defined as:

    L = L_{R+D} + λ L_R    (3)

where λ is a positive hyperparameter for the weighted loss. The notations used for the thermal and flow networks are (I_T, O_T, R_T, D_T) and (I_F, O_F, R_F, D_F), respectively.
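As an illustration of how Eqs. (1)–(3) translate into an alternating training step, consider the following TensorFlow sketch; R, D, the optimizers, and the batching are placeholders for the actual implementation.

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
mse = tf.keras.losses.MeanSquaredError()

def train_step(R, D, I, lam, r_opt, d_opt):
    # Discriminator update (Eq. 1): real windows -> 1, reconstructions -> 0.
    with tf.GradientTape() as tape:
        O = R(I, training=True)
        d_real = D(I, training=True)
        d_fake = D(O, training=True)
        d_loss = (bce(tf.ones_like(d_real), d_real) +
                  bce(tf.zeros_like(d_fake), d_fake))
    grads = tape.gradient(d_loss, D.trainable_variables)
    d_opt.apply_gradients(zip(grads, D.trainable_variables))

    # Reconstructor update (Eqs. 2-3): fool D while minimizing the MSE,
    # weighted by the positive hyperparameter lambda.
    with tf.GradientTape() as tape:
        O = R(I, training=True)
        d_fake = D(O, training=True)
        r_loss = bce(tf.ones_like(d_fake), d_fake) + lam * mse(I, O)
    grads = tape.gradient(r_loss, R.trainable_variables)
    r_opt.apply_gradients(zip(grads, R.trainable_variables))
    return d_loss, r_loss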
B. ROI Extraction

The performance of video-based fall detection methods may be impacted by background artifacts. This situation can get worse with a thermal camera, because changes in heat can alter the background and the pixel intensity of frames in a video sequence [2]. Therefore, we reconstruct only the region where the person is present, which is not affected much by changes in background objects and intensity. We perform person tracking using an object detector and image processing techniques to localize the person in an image.

1) Person Detection: To the best of our knowledge, there are no publicly available pre-trained deep learning-based models for person detection specifically for thermal images. To this end, we used a Region-based Fully Convolutional Network (R-FCN) [18] trained on the COCO dataset [19]. As the Thermal Simulated Fall (TSF) dataset contains only one subject per frame, the bounding box with the highest confidence score is selected. There are no false proposals by the detector; however, the localized bounding box is found to fluctuate in size and position, which degrades the prediction by the tracking method.
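The selection step can be sketched as follows, assuming the detector output is exposed as a list of (box, class_name, score) tuples; the wrapper and the score threshold are illustrative assumptions.

def select_person_box(detections, score_thr=0.5):
    # Keep only 'person' detections above an (assumed) confidence threshold
    # and return the single highest-scoring bounding box, or None.
    persons = [(box, score) for box, cls, score in detections
               if cls == 'person' and score >= score_thr]
    if not persons:
        return None
    return max(persons, key=lambda p: p[1])[0]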
2) Contour Box Localization: In thermal images, a person may appear brighter than the background due to differences in the heat emitted by a person and by objects. Therefore, Otsu thresholding [20] is applied to the thermal image to separate the dark background, as shown in Fig. 2. The thresholded image may still contain bright background objects. We find the contours [21] on the thresholded image after applying morphological operations and select the biggest contour on the basis of its enclosed area. The smallest box containing that contour blob is chosen as a candidate for the person bounding box.

Fig. 2. Contour box localization process (Section III-B2).
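This localization step maps naturally onto standard OpenCV calls; a minimal sketch follows, in which the morphological kernel size is an assumption, since it is not specified here.

import cv2
import numpy as np

def contour_box(thermal_gray):
    # Expects an 8-bit grayscale thermal frame. Otsu thresholding [20]
    # separates the dark background from the warmer (brighter) person.
    _, mask = cv2.threshold(thermal_gray, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Morphological opening/closing to suppress small bright artifacts.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    # Keep the biggest contour [21] by enclosed area; its tight bounding
    # box is the candidate person box.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return (x, y, x + w, y + h)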
3) Tracking: We apply Kalman filtering on the top-left and bottom-right coordinates of the bounding box under a constant velocity assumption. The tracker is initialized with the person detector and predicts the bounding box for the next frame. We compare the predicted box with the person detection bounding box (if detected in the next frame) to check whether the tracker drifts. We use a counter (age) to track the number of consecutive tracker predictions without detection. In the case of no detection, the tracker's age is increased, and when the age of the tracker exceeds a limit of 20, the tracker is stopped. The Intersection over Union (IoU) is used in many tracking methods to match bounding boxes. However, IoU is small when one box is large compared to the other, which can happen due to bad localization by the detector. Therefore, we also use other criteria, such as the ratio of areas and a subset check (sketched below). At a particular instant, there can be at most three possible candidates for the person localization: the Detect, Contour, and Track boxes. The Detect box confirms the presence of the person, but it does not fit the person in most cases. Therefore, the Contour box and Track box are used to improve the overall tracking. Algorithm 1 describes the whole tracking method.
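One plausible form of this matching test, combining IoU with the subset and area-ratio criteria, is the following; the thresholds are illustrative assumptions, not tuned values.

def area(b):
    # Boxes are (x1, y1, x2, y2).
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    return inter / float(area(a) + area(b) - inter + 1e-8)

def boxes_match(a, b, iou_thr=0.3, ratio_thr=0.5):
    if iou(a, b) >= iou_thr:
        return True
    # IoU is small when one box is much larger than the other, so also accept
    # the pair when the smaller box lies inside the larger one and the ratio
    # of their areas is not degenerate.
    small, big = (a, b) if area(a) <= area(b) else (b, a)
    contained = (small[0] >= big[0] and small[1] >= big[1] and
                 small[2] <= big[2] and small[3] <= big[3])
    return contained and area(small) / float(area(big) + 1e-8) >= ratio_thr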
4) Region-based Reconstruction: We remove the frames in which a person is not localized after tracking, and the rest of the frames are masked by their corresponding bounding box (named the ROI mask). For region-based reconstruction, the 3DCAE is fed with the window of masked frames and a region-based reconstruction loss L_ROI (instead of L_R) is used:

    L_{ROI} = E_{ROI}[(ROI(I) − ROI(O))^2]    (4)

where ROI(X) represents the masking of frames in window X with the corresponding ROI masks, and the expectation is taken over pixels inside the ROI.
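A sketch of the loss in Eq. (4), assuming binary ROI masks M of the same shape as the input and output windows:

import tensorflow as tf

def roi_mse(I, O, M):
    # I, O: (batch, T, H, W, 1) windows; M: binary ROI masks of the same
    # shape. The squared error is averaged only over pixels inside the ROI.
    sq = tf.square(I * M - O * M)
    return tf.reduce_sum(sq) / (tf.reduce_sum(M) + 1e-8)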
C. Motion Constraint and Reconstruction

Besides the appearance-based constraint on R, we incorporate motion into the fall detection system in two ways:

1) Difference Constraint: Mathieu et al. [22] compute difference and gradient-based losses for future frame prediction, which increase the sharpness of the predicted frame. We adapt a similar technique to add an additional loss term to the thermal 3DCAE, based on the MSE of the difference frames of I and O. A difference frame is a residual map computed by subtracting two consecutive frames. We mask the difference frames by their respective ROI, which is the union of the ROIs of the two frames used to compute the difference frame. The difference loss is defined as:

    L_{Diff} = E_{ROI}[(ROI(DF(I)) − ROI(DF(O)))^2]    (5)

where DF(X) represents the difference frames for the window X. Therefore, the final loss for R in (3), with L_ROI and L_Diff, is defined as:

    L = L_{R+D} + λ_S L_{ROI} + λ_D L_{Diff}    (6)
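The loss in Eq. (5) admits the same kind of sketch; here M_union holds, for each consecutive frame pair, the union of the two frames' ROI masks:

import tensorflow as tf

def diff_mse(I, O, M_union):
    # Difference frames: residual maps between consecutive frames in time.
    dI = I[:, 1:] - I[:, :-1]
    dO = O[:, 1:] - O[:, :-1]
    # M_union: (batch, T-1, H, W, 1) masks, each the union of the ROIs of
    # the two frames used to compute the corresponding difference frame.
    sq = tf.square(dI * M_union - dO * M_union)
    return tf.reduce_sum(sq) / (tf.reduce_sum(M_union) + 1e-8)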
Algorithm 1: Tracking Algorithm

  Input : Frame
  Output: FinalBox
  FinalBox = None;
  DetectBox = Detector.GetLocalization(Frame);
  if DetectBox is not None then
      ContourBox = GetBiggestContourBox(Frame);
      if Tracker is not None then
          TrackBox = Tracker.GetCurrentBox();
          if TrackBox matches with DetectBox then
              DetectBox = BoxSelection(DetectBox, TrackBox);
          ...

2) Flow Reconstruction: Optical flow captures the motion between consecutive frames in a thermal video. Therefore, we train a spatio-temporal adversarial network (R_F, D_F) for flow reconstruction that takes as input a window of optical flow frames. We compute the dense optical flow between every two consecutive frames [24]. We stack the flow in the x and y directions and the magnitude to form a 3-dimensional image (similar to [25]).

The flow images are masked with their ROI to remove noise due to temperature variations. As defined earlier for the difference frame, the ROI for a flow image is the union of the ROIs of the two thermal frames used to compute the optical flow.
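A sketch of the flow-image construction with OpenCV's Farneback method [24]; the Farneback parameters shown are common defaults rather than values confirmed here.

import cv2
import numpy as np

def flow_image(prev_gray, next_gray):
    # Dense optical flow between two consecutive thermal frames [24].
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5,
                                        poly_sigma=1.2, flags=0)
    # Stack x-flow, y-flow, and the magnitude into a 3-channel image
    # (similar to [25]).
    mag = np.linalg.norm(flow, axis=2)
    return np.dstack([flow[..., 0], flow[..., 1], mag])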
Fig. 5. Qualitative analysis: the middle frame of the input and output window reconstructed by different models. The top row images [(a)-(e)] and bottom row images [(f)-(j)] are the inputs and corresponding outputs of the thermal data channel and the flow magnitude channel, respectively: (f) 3DCAE flow input, (g) 3DCAE flow output, (h) ROI-3DCAE flow input, (i) ROI-3DCAE flow output, (j) ROI-3DCAE masked flow output.
A. Network Implementations

We train two adversarial models, one each for thermal window (Thermal-3DCAE) and optical flow window (Flow-3DCAE) reconstruction. For region-based reconstruction, we train two adversarial models on thermal data: one with ROI masking and the ROI loss (Eq. (4)) described earlier, and the other with the addition of the difference constraint in the region-based reconstruction (named Thermal-ROI-3DCAE and Thermal-Diff-ROI-3DCAE, respectively). We train one adversarial model for optical flow with region-based reconstruction, named Flow-ROI-3DCAE. For the fusion networks, we train two models: one without the difference constraint (Fusion-ROI-3DCAE) and one with it (Fusion-Diff-ROI-3DCAE).

We use the SGD optimizer with a learning rate of 0.0002 for the 3DCNN discriminator and the Adadelta optimizer for the 3DCAE in all the adversarial models. All the models are trained for 300 epochs. The hyperparameters (λ's) used for the weighted losses (Eqs. (3), (6) and (7)) are varied among three values (0.1, 1 and 10). We found that large values of these hyperparameters led to mode collapse. The best hyperparameter setting for Thermal-Diff-ROI-3DCAE and Fusion-Diff-ROI-3DCAE has all the constants equal to 1, whereas the best setting for the rest of the models has all the constants equal to 0.1. The full code of our implementation is available at https://github.com/ivineetm007/Fall-detection.
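In Keras terms, the optimizer setup above amounts to the following sketch; apart from the stated SGD learning rate, the remaining hyperparameters are assumed to be library defaults.

import tensorflow as tf

# SGD (learning rate 0.0002) for the 3DCNN discriminator and Adadelta for
# the 3DCAE reconstructor, as described above; other settings are defaults.
d_opt = tf.keras.optimizers.SGD(learning_rate=0.0002)
r_opt = tf.keras.optimizers.Adadelta()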
B. Evaluation Metrics

For assessing the performance of detecting falls as anomalies, the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves is used. The latter is used to focus specifically on the detection of the minority 'fall' class. We compute the anomaly scores at the frame and window level, each of which comes in two types: mean and standard deviation anomaly scores. We calculate and compare the AUC of the ROC and PR curves using all these scores.
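These metrics can be sketched with scikit-learn as follows, using average_precision_score as the usual summary of the PR curve; the per-frame error array, labels, and window length T are placeholders, and the window-level AUCs additionally depend on the tolerance described in Section VI.

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def frame_level_auc(errors, labels):
    # errors: NumPy array of per-frame reconstruction errors (anomaly
    # scores); labels: 1 for fall frames, 0 for normal ADL frames.
    return roc_auc_score(labels, errors), average_precision_score(labels, errors)

def window_scores(errors, T=8):
    # Window-level anomaly scores over sliding windows of length T:
    # the mean (W_mu) and standard deviation (W_sigma) of the frame errors.
    idx = range(len(errors) - T + 1)
    w_mu = np.array([errors[i:i + T].mean() for i in idx])
    w_sigma = np.array([errors[i:i + T].std() for i in idx])
    return w_mu, w_sigma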
Fig. 7. Plots of AUC values of ROC and PR curves computed using window level anomaly scores with the variation in tolerance: Wµ (left) and Wσ (right).
C. Ablation Studies

1) 3DCAE: The thermal input and the output reconstructed by Thermal-3DCAE can be seen in Fig. 5. The basic techniques to utilize the ROI in deep learning models are resizing and ROI pooling. We also trained two different Thermal-3DCAE models by changing the thermal input: (1) resizing the input ROI to 64x64, and (2) ROI pooling to 64x64 dimensions. We observe that these techniques increase the false positives, and their results are not reported. We argue that resizing the ROI leads to geometric distortions and introduces false motion on the borders even if the subject is not moving.

Optical Flow: As described earlier, optical flow images contain noise due to temperature variation, due to which the reconstruction quality is also noisy (Fig. 5(g)).

2) ROI-3DCAE: The outputs of Thermal-ROI-3DCAE and Flow-ROI-3DCAE are shown in Fig. 5. We observe that the region-based method improves the reconstruction quality in the ROI region, as the model learns to reconstruct only the ROI (see Fig. 5 (c), (d), and (e)). Similar behaviour is observed for Flow-ROI-3DCAE (see Fig. 5 (h), (i) and (j)).

3) Difference constraint: To understand the impact of the difference constraint, we compare the results computed at the window level (see Fig. 7) for the AUC of ROC and PR. Comparing the Diff score and the ROI score of Thermal-Diff-ROI-3DCAE, we found that the Diff score gives better results than the ROI score in all the plots, which suggests that the Diff score is more suitable for window-level analysis. We also compared the results of Thermal-ROI-3DCAE with the Diff score of Thermal-Diff-ROI-3DCAE and found that the addition of the constraint increases the AUC for both the ROC and PR curves. Similar behaviour was observed in the comparison of Fusion-ROI-3DCAE and Fusion-Diff-ROI-3DCAE. This suggests that the difference constraint makes the model more discriminative in the temporal direction and increases the overall performance.

4) Fusion models: To understand the effect of fusion, we first compare the results of Fusion-ROI-3DCAE with the ROI-based models at the frame level, as shown in Table II; we found that there are minor improvements in the frame-level results. Comparing the window-level results (Fig. 7) of Thermal-ROI-3DCAE and Flow-ROI-3DCAE with the Thermal-ROI score and Flow-ROI score of Fusion-ROI-3DCAE, respectively, we observe a small increment in the results by the ROI score of thermal and a substantial increment in the results by the ROI score of flow, which indicates that joint learning improved the learning of the flow reconstructor.

Qualitative Analysis: We use the frame-based anomaly scores to visualize the performance of the proposed method (see Fig. 6). Although our model is able to detect fall events, we observed a false peak when the person enters the room, where our tracking method misses some frames.

Fig. 6. Frame level anomaly score for a fall video.

                              TABLE II
    AUC OF ROC AND PR BASED FRAME LEVEL ANOMALY COMPARISON

  Method                                     |  ROC        |  PR
                                             |  Cµ    Cσ   |  Cµ    Cσ
  Thermal-3DCAE                              |  0.88  0.90 |  0.47  0.48
  Thermal-ROI-3DCAE                          |  0.89  0.92 |  0.55  0.57
  Thermal-Diff-ROI-3DCAE (ROI score)         |  0.90  0.92 |  0.57  0.56
  Fusion-ROI-3DCAE (Thermal ROI score)       |  0.90  0.93 |  0.56  0.58
  Fusion-Diff-ROI-3DCAE (Thermal ROI score)  |  0.90  0.93 |  0.57  0.57

VI. RESULTS

The AUC values of the ROC and PR curves computed using frame-level anomaly scores are shown in Table II. For window-level results, we plot the AUC values with the variation of tolerance (α) from 1 to 8 (see Fig. 7). The frame-level and window-level results can be summarized as:
1) There is an improvement in the AUC of the ROC and PR curves with region-based reconstruction, which confirms the importance of region awareness.
2) The addition of the difference constraint increases the AUC values using window-level scores, which indicates its importance for the learning of spatio-temporal autoencoders.
3) Fusion models lead to an increase in performance.

Comparison with the existing methods: In previous works using DSTCAE-C3D [2], Conv-LSTM AE [4], and 3DCAE-3DCNN [5], AUC values of ROC computed using the frame-level anomaly score are reported (see Table III, 'All frames' column).
                              TABLE III
  COMPARISON WITH THE PREVIOUS METHODS BASED ON AUC OF ROC AND PR
          CURVE CALCULATED ON FRAME LEVEL ANOMALY SCORES

  Method                 |  All frames |       Tracked frames
                         |     ROC     |     ROC     |     PR
                         |  Cµ    Cσ   |  Cµ    Cσ   |  Cµ    Cσ
  Conv-LSTM AE [4]       |  0.76  0.83 |  0.63  0.73 |  0.26  0.37
  DSTCAE-C3D [2]         |  0.93  0.97 |  0.85  0.90 |  0.46  0.53
  3DCAE-3DCNN [5]        |  0.95  0.95 |  0.90  0.88 |  0.47  0.48
  Fusion-Diff-ROI-3DCAE  |   —     —   |  0.90  0.93 |  0.57  0.57

The previous methods do not perform person tracking in the video, due to which the number of frames used for training and testing differs. Therefore, we train and test these methods using the frames on which a person is localized by our tracking method. The comparison of these methods with the proposed model on tracked frames only is shown in Table III and is summarized as:
1) The AUC values of the previous methods decrease when only tracked frames are used. Furthermore, the empty frames in the train and test set videos are similar; low reconstruction error on these frames may give high AUC values for the previous methods (Table III, 'All frames') during testing.
2) The proposed method achieves similar or better AUC of ROC than the previous methods and higher AUC of PR against all previous methods. The method focuses on the region of the frame where a person is present; therefore, it can facilitate the learning of background-agnostic models.

VII. CONCLUSION AND FUTURE WORK

Fall detection is a non-trivial problem due to the large imbalance in the data; thus, we formulate it as an anomaly detection problem. In this setting, we train the model only on normal ADL and predict whether a test sample is normal ADL or a fall. Building upon the advantages of the adversarial learning paradigm, we present a two-channel adversarial learning framework that learns spatio-temporal features from the extracted ROI and its generated optical flow, followed by a joint discriminator. We note that the introduction of the person ROI and the difference loss function increases the performance. The major improvement in comparison to previous methods is the increase in the AUC of the PR curve. The optical flow information is also useful in the network, and the fused method performs better than raw thermal analysis alone. In the future, we plan to extend the proposed techniques to detect falls using multiple camera modalities, including depth and IP cameras.
REFERENCES

[1] S. S. Khan and J. Hoey, "Review of fall detection techniques: A data availability perspective," Medical Engineering & Physics, vol. 39, pp. 12–22, 2017.
[2] J. Nogas, S. S. Khan, and A. Mihailidis, "DeepFall: Non-invasive fall detection with deep spatio-temporal convolutional autoencoders," Journal of Healthcare Informatics Research, vol. 4, no. 1, pp. 50–70, 2020.
[3] S. Vadivelu, S. Ganesan, O. R. Murthy, and A. Dhall, "Thermal imaging based elderly fall detection," in Asian Conference on Computer Vision. Springer, 2016, pp. 541–553.
[4] J. Nogas, S. S. Khan, and A. Mihailidis, "Fall detection from thermal camera using convolutional LSTM autoencoder," in Proceedings of the 2nd Workshop on Aging, Rehabilitation and Independent Assisted Living, IJCAI Workshop, 2018.
[5] S. S. Khan, J. Nogas, and A. Mihailidis, "Spatio-temporal adversarial learning for detecting unseen falls," Pattern Analysis and Applications, pp. 1–11, 2020.
[6] T. Schlegl, P. Seeböck, S. M. Waldstein, G. Langs, and U. Schmidt-Erfurth, "f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks," Medical Image Analysis, vol. 54, pp. 30–44, 2019.
[7] D. Li, Z. Qiu, Q. Dai, T. Yao, and T. Mei, "Recurrent tubelet proposal and recognition networks for action detection," in Proceedings of the European Conference on Computer Vision, 2018, pp. 303–318.
[8] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
[9] G. Baldewijns, G. Debard, G. Mertes, B. Vanrumste, and T. Croonenborghs, "Bridging the gap between real-life data and simulated data by providing a highly realistic fall dataset for evaluating camera-based fall detection algorithms," Healthcare Technology Letters, vol. 3, no. 1, pp. 6–11, 2016.
[10] K. Sehairi, F. Chouireb, and J. Meunier, "Elderly fall detection system based on multiple shape features and motion analysis," in 2018 International Conference on Intelligent Systems and Computer Vision (ISCV). IEEE, 2018, pp. 1–8.
[11] S. Ezatzadeh and M. R. Keyvanpour, "ViFa: An analytical framework for vision-based fall detection in a surveillance environment," Multimedia Tools and Applications, vol. 78, no. 18, pp. 25515–25537, 2019.
[12] L. Ren and Y. Peng, "Research of fall detection and fall prevention technologies: A systematic review," IEEE Access, vol. 7, pp. 77702–77722, 2019.
[13] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, "Learning temporal regularity in video sequences," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 733–742.
[14] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. Regazzoni, and N. Sebe, "Abnormal event detection in videos using generative adversarial nets," in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 1577–1581.
[15] S. S. Khan, M. E. Karg, D. Kulić, and J. Hoey, "Detecting falls with X-Factor hidden Markov models," Applied Soft Computing, vol. 55, pp. 168–177, 2017.
[16] M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli, "Adversarially learned one-class classifier for novelty detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3379–3388.
[17] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
[18] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," in Advances in Neural Information Processing Systems, 2016, pp. 379–387.
[19] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[20] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Transactions on Systems, Man, and Cybernetics, pp. 62–66, 1979.
[21] S. Suzuki et al., "Topological structural analysis of digitized binary images by border following," Computer Vision, Graphics, and Image Processing, pp. 32–46, 1985.
[22] M. Mathieu, C. Couprie, and Y. LeCun, "Deep multi-scale video prediction beyond mean square error," arXiv preprint arXiv:1511.05440, 2015.
[23] W. Liu, W. Luo, D. Lian, and S. Gao, "Future frame prediction for anomaly detection - a new baseline," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6536–6545.
[24] G. Farnebäck, "Two-frame motion estimation based on polynomial expansion," in Scandinavian Conference on Image Analysis. Springer, 2003, pp. 363–370.
[25] G. Gkioxari and J. Malik, "Finding action tubes," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.