Proceedings of the Fourth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC)
IEEE Xplore Part Number:CFP20OSV-ART; ISBN: 978-1-7281-5464-0
Human Activity Identification using CNN
Neha Junagade
Department of E&TC Engg.
Vishwakarma Institute of Information Technology, Pune, India
nehajunagade94@gmail.com

Shailesh Kulkarni
Department of E&TC Engg.
Vishwakarma Institute of Information Technology, Pune, India
Shailesh.kulkarni@viit.ac.in
Abstract— Human activity recognition (HAR) is a field of study that deals with identifying, interpreting, and analyzing actions specific to the movement of human beings. Activity recognition is currently becoming a large field of innovative work, with an emphasis on advanced machine learning algorithms and on innovations that increase safety while decreasing the cost of monitoring, which helps in healthcare, child care, surveillance, sports, and keeping track of the behavioral patterns of human beings. This work aims to develop a system that recognizes activities such as sitting, standing, walking, sleeping, reading, and tilting using a CNN. It uses supervised learning, an ML task in which a function is trained to map inputs to outputs, i.e., the activity is recognized based on the activities defined/labeled in the data.

Keywords— CNN, HAR, ML.

I. INTRODUCTION

Human activity recognition (HAR) in computer vision is a challenging and interesting research topic. In this paper, a deep model based on a CNN is proposed in which human activities are recognized from the input data provided while training the model. A HAR model generally runs from collecting input through to a conclusion in which the currently performed activity is identified. HAR concepts are used to develop systems in many fields, such as health care, child care, and any application where keeping track of the behavioral pattern of human beings is required.

HAR mainly aims at observing and analyzing human activities in order to explain ongoing events. This was traditionally done by human operators, for example in monitoring a patient's health condition, child care, security, home care, and surveillance. When human operators track behavioral patterns, challenges such as preconceived notions, biased approaches, unconscious bias, or poor judgment may reduce the efficiency of the recognition. Around-the-clock operation is especially challenging, since multiple operators are required, which increases cost; in numerous instances it is not financially feasible. In most cases, HAR systems can improve the efficiency of observation and analysis. For example, with the help of sensory devices, HAR can keep track of daily activities and health condition, identify changes in a patient's behavioral pattern, and give notification in case of an emergency, which is especially useful for children and older people given the increasing need for and demand of care and support.

Convolutional Neural Networks (CNNs) are mainly used to classify, analyze, and cluster images by similarity; they can also perform object recognition within a frame, for example identifying people, human faces, and other objects. CNNs are neural networks that assume the inputs are images represented as pixels, i.e., 2-dimensional arrays of numbers between 0 and 255, which gives each image a digital representation that computers can work with. Building a CNN involves multiple operations, the first of which is convolution. The convolution operation works on two signals in 1D and on two images in 2D; it shows how one of the two functions modifies the other, and it detects features in images such as lines and edges.
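As an illustration of the convolution operation itself (not of the trained CNN filters), the following Python sketch convolves a grayscale frame with a simple 3x3 edge-detecting kernel; the frame here is random placeholder data.

# Illustrative 2-D convolution with a hand-chosen edge kernel (placeholder data).
import numpy as np
from scipy.signal import convolve2d

# Hypothetical grayscale frame: a 2-D array of pixel values between 0 and 255.
frame = np.random.randint(0, 256, size=(100, 100)).astype(np.float32)

# A Laplacian-style kernel that responds to lines and edges.
kernel = np.array([[0,  1, 0],
                   [1, -4, 1],
                   [0,  1, 0]], dtype=np.float32)

# mode="same" keeps the output the same size as the input frame.
edges = convolve2d(frame, kernel, mode="same", boundary="symm")
print(edges.shape)  # (100, 100)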
Once the network has learned a feature in an image, it can recognize that feature in any other part of the image. CNNs make use of filters, or feature detectors, to detect such features. When building a CNN, the neural network is first initialized with a Sequential model. A Convolution2D layer is then added to deal with the images, followed by a MaxPooling2D layer that adds pooling. Next, a Flatten layer converts the pooled feature maps into a single column, which is passed on to the fully connected part of the network; finally, Dense layers add the fully connected layers. Once the network is built, it is ready to be trained as per requirements.
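A minimal Keras sketch of this layer stack is shown below; the filter count, input shape, and layer sizes here are illustrative placeholders rather than the configuration used in our experiments, which is given in Section III.

# Minimal Keras layer stack: Sequential -> Conv2D -> MaxPooling2D -> Flatten -> Dense.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # Convolution layer: detects local features such as lines and edges.
    Conv2D(16, (3, 3), activation="relu", input_shape=(100, 100, 3)),
    # Pooling layer: reduces the spatial size of the feature maps.
    MaxPooling2D(pool_size=(2, 2)),
    # Flatten: converts the pooled feature maps into a single column.
    Flatten(),
    # Dense layers: the fully connected part of the network.
    Dense(128, activation="relu"),
    Dense(6, activation="softmax"),  # one output per activity class
])
model.summary()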
For training on human activity data, an SVM pipeline involves preprocessing steps such as filtering, feature extraction, and feature selection; the training images are then fitted against a hyperplane in order to predict the position of the human. Training a CNN is a comparatively quick process: the operations are applied in a cascade, convolving and filtering the poses while the weights are adjusted, and fully connected layers produce the final decision. SVM preprocessing, in contrast, involves several operations at individual scales, which makes it more complicated. Training an RNN is well suited to predicting human pose, because its recurrence assigns additional weights to every movement in the training images, which can increase accuracy and precision relative to a CNN. Considering testing time, the SVM algorithm's prediction for the
initial frame is quick but degrades on subsequent frames. The prediction rate of the CNN algorithm is faster than that of the SVM and gives accurate results on individual frames; in a live stream, however, some frames may be lost. Overall, the most accurate and fastest of these methods are RNNs, as the weight assignment for each individual pose makes them more efficient. Considering training time, an SVM takes an enormous amount of time to train on custom data and behaves unpredictably if not properly trained. A CNN requires moderate training time, and even a small dataset gives moderate prediction quality; using pre-trained weights such as MobileNet ensures the best accuracy. An RNN is the fastest among these techniques and the best in accuracy and precision, and pre-trained models such as InceptionV3 show the best results. A combination of a CNN for training and an RNN for re-training enhances the model and suits real-time detection.

II. LITERATURE SURVEY

Research on human activity recognition started as early as the 1990s, and the following decade saw a significant amount of work in visual surveillance for analyzing activities. Feature selection is also an active field of research in HAR, and many studies have examined different feature selection methods. According to the evaluation criterion, feature selection can be divided into three categories [1]: filter methods, wrapper methods, and embedded methods. From the perspective of computational cost and effectiveness, Zhang and Sawchuk [2] compared three different feature selection methods (Relief-F, SFC, and SFS) as the number of features increased from 5 to 110; the results showed that once 50 features are included, the classification errors taper off for all three methods.

Among these, Relief-F has the lowest computational cost, whereas SFS has the highest. When compared with the fast correlation-based filter and correlation-based feature selection in [3], Relief-F proved to be the best feature selection algorithm, as it could deal with incomplete and noisy data. Other feature selection methods include the trace ratio criterion [4], the oppositional-based binary kidney-inspired algorithm [5], and the Laplacian score of He et al. [6], whose performance was demonstrated on the Iris and PIE face datasets. In [10], Tripathi et al. present the state of the art, showing the overall progress of suspicious activity recognition in surveillance videos over the last decade. Paper [11] concentrates on the applications of activity recognition systems, and [12] proposes to classify human activities from raw data obtained from a set of inertial sensors using a CNN. Paper [13] presents CNNs (CNN-pf and CNN-pff), especially CNN-pff, for multimodal data. Paper [14] proposes a novel deep neural network for HAR that handles sequence measurements from different body-worn devices separately; three datasets, Opportunity, Pamap2, and an industrial dataset, are used to evaluate the architecture, and different network configurations are also evaluated. The results lead to the conclusion that applying convolutions per sensor channel and per body-worn device improves the performance of a convolutional neural network (CNN). In [15], a deep convolutional neural network that can operate directly on raw inputs is proposed for HAR; the high computational cost of kernel training is reduced with an efficient pre-training strategy to improve real-world applicability. The KTH database is used to test the proposed approach, and the achieved results compare favorably against state-of-the-art algorithms that use hand-designed features.

III. PROPOSED SYSTEM

The proposed system for human activity recognition (HAR) is based on supervised machine learning, where a function is trained to map input to output, i.e., the activity is recognized based on the activities defined/labeled in the input. Further reading on HAR can be found in [7][8][9]. The primary sequence of operations in the activity recognition is training, testing, and evaluation, with further sub-operations, as depicted in Figure 1. Training begins with frame extraction, where an input video is decomposed into frames and converted into a set of images that form the data set defining the multiple activities for the model. The preprocessing step resizes the images to 100x100 to ensure uniformity among the input data and then normalizes them.
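A sketch of this frame-extraction and preprocessing step with OpenCV is given below; the video file name is a placeholder.

# Extract frames from a training video, resize to 100x100, and normalize to [0, 1].
import cv2
import numpy as np

frames = []
cap = cv2.VideoCapture("training_video.mp4")  # placeholder path to the input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.resize(frame, (100, 100))            # uniform 100x100 frames
    frames.append(frame.astype(np.float32) / 255.0)  # normalization
cap.release()

x_train = np.array(frames)  # shape: (number_of_frames, 100, 100, 3)
print(x_train.shape)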
Further, the hyperparameters of the model are defined in the Keras library. Four intermediate neural network layers, with sizes 16, 32, 1024, and 512, are defined experimentally; the third layer is the densest, and the number of non-trainable parameters obtained while training the model is zero. The number of epochs is set experimentally to 500 to obtain maximum accuracy. For example, with epoch = 1 the model reaches an accuracy of 20.70%, with epoch = 2 it reaches 24.30%, and with epoch = 5 it reaches 27.12%; it was observed that accuracy increases with the number of epochs, and the epoch value helps decide the weights of the model during design. A batch size of 32 images, in line with most related work, is selected for the model, along with the number of color channels.
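Under these settings, building and training the network in Keras looks roughly as follows. Only the layer sizes (16, 32, 1024, 512), the 500 epochs, and the batch size of 32 come from the configuration above; the layer types, optimizer, and loss function are illustrative assumptions, and the placeholder arrays stand in for the preprocessed frames and their one-hot activity labels.

# Rough Keras training sketch using the reported layer sizes, epochs, and batch size.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(16, (3, 3), activation="relu", input_shape=(100, 100, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(1024, activation="relu"),  # densest intermediate layer
    Dense(512, activation="relu"),
    Dense(6, activation="softmax"),  # six activity classes
])

# Optimizer and loss are assumptions, not values reported here.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Placeholders for the preprocessed frames and one-hot labels from the previous step.
x_train = np.zeros((32, 100, 100, 3), dtype="float32")
y_train = np.zeros((32, 6), dtype="float32")
history = model.fit(x_train, y_train, epochs=500, batch_size=32, verbose=0)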
With the parameters defined, the neural network is trained on the human activity recognition data. The input video used to train the model is 40 s long, and 952 frames are extracted from it. The number of samples used to train the classifier for each class is as follows: 149 frames for the Reading activity, 192 for Sitting, 93 for Sleeping, 191 for Standing, 46 for Tilting, and 281 for Walking, and the video used for testing has a sample size of 1000 frames. While testing, the trained model is first loaded together with the video in which activities are to be recognized. All the activities are sequentially labeled in the training model, and the features extracted from the current frame are compared with the features available in the trained model. The coordinates in the frame and the color of the text are defined for the activity recognition label displayed on the output frame.
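A sketch of this testing loop is shown below; the model and video file names are placeholders, and the label order is assumed to match the order used during training.

# Classify each test frame with the trained model and draw the predicted label on it.
import cv2
import numpy as np
from tensorflow.keras.models import load_model

labels = ["Reading", "Sitting", "Sleeping", "Standing", "Tilting", "Walking"]  # assumed order
model = load_model("har_model.h5")        # placeholder path to the saved weights
cap = cv2.VideoCapture("test_video.mp4")  # placeholder path to the test video

while True:
    ok, frame = cap.read()
    if not ok:
        break
    x = cv2.resize(frame, (100, 100)).astype(np.float32) / 255.0
    probs = model.predict(x[np.newaxis, ...], verbose=0)[0]
    activity = labels[int(np.argmax(probs))]
    # Fixed coordinates and text color for the output label.
    cv2.putText(frame, activity, (20, 40), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    cv2.imshow("HAR", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()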
The model is evaluated for its accuracy using the confusion matrix. The existing model is then modified and re-trained until it best fits the requirements, such as the desired accuracy and cost.
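One way to build such a confusion matrix is sketched below with scikit-learn; the y_true and y_pred lists are tiny hypothetical examples standing in for the per-frame ground-truth and predicted activity labels.

# Build a confusion matrix from per-frame labels (example data only).
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = ["Walking", "Walking", "Sleeping", "Reading", "Tilting"]   # ground truth per frame
y_pred = ["Walking", "Standing", "Sleeping", "Reading", "Tilting"]  # model predictions

labels = ["Reading", "Sleeping", "Sitting", "Standing", "Tilting", "Walking"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
print("Accuracy:", accuracy_score(y_true, y_pred))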
Figure 1. Block diagram of the proposed system

The following steps were carried out for the end-to-end operation of the proposed system:

Step 1) Explore the dataset and create the training data set as required.

Step 2) Extract frames from the videos.

Step 3) Preprocess these frames and then train a model on the frames in the training set: a) read all the frames extracted earlier as the training images; b) define the architecture of the model; c) finally, train the model and save its weights.

Step 4) Test the model on the frames in the validation set: a) define the model architecture and load the weights; b) create the test data; c) make predictions for the test videos.

Step 5) Finally, evaluate the model. Depending on its performance, the trained model can be used to classify new videos, or the required changes can be made.

IV. RESULTS

Input video of activities such as Walking, Reading, Sitting, Sleeping, Standing, and Tilting is fed to the system during training, and all the activities are recognized during testing. The classified results are shown in Figure 2.

Figure 2. Activity detection using the proposed model

Figure 2 shows the activities recognized while testing the model, namely Walking, Reading, Sitting, Sleeping, Standing, and Tilting, labeled a, b, c, d, e, and f, respectively. The performance of the proposed model is evaluated with the help of a confusion matrix, which helps visualize accuracy together with intra- and inter-class variability for the training and test outputs under cross-validation.

Analysis of the results on the sample test sequence reveals the efficacy of the proposed model for activities such as sleeping and tilting, as depicted in Table I. Walking and reading, however, need special attention, either in the training set or through adjustment of the number of hidden layers.

TABLE I. CONFUSION MATRIX OF RESULTS FOR ACTIVITIES.

             Reading  Sleeping  Sitting  Standing  Tilting  Walking
  Walking        0        0        0       107        0       891
  Reading      782        0       98         0        0         0
  Sleeping       0      747        3         0        0         0
  Standing       0        0        0       487        0        13
  Sitting        8        0      312         0        0         0
  Tilting        0        0        0         8      242         0

TABLE II. CONFUSION MATRIX TABLE.

                                          Predicted
                               Detected               Non-Detected
                               (Condition Positive)   (Condition Negative)
  Actual  Detected
          (Condition Positive)      747 [TP]                 3 [FN]
          Non-Detected
          (Condition Negative)        8 [FP]               242 [TN]

where P = Positive, N = Negative, TP = True Positive, FP = False Positive, TN = True Negative, and FN = False Negative.
A final accuracy of 98.90%, a precision of 98.94%, and a true positive rate (sensitivity) of 99.60% are achieved.

Number of data set files taken: 1000, with Positive (P) = 750 and Negative (N) = 250. The manual calculations are given below:

  Correct positive rate (Sensitivity) = TP / (TP + FN) = 747 / (747 + 3) = 99.60%
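As a quick cross-check, the reported accuracy, precision, and sensitivity can be recomputed from the Table II counts; standard definitions of accuracy and precision are assumed here.

# Recompute the reported metrics from the Table II counts.
TP, FN, FP, TN = 747, 3, 8, 242  # P = TP + FN = 750, N = FP + TN = 250

accuracy = (TP + TN) / (TP + FN + FP + TN)  # 0.9890 -> 98.90%
precision = TP / (TP + FP)                  # 0.9894 -> 98.94%
sensitivity = TP / (TP + FN)                # 0.9960 -> 99.60%
print(f"{accuracy:.2%}  {precision:.2%}  {sensitivity:.2%}")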
Table III summarizes the performance of the experimented methods in terms of accuracy, precision, and false negatives; Figure 3 depicts the same comparison in graphical form for quick reference.

TABLE III. PERFORMANCE COMPARISON WITH EXISTING METHODS.

  Sr. No   Method            Accuracy   Precision   False negative
  1        SVM               92.50%     93.57%      3.00%
  2        RCNN              97.92%     96.38%      2.70%
  3        Proposed method   98.90%     98.94%      0.40%

Figure 3. Graphical representation of the comparison with existing methods

V. CONCLUSION

This paper has presented the experimental results of CNN, SVM, and RCNN models for HAR. This research work has concentrated on the training, testing, and evaluation of multiple activities. The accuracy obtained is 98.90% for a sample data set, evaluated using a confusion matrix, whereas the accuracies obtained for the SVM and RCNN models are 92.50% and 97.92%, respectively. The CNN model can automatically learn the required features from the raw input data to make accurate predictions, and new datasets or videos can be adopted quickly and cheaply with the same model. The model is mainly suited to predicting Sleeping and Tilting; it has scope for improvement on the Walking and Reading activities, which can be achieved with more advanced neural network designs or small modifications to the model.

VI. REFERENCES

[1] G. Chandrashekar and F. Sahin, A Survey on Feature Selection Methods, Pergamon Press, Inc., Oxford, UK, 2014.
[2] M. Zhang and A. A. Sawchuk, "A feature selection-based framework for human activity recognition using wearable multimodal sensors," in Proceedings of the ACM Conference on Ubiquitous Computing, pp. 1036–1043, Beijing, China, September 2011.
[3] I. Kononenko, "Analysis and extension of RELIEF," in Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Catania, Italy, April 1994.
[4] F. Nie, S. Xiang, Y. Jia, C. Zhang, and S. Yan, "Trace ratio criterion for feature selection," in Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pp. 671–676, Chicago, IL, USA, July 2008.
[5] M. K. Taqi and R. Ali, "OBKA-FS: an oppositional-based binary kidney-inspired search algorithm for feature selection," Journal of Theoretical and Applied Information Technology, vol. 95, no. 1, 2017.
[6] X. He, D. Cai, and P. Niyogi, "Laplacian score for feature selection," in Proceedings of the International Conference on Neural Information Processing Systems, pp. 507–514, MIT Press, Vancouver, Canada, December 2005.
[7] M. A. Alsheikh, A. Selim, D. Niyato, L. Doyle, S. Lin, and H. P. Tan, "Deep activity recognition models with triaxial accelerometers," CoRR abs/1511.04664, 2015. http://arxiv.org/abs/1511.04664
[8] G. Kalouris, E. I. Zacharaki, and V. Megalooikonomou, "Improving CNN-based activity recognition by data augmentation and transfer learning."
[9] S. Zhang, Z. Wei, J. Nie, L. Huang, S. Wang, and Z. Li, "A Review on Human Activity Recognition Using Vision-Based Method," Journal of Healthcare Engineering, Hindawi, vol. 2017.
[10] R. K. Tripathi, A. S. Jalal, and S. C. Agrawal, "Suspicious human activity recognition: a review," Artificial Intelligence Review, vol. 50, no. 2, pp. 283–339, 2017. doi:10.1007/s10462-017-9545-7
[11] S. Ranasinghe, F. Al Machot, and H. C. Mayr, "A review on applications of activity recognition systems with regard to performance and evaluation," International Journal of Distributed Sensor Networks, vol. 12, no. 8, 2016.
[12] A. Bevilacqua, K. MacDonald, A. Rangarej, V. Widjaya, B. Caulfield, and T. Kechadi, "Human Activity Recognition with Convolutional Neural Networks," Lecture Notes in Computer Science, pp. 541–552, 2019. doi:10.1007/978-3-030-10997-4_33
[13] S. Ha and S. Choi, "Convolutional neural networks for human activity recognition using multiple accelerometer and gyroscope sensors," 2016 International Joint Conference on Neural Networks (IJCNN), 2016. doi:10.1109/ijcnn.2016.7727224
[14] F. Moya Rueda, R. Grzeszick, G. A. Fink, S. Feldhorst, and M. ten Hompel, "Convolutional Neural Networks for Human Activity Recognition Using Body-Worn Sensors."
[15] C. Geng and J. X. Song, "Human Action Recognition based on Convolutional Neural Networks with a Convolutional Auto-Encoder," 5th International Conference on Computer Sciences and Automation Engineering (ICCSAE 2015).