Human activity recognition (HAR) technology, which analyzes data acquired from various types of sensing devices, including wearable sensors and vision sensors, is receiving considerable attention in the field of Artificial Intelligence (AI) driven healthcare systems. Recognized activities can support remote healthcare solutions by identifying particular movements and conditions such as falls, gait abnormalities, and breathing disorders. HAR-based healthcare systems can allow people to live more independent lifestyles while retaining the safety of being monitored in case more direct care is needed. Thanks to advances in machine learning, many machine learning methods have been employed in HAR systems for healthcare. However, this field still faces many technical challenges. Some, such as the limited amount of labeled data, are shared with other pattern recognition fields; others, such as noise introduced by sensor factors during data collection, are unique to sensor-based activity recognition in healthcare and require dedicated methods for real-life applications.
In this dissertation, we start with the challenges facing healthcare-oriented HAR systems and summarize the machine learning approaches that address them. To give an overview of HAR healthcare applications with wearable sensors, we cover the essential components of designing such systems, including sensor factors (e.g., type, number, and placement location), AI model selection (e.g., classical machine learning models versus deep learning models), and feature engineering.
Next, we present a new healthcare application of HAR, Early Mobility Activity (EMA) recognition for Intensive Care Unit (ICU) patients, to illustrate the system design of HAR applications for healthcare. We identify features that are insensitive to wearable sensor orientation and propose a segment voting process to improve model accuracy and stability.
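To make these two ideas concrete, the NumPy sketch below pairs an example of an orientation-insensitive feature (statistics of the acceleration magnitude, which is invariant to sensor rotation) with a simple majority-vote over per-segment predictions. The function names, the magnitude-based features, and the five-segment split are illustrative assumptions, not the exact design used in the dissertation.

```python
import numpy as np

def orientation_insensitive_features(acc_xyz):
    """Illustrative orientation-insensitive features: statistics of the
    acceleration magnitude, which does not change when the sensor is
    rotated on the body. acc_xyz has shape (T, 3)."""
    mag = np.linalg.norm(acc_xyz, axis=1)  # (T,) magnitude per sample
    return np.array([mag.mean(), mag.std(), mag.min(), mag.max()])

def segment_voting_predict(window, classify_segment, num_segments=5):
    """Split one sensor window into equal-length segments, classify each
    segment independently, and return the majority-vote label.
    `classify_segment` stands in for any trained per-segment classifier."""
    segments = np.array_split(window, num_segments, axis=0)
    votes = [classify_segment(seg) for seg in segments]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]
```

Voting over several short segments smooths out transient sensor noise in any single segment, which is the intuition behind the stability gain.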
We further apply state-of-the-art vision-sensor-based HAR approaches in healthcare. We present BWCNN, a healthcare system that lets patients with Amyotrophic Lateral Sclerosis (ALS) communicate with the outside world through eye blinks. The system uses a Convolutional Neural Network (CNN) to predict the state of the eyes (open or closed), from which the blinking pattern is recovered.
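As a rough illustration of the two stages, the PyTorch sketch below pairs a small per-frame eye-state classifier with a routine that turns the predicted open/closed sequence into blink events. The layer sizes, the 24x24 grayscale eye crops, and the function names are assumptions made for illustration, not BWCNN's exact architecture.

```python
import torch
import torch.nn as nn

class EyeStateCNN(nn.Module):
    """Tiny binary CNN: grayscale eye crop -> open/closed logits."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 6 * 6, 64), nn.ReLU(), nn.Linear(64, 2)
        )

    def forward(self, x):  # x: (B, 1, 24, 24) eye crops
        return self.classifier(self.features(x))

def blinks_from_states(states):
    """Turn a per-frame sequence of eye states (1 = open, 0 = closed)
    into blink events: maximal runs of consecutive closed frames."""
    blinks, start = [], None
    for t, s in enumerate(states):
        if s == 0 and start is None:
            start = t                      # blink begins
        elif s == 1 and start is not None:
            blinks.append((start, t - 1))  # blink ends
            start = None
    if start is not None:
        blinks.append((start, len(states) - 1))
    return blinks
```

The durations and gaps between the detected blink events are what a downstream component would map to communication intents.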
Then, we propose MASTAF, which learns efficiently from a few examples to address the limited number of labeled video samples in real-life HAR applications, a challenge shared with computer vision at large. MASTAF takes as input a general spatial and temporal video representation, e.g., one produced by a 2D CNN, a 3D CNN, or a video Transformer. To make the most of such representations, it then uses self- and cross-attention models to highlight the critical spatio-temporal regions, increasing the inter-class distance and decreasing the intra-class distance. Last, MASTAF applies a lightweight fusion network and a nearest-neighbor classifier to label each query video. We demonstrate that MASTAF improves the state-of-the-art performance on three few-shot HAR video benchmarks.
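The PyTorch sketch below conveys the flavor of the last two steps under stated assumptions: a single-head temporal self-attention that re-weights clip features (a simplification of MASTAF's self- and cross-attention models), followed by nearest-prototype classification of query embeddings, as in metric-based few-shot learning. Shapes and names are illustrative, not MASTAF's exact modules.

```python
import torch

class TemporalSelfAttention(torch.nn.Module):
    """Single-head self-attention over the temporal axis of a clip
    representation; a simplified stand-in for MASTAF's attention models."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, 3 * dim)

    def forward(self, x):  # x: (B, T, D) backbone features
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(
            q @ k.transpose(-2, -1) / x.size(-1) ** 0.5, dim=-1)
        return attn @ v    # re-weighted spatio-temporal features

def nearest_prototype_classify(support_emb, support_labels, query_emb):
    """Label each query clip by the nearest class prototype, i.e., the
    mean of the fused support embeddings of that class."""
    classes = support_labels.unique()
    prototypes = torch.stack(
        [support_emb[support_labels == c].mean(dim=0) for c in classes])
    dists = torch.cdist(query_emb, prototypes)  # (Q, N) Euclidean distances
    return classes[dists.argmin(dim=1)]
```

Because classification reduces to distances between embeddings, sharpening the embeddings with attention directly widens inter-class gaps while tightening intra-class clusters.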
Last, we present Multimodal Masked Autoencoders-Based One-Shot Learning (Mu-MAE), an advance in HAR with multimodal wearable sensors. To address the labor-intensive data collection and the reliance on external pretrained models, Mu-MAE introduces a synchronized masking strategy tailored to wearable sensors, coupled with a multimodal masked autoencoder architecture. This approach compels the networks to capture more meaningful spatiotemporal features, enabling effective self-supervised pretraining without additional data. Furthermore, Mu-MAE leverages the representations extracted by the multimodal masked autoencoders to enhance cross-attention fusion, which highlights critical spatiotemporal features across modalities while emphasizing differences between activity classes. In comprehensive evaluations on the MMAct one-shot classification dataset, Mu-MAE achieves up to 80.17% accuracy for five-way one-shot multimodal classification, establishing a state-of-the-art solution for HAR in healthcare applications.
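A minimal sketch of what a synchronized masking strategy could look like is given below, assuming all modalities are time-aligned to the same T steps and masked at the same temporal positions so that no modality can trivially reconstruct another from unmasked context. The 75% mask ratio, the dictionary layout, and the function name are assumptions for illustration, not Mu-MAE's published configuration.

```python
import torch

def synchronized_mask(batch, mask_ratio=0.75):
    """Apply one shared random temporal mask across all modalities.

    `batch` maps modality name -> tensor of shape (B, T, D_m), with all
    modalities assumed synchronized to the same T time steps. Returns the
    visible tokens per modality plus the boolean mask (True = masked)
    that the autoencoder must reconstruct during pretraining.
    """
    any_mod = next(iter(batch.values()))
    B, T = any_mod.shape[:2]
    num_keep = max(1, int(T * (1 - mask_ratio)))
    noise = torch.rand(B, T)                       # one noise tensor, shared
    keep_idx = noise.argsort(dim=1)[:, :num_keep]  # indices of visible steps
    mask = torch.ones(B, T, dtype=torch.bool)
    mask.scatter_(1, keep_idx, False)              # False = visible
    visible = {m: x[torch.arange(B)[:, None], keep_idx]
               for m, x in batch.items()}
    return visible, mask

# Example usage with two hypothetical wearable modalities:
# visible, mask = synchronized_mask(
#     {"acc": torch.randn(8, 100, 64), "gyro": torch.randn(8, 100, 64)})
```

Sharing one mask across modalities is what makes the pretext task demand genuinely multimodal spatiotemporal features rather than cross-modal copying.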