Fused Audio Instance and Representation for Respiratory Disease Detection
Abstract
1. Introduction
- We demonstrate that cough, breath, and speech sounds can be leveraged to detect COVID-19 in a multi-instance audio classification approach based on self-attention fusion. Our experiments indicate that combining multiple audio instances outperforms single-instance baselines.
- We show experimentally that an audio-based classifier benefits from combining waveform and spectrogram representations of the input signals: feeding this time- and frequency-domain dual representation into the network yields a richer latent feature space and ultimately improves classification performance.
- We integrate the above contributions into the FAIR approach, a method that fuses multiple body sound instances in both waveform and spectrogram representations to classify COVID-19-negative and -positive individuals (a minimal sketch of the fusion unit follows this list). FAIR is a general concept that can be applied to other sound classification tasks, such as those related to other respiratory diseases.
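To make the fusion idea concrete, below is a minimal sketch of a FAIR-style fusion unit and classifier. All module sizes, layer counts, and the mean-pooling step are illustrative assumptions, not the paper's exact architecture; the input embeddings stand in for features produced by pretrained wav2vec (waveform) and DeiT-S/16 (spectrogram) extractors, which are not loaded here.

```python
# Sketch of a FAIR-style fusion unit (assumptions: embedding sizes, layer
# counts, and pooling are illustrative). Each body sound contributes two
# "tokens": a waveform embedding (e.g., from wav2vec) and a spectrogram
# embedding (e.g., from DeiT-S/16); self-attention then fuses all tokens.
import torch
import torch.nn as nn

class FairFusionClassifier(nn.Module):
    def __init__(self, wave_dim=768, spec_dim=384, d_model=256, n_heads=4):
        super().__init__()
        # Project both representations into a shared latent space.
        self.wave_proj = nn.Linear(wave_dim, d_model)
        self.spec_proj = nn.Linear(spec_dim, d_model)
        # Self-attention fusion over the set of instance tokens.
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Binary head: one logit for negative/positive COVID-19.
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, wave_feats, spec_feats):
        # wave_feats: (batch, n_instances, wave_dim), e.g., cough/breath/speech
        # spec_feats: (batch, n_instances, spec_dim)
        tokens = torch.cat(
            [self.wave_proj(wave_feats), self.spec_proj(spec_feats)], dim=1)
        fused = self.fusion(tokens)      # (batch, 2 * n_instances, d_model)
        pooled = fused.mean(dim=1)       # permutation-invariant pooling
        return self.classifier(pooled)   # raw logit for BCE-with-logits

# Example: three body sound instances (cough, breath, speech), batch of 2.
model = FairFusionClassifier()
logits = model(torch.randn(2, 3, 768), torch.randn(2, 3, 384))
print(logits.shape)  # torch.Size([2, 1])
```

Because the tokens form a set rather than a sequence, the self-attention fusion is agnostic to the number and order of body sound instances, which is what allows the same unit to handle any subset of cough, breath, and speech.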
2. Related Work
3. Methods
3.1. Feature Extractors
3.2. Fusion Unit and Classifier
4. Experiment
4.1. Dataset
4.2. Data Preprocessing and Augmentation
4.3. Baseline and Benchmark Experiments
4.4. Cross-Validation
4.5. Hyperparameters
4.6. Training
4.7. Evaluation
5. Results
5.1. Baseline Results
5.2. Benchmark Results
6. Discussion
7. Challenges and Limitations
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
MLP | Multilayer perceptron
ViT | Vision transformer
COVID-19 | Coronavirus disease 2019
STFT | Short-time Fourier transform
SARS-CoV-2 | Severe acute respiratory syndrome coronavirus 2
MFCC | Mel frequency cepstral coefficients
ZCR | Zero-crossing rate
KNN | K-nearest neighbors
SVM | Support vector machine
COPD | Chronic obstructive pulmonary disease
RNN | Recurrent neural network
LSTM | Long short-term memory
FAIR | Fused audio instance and representation
ROC | Receiver operating characteristic
AUC | Area under the ROC curve
Appendix A. Additional Experimental Results
Appendix A.1. Self-Attention Fusion with Only Waveform Inputs
Feature extractor: wav2vec (BE1)

Body Sound | AUC | Sensitivity | Specificity | AUPRC | Precision | F1 | Accuracy
---|---|---|---|---|---|---|---
Speech | 0.7562 ± 0.0152 | 0.3557 ± 0.0409 | 0.7592 ± 0.0586 | 0.3794 ± 0.0824 | 0.7028 ± 0.0689 | 0.4685 ± 0.0282 | 0.7504 ± 0.0460
Cough + Breath | 0.6739 ± 0.0435 | 0.2694 ± 0.0363 | 0.7200 ± 0.0524 | 0.1583 ± 0.0181 | 0.7199 ± 0.0584 | 0.3904 ± 0.0453 | 0.6469 ± 0.0669
Cough + Speech | 0.7644 ± 0.0088 | 0.3922 ± 0.0771 | 0.7906 ± 0.0937 | 0.4218 ± 0.0283 | 0.6628 ± 0.1056 | 0.4799 ± 0.0413 | 0.7708 ± 0.0729
Breath + Speech | 0.7682 ± 0.0149 | 0.3747 ± 0.0675 | 0.6743 ± 0.0966 | 0.4526 ± 0.0167 | 0.6744 ± 0.1082 | 0.4705 ± 0.0343 | 0.7593 ± 0.0692
Cough + Breath + Speech | 0.7717 ± 0.0128 | 0.3358 ± 0.0347 | 0.7236 ± 0.0669 | 0.3991 ± 0.0409 | 0.7460 ± 0.0757 | 0.4594 ± 0.0235 | 0.7265 ± 0.0514
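For reference, the metrics reported in this and the following appendix tables can be computed from per-subject scores and labels as in the sketch below (scikit-learn; the 0.5 decision threshold is an assumption and need not match the operating point used in the paper).

```python
# Sketch of the reported metrics from scores and binary labels
# (the 0.5 decision threshold is an assumption).
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             confusion_matrix, f1_score, accuracy_score)

def report(y_true, y_score, threshold=0.5):
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "AUC": roc_auc_score(y_true, y_score),
        "Sensitivity": tp / (tp + fn),     # recall on the positive class
        "Specificity": tn / (tn + fp),
        "AUPRC": average_precision_score(y_true, y_score),
        "Precision": tp / (tp + fp) if tp + fp else 0.0,
        "F1": f1_score(y_true, y_pred),
        "Accuracy": accuracy_score(y_true, y_pred),
    }

print(report([0, 0, 1, 1], [0.1, 0.6, 0.4, 0.9]))
```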
Appendix A.2. Self-Attention Fusion with Only Spectrogram Inputs
Feature extractor: DeiT-S/16 (BE2)

Body Sound | AUC | Sensitivity | Specificity | AUPRC | Precision | F1 | Accuracy
---|---|---|---|---|---|---|---
Speech | 0.8081 ± 0.0239 | 0.7486 ± 0.0775 | 0.7717 ± 0.0711 | 0.5584 ± 0.0183 | 0.3865 ± 0.0598 | 0.5043 ± 0.0435 | 0.7681 ± 0.0570
Cough + Breath | 0.7685 ± 0.0183 | 0.6400 ± 0.0642 | 0.8293 ± 0.0718 | 0.4749 ± 0.0680 | 0.4283 ± 0.0839 | 0.5043 ± 0.0443 | 0.8000 ± 0.0572
Cough + Speech | 0.8315 ± 0.0306 | 0.7371 ± 0.0836 | 0.7927 ± 0.0892 | 0.5728 ± 0.0558 | 0.4229 ± 0.1148 | 0.5251 ± 0.0802 | 0.7841 ± 0.0756
Breath + Speech | 0.8122 ± 0.0125 | 0.6571 ± 0.0313 | 0.8796 ± 0.0298 | 0.5879 ± 0.0615 | 0.5077 ± 0.0623 | 0.5699 ± 0.0339 | 0.8451 ± 0.0250
Cough + Breath + Speech | 0.8241 ± 0.0266 | 0.6914 ± 0.0796 | 0.8408 ± 0.0838 | 0.6159 ± 0.0174 | 0.4741 ± 0.1064 | 0.5502 ± 0.0658 | 0.8177 ± 0.0696
Counting (fast + normal) (*) | 0.7467 ± 0.0124 | 0.6629 ± 0.0946 | 0.7790 ± 0.0774 | 0.4456 ± 0.0368 | 0.2611 ± 0.0603 | 0.3860 ± 0.0553 | 0.5849 ± 0.1302
Phoneme (/a/-/e/-/o/) (*) | 0.7806 ± 0.0208 | 0.7886 ± 0.0100 | 0.6827 ± 0.0753 | 0.4311 ± 0.0258 | 0.2705 ± 0.0637 | 0.3956 ± 0.0681 | 0.5752 ± 0.2151
Feature extractor: ResNet50

Body Sound | AUC | Sensitivity | Specificity
---|---|---|---
Speech | 0.7531 ± 0.0362 | 0.7314 ± 0.0983 | 0.6817 ± 0.0818
Cough + Breath | 0.7585 ± 0.0259 | 0.6400 ± 0.0859 | 0.8188 ± 0.0832
Cough + Speech | 0.7817 ± 0.0282 | 0.8000 ± 0.1352 | 0.6628 ± 0.0992
Breath + Speech | 0.7862 ± 0.0238 | 0.7314 ± 0.0878 | 0.7466 ± 0.1058
Cough + Breath + Speech | 0.8026 ± 0.0229 | 0.6914 ± 0.1120 | 0.7959 ± 0.1175
Appendix A.3. FAIR
Feature extractors: DeiT-S/16 & wav2vec (BE3)

Body Sound | AUC | Sensitivity | Specificity | AUPRC | Precision | F1 | Accuracy
---|---|---|---|---|---|---|---
Speech | 0.8434 ± 0.0290 | 0.7429 ± 0.0767 | 0.8356 ± 0.0266 | 0.5566 ± 0.0371 | 0.4551 ± 0.0235 | 0.5619 ± 0.0234 | 0.8212 ± 0.0146
Cough + Breath | 0.7585 ± 0.0174 | 0.6629 ± 0.0874 | 0.8168 ± 0.0754 | 0.4971 ± 0.0698 | 0.4199 ± 0.0806 | 0.5030 ± 0.0370 | 0.7222 ± 0.1352
Cough + Speech | 0.8584 ± 0.0308 | 0.8171 ± 0.1063 | 0.7738 ± 0.0977 | 0.6016 ± 0.0648 | 0.4205 ± 0.0836 | 0.5447 ± 0.0588 | 0.7805 ± 0.0768
Breath + Speech | 0.8319 ± 0.0187 | 0.7771 ± 0.0554 | 0.7895 ± 0.0644 | 0.6330 ± 0.0529 | 0.4164 ± 0.0690 | 0.5365 ± 0.5455 | 0.7876 ± 0.5455
Cough + Breath + Speech | 0.8658 ± 0.0115 | 0.8057 ± 0.0554 | 0.7958 ± 0.0678 | 0.6383 ± 0.0255 | 0.4352 ± 0.0796 | 0.5584 ± 0.0506 | 0.7974 ± 0.0546
Counting (fast + normal) (*) | 0.7702 ± 0.0313 | 0.7086 ± 0.0836 | 0.7717 ± 0.0470 | 0.5009 ± 0.0347 | 0.2851 ± 0.0626 | 0.4000 ± 0.0549 | 0.6221 ± 0.1796
Phoneme (/a/-/e/-/o/) (*) | 0.7906 ± 0.0095 | 0.7886 ± 0.0530 | 0.6848 ± 0.0499 | 0.4743 ± 0.0417 | 0.3544 ± 0.1242 | 0.4429 ± 0.0567 | 0.6805 ± 0.1248
Appendix A.4. Remarks Regarding the AUPRC
Appendix B. Ablation Study
Appendix B.1. Feature Extractors
Feature extractors: 1D-CNN4 (columns 2–4) and ResNet50 (columns 5–7)

Body Sound | AUC | Sensitivity | Specificity | AUC | Sensitivity | Specificity
---|---|---|---|---|---|---
Cough-heavy | 0.6396 ± 0.0839 | 0.8800 ± 0.0662 | 0.4042 ± 0.0555 | 0.6855 ± 0.0607 | 0.6571 ± 0.0767 | 0.7025 ± 0.0531
Breath-deep | 0.6559 ± 0.0355 | 0.5886 ± 0.0690 | 0.7194 ± 0.1241 | 0.7387 ± 0.0244 | 0.5829 ± 0.0878 | 0.8545 ± 0.0561
Counting-fast | 0.7162 ± 0.0400 | 0.5943 ± 0.1134 | 0.7885 ± 0.0829 | 0.7162 ± 0.0400 | 0.5943 ± 0.1134 | 0.7885 ± 0.0829
Counting-normal | 0.6519 ± 0.0071 | 0.5143 ± 0.0866 | 0.7665 ± 0.0957 | 0.7082 ± 0.0395 | 0.5371 ± 0.0911 | 0.8188 ± 0.0570
Phoneme /a/ | 0.7014 ± 0.0387 | 0.6057 ± 0.1321 | 0.7560 ± 0.1331 | 0.7014 ± 0.0387 | 0.6057 ± 0.1321 | 0.7560 ± 0.1331
Phoneme /e/ | 0.6588 ± 0.0627 | 0.6571 ± 0.1743 | 0.6461 ± 0.1314 | 0.6588 ± 0.0627 | 0.6571 ± 0.1743 | 0.6461 ± 0.1314
Phoneme /o/ | 0.6327 ± 0.0145 | 0.6400 ± 0.0736 | 0.6345 ± 0.0880 | 0.7004 ± 0.0785 | 0.8057 ± 0.0874 | 0.5780 ± 0.1591
Appendix B.2. Fusion Unit
Feature extractors: 1D-CNN4 (columns 2–4) and ResNet50 (columns 5–7)

Body Sound | AUC | Sensitivity | Specificity | AUC | Sensitivity | Specificity
---|---|---|---|---|---|---
Speech | 0.7235 ± 0.0052 | 0.5543 ± 0.0464 | 0.8335 ± 0.0399 | 0.7531 ± 0.0362 | 0.7314 ± 0.0983 | 0.6817 ± 0.0818
Cough + Breath | 0.6900 ± 0.0145 | 0.7886 ± 0.1273 | 0.5278 ± 0.1150 | 0.7585 ± 0.0259 | 0.6400 ± 0.0859 | 0.8188 ± 0.0832
Cough + Speech | 0.7362 ± 0.0081 | 0.6229 ± 0.0732 | 0.7770 ± 0.0651 | 0.7817 ± 0.0282 | 0.8000 ± 0.1352 | 0.6628 ± 0.0992
Breath + Speech | 0.7351 ± 0.0060 | 0.5943 ± 0.0214 | 0.8492 ± 0.0332 | 0.7862 ± 0.0238 | 0.7314 ± 0.0878 | 0.7466 ± 0.1058
Cough + Breath + Speech | 0.7596 ± 0.0122 | 0.6914 ± 0.0709 | 0.7539 ± 0.0668 | 0.8026 ± 0.0229 | 0.6914 ± 0.1120 | 0.7959 ± 0.1175
Feature extractors: DeiT-S/16 & wav2vec. Fusion rules: attention-weighted pooling (columns 2–4) vs. self-attention (columns 5–7)

Body Sound | AUC | Sensitivity | Specificity | AUC | Sensitivity | Specificity
---|---|---|---|---|---|---
Speech | 0.8161 ± 0.0238 | 0.8172 ± 0.1166 | 0.6932 ± 0.1125 | 0.8434 ± 0.0290 | 0.7429 ± 0.0767 | 0.8356 ± 0.0266
Cough + Breath | 0.7865 ± 0.0173 | 0.6514 ± 0.0911 | 0.8544 ± 0.0673 | 0.7585 ± 0.0174 | 0.6629 ± 0.0874 | 0.8168 ± 0.0754
Cough + Speech | 0.8267 ± 0.0102 | 0.8628 ± 0.0457 | 0.6911 ± 0.0525 | 0.8584 ± 0.0308 | 0.8171 ± 0.1063 | 0.7738 ± 0.0977
Breath + Speech | 0.8197 ± 0.0317 | 0.7257 ± 0.0690 | 0.8168 ± 0.0638 | 0.8319 ± 0.0187 | 0.7771 ± 0.0554 | 0.7895 ± 0.0644
Cough + Breath + Speech | 0.8313 ± 0.0176 | 0.6743 ± 0.0428 | 0.0867 ± 0.0278 | 0.8658 ± 0.0115 | 0.8057 ± 0.0554 | 0.7958 ± 0.0678
Counting (fast + normal) (*) | 0.7756 ± 0.0434 | 0.7086 ± 0.0911 | 0.7833 ± 0.0512 | 0.7702 ± 0.0313 | 0.7086 ± 0.0836 | 0.7717 ± 0.0470
Phoneme (/a/-/e/-/o/) (*) | 0.7863 ± 0.0211 | 0.7486 ± 0.0754 | 0.7068 ± 0.0797 | 0.7906 ± 0.0095 | 0.7886 ± 0.0530 | 0.6848 ± 0.0499
Appendix C. Model Complexity
Input Body Sounds | Number of Trainable Parameters
---|---
Speech | 22.07 M
Cough + Breath | 21.97 M
Cough + Speech | 22.10 M
Breath + Speech | 22.10 M
Cough + Breath + Speech | 22.14 M
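The counts above follow from summing the sizes of all trainable tensors in the model; a short sketch (any `torch.nn.Module` works in place of the toy layer used for the example):

```python
import torch.nn as nn

def count_trainable(model: nn.Module) -> int:
    # Sum the element counts of all parameters that receive gradients.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"{count_trainable(nn.Linear(384, 256)) / 1e6:.2f} M")  # 0.10 M
```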
References
- De Meyer, M.M.; Jacquet, W.; Vanderveken, O.M.; Marks, L.A. Systematic review of the different aspects of primary snoring. Sleep Med. Rev. 2019, 45, 88–94.
- Sarkar, M.; Madabhavi, I.; Niranjan, N.; Dogra, M. Auscultation of the respiratory system. Ann. Thorac. Med. 2015, 10, 158.
- Song, I. Diagnosis of pneumonia from sounds collected using low cost cell phones. In Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, 12–17 July 2015; pp. 1–8.
- Laguarta, J.; Hueto, F.; Subirana, B. COVID-19 Artificial Intelligence Diagnosis Using Only Cough Recordings. IEEE Open J. Eng. Med. Biol. 2020, 1, 275–281.
- Botha, G.H.R.; Theron, G.; Warren, R.M.; Klopper, M.; Dheda, K.; van Helden, P.D.; Niesler, T.R. Detection of tuberculosis by automatic cough sound analysis. Physiol. Meas. 2018, 39, 045005.
- Altan, G.; Kutlu, Y.; Allahverdi, N. Deep Learning on Computerized Analysis of Chronic Obstructive Pulmonary Disease. IEEE J. Biomed. Health Inform. 2020, 24, 1344–1350.
- Zhang, H.; Song, C.; Wang, A.; Xu, C.; Li, D.; Xu, W. PDVocal: Towards Privacy-preserving Parkinson's Disease Detection using Non-speech Body Sounds. In Proceedings of the 25th Annual International Conference on Mobile Computing and Networking, Los Cabos, Mexico, 21–25 October 2019; pp. 1–16.
- Kalkbrenner, C.; Eichenlaub, M.; Rüdiger, S.; Kropf-Sanchen, C.; Rottbauer, W.; Brucher, R. Apnea and heart rate detection from tracheal body sounds for the diagnosis of sleep-related breathing disorders. Med. Biol. Eng. Comput. 2018, 56, 671–681.
- Astuti, I.; Ysrafil. Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2): An overview of viral structure and host response. Diabetes Metab. Syndr. Clin. Res. Rev. 2020, 14, 407–412.
- Scheiblauer, H.; Filomena, A.; Nitsche, A.; Puyskens, A.; Corman, V.M.; Drosten, C.; Zwirglmaier, K.; Lange, C.; Emmerich, P.; Müller, M.; et al. Comparative sensitivity evaluation for 122 CE-marked rapid diagnostic tests for SARS-CoV-2 antigen, Germany, September 2020 to April 2021. Eurosurveillance 2021, 26, 2100441.
- Huang, Y.; Meng, S.; Zhang, Y.; Wu, S.; Zhang, Y.; Zhang, Y.; Ye, Y.; Wei, Q.; Zhao, N.; Jiang, J.; et al. The respiratory sound features of COVID-19 patients fill gaps between clinical data and screening methods. medRxiv 2020.
- Al Ismail, M.; Deshmukh, S.; Singh, R. Detection of Covid-19 Through the Analysis of Vocal Fold Oscillations. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 1035–1039.
- Shimon, C.; Shafat, G.; Dangoor, I.; Ben-Shitrit, A. Artificial intelligence enabled preliminary diagnosis for COVID-19 from voice cues and questionnaires. J. Acoust. Soc. Am. 2021, 149, 1120–1124.
- Suppakitjanusant, P.; Sungkanuparph, S.; Wongsinin, T.; Virapongsiri, S.; Kasemkosin, N.; Chailurkit, L.; Ongphiphadhanakul, B. Identifying individuals with recent COVID-19 through voice classification using deep learning. Sci. Rep. 2021, 11, 19149.
- Pahar, M.; Klopper, M.; Reeve, B.; Warren, R.; Theron, G.; Niesler, T. Automatic cough classification for tuberculosis screening in a real-world environment. Physiol. Meas. 2021, 42, 105014.
- Xu, X.; Nemati, E.; Vatanparvar, K.; Nathan, V.; Ahmed, T.; Rahman, M.M.; McCaffrey, D.; Kuang, J.; Gao, J.A. Listen2Cough: Leveraging End-to-End Deep Learning Cough Detection Model to Enhance Lung Health Assessment Using Passively Sensed Audio. In Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies; ACM: New York, NY, USA, 2021; Volume 5, pp. 1–22.
- Khanaghavalle, G.; Rahul, G.; Senajith, S.; Vishnuvasan, T.; Keerthana, S. Chronic Obstructive Pulmonary Disease Severity Classification Using Lung Sound. In Proceedings of the 2024 10th International Conference on Communication and Signal Processing (ICCSP), Melmaruvathur, India, 12–14 April 2024; pp. 428–432.
- Luo, K.; Yang, G.; Li, Y.; Lan, S.; Wang, Y.; He, L.; Hu, B. Croup and pertussis cough sound classification algorithm based on channel attention and multiscale Mel-spectrogram. Biomed. Signal Process. Control 2024, 91, 106073.
- Kim, B.J.; Kim, B.S.; Mun, J.H.; Lim, C.; Kim, K. An accurate deep learning model for wheezing in children using real world data. Sci. Rep. 2022, 12, 22465.
- Petmezas, G.; Cheimariotis, G.A.; Stefanopoulos, L.; Rocha, B.; Paiva, R.P.; Katsaggelos, A.K.; Maglaveras, N. Automated Lung Sound Classification Using a Hybrid CNN-LSTM Network and Focal Loss Function. Sensors 2022, 22, 1232.
- Choi, Y.; Lee, H. Interpretation of lung disease classification with light attention connected module. Biomed. Signal Process. Control 2023, 84, 104695.
- Serrurier, A.; Neuschaefer-Rube, C.; Röhrig, R. Past and Trends in Cough Sound Acquisition, Automatic Detection and Automatic Classification: A Comparative Review. Sensors 2022, 22, 2896.
- Xia, T.; Han, J.; Mascolo, C. Exploring machine learning for audio-based respiratory condition screening: A concise review of databases, methods, and open issues. Exp. Biol. Med. 2022, 247, 2053–2061.
- Orlandic, L.; Teijeiro, T.; Atienza, D. The COUGHVID crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms. Sci. Data 2021, 8, 156.
- Sharma, N.; Krishnan, P.; Kumar, R.; Ramoji, S.; Chetupalli, S.R.; R., N.; Ghosh, P.K.; Ganapathy, S. Coswara—A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis. In Proceedings of the Interspeech 2020, Virtual Event, Shanghai, China, 25–29 October 2020; pp. 4811–4815.
- Brown, C.; Chauhan, J.; Grammenos, A.; Han, J.; Hasthanasombat, A.; Spathis, D.; Xia, T.; Cicuta, P.; Mascolo, C. Exploring Automatic Diagnosis of COVID-19 from Crowdsourced Respiratory Sound Data. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 6–10 July 2020; pp. 3474–3484.
- Fakhry, A.; Jiang, X.; Xiao, J.; Chaudhari, G.; Han, A.; Khanzada, A. Virufy: A Multi-Branch Deep Learning Network for Automated Detection of COVID-19. arXiv 2021, arXiv:2103.01806.
- Meister, J.A.; Nguyen, K.A.; Luo, Z. Audio feature ranking for sound-based COVID-19 patient detection. arXiv 2021, arXiv:2104.07128.
- Pahar, M.; Klopper, M.; Warren, R.; Niesler, T. COVID-19 cough classification using machine learning and global smartphone recordings. Comput. Biol. Med. 2021, 135, 104572.
- Topuz, E.K.; Kaya, Y. SUPER-COUGH: A Super Learner-based ensemble machine learning method for detecting disease on cough acoustic signals. Biomed. Signal Process. Control 2024, 93, 106165.
- Rao, S.; Narayanaswamy, V.; Esposito, M.; Thiagarajan, J.; Spanias, A. Deep Learning with hyper-parameter tuning for COVID-19 Cough Detection. In Proceedings of the 2021 12th International Conference on Information, Intelligence, Systems & Applications (IISA), Chania, Crete, Greece, 12–14 July 2021; pp. 1–5.
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
- Xia, T.; Spathis, D.; Brown, C.; Chauhan, J.; Grammenos, A.; Han, J.; Hasthanasombat, A.; Bondareva, E.; Dang, T.; Floto, A.; et al. COVID-19 Sounds: A Large-Scale Audio Dataset for Digital Respiratory Screening. In Proceedings of the 35th Conference on Neural Information Processing Systems Datasets and Benchmarks Track, Virtual, 6–14 December 2021; pp. 1–13.
- Wall, C.; Zhang, L.; Yu, Y.; Kumar, A.; Gao, R. A Deep Ensemble Neural Network with Attention Mechanisms for Lung Abnormality Classification Using Audio Inputs. Sensors 2022, 22, 5566.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11.
- Truong, T.; Mohammadi, S.; Lenga, M. How Transferable are Self-supervised Features in Medical Image Classification Tasks. In Machine Learning for Health; PMLR: London, UK, 2021; pp. 54–74. ISSN 2640-3498.
- Wanasinghe, T.; Bandara, S.; Madusanka, S.; Meedeniya, D.; Bandara, M.; de la Torre Díez, I. Lung sound classification with multi-feature integration utilizing lightweight CNN model. IEEE Access 2024, 12, 21262–21276.
- Griffin, D.; Lim, J. Signal estimation from modified short-time Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 1984, 32, 236–243.
- Baevski, A.; Zhou, H.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv 2020, arXiv:2006.11477.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; Volume 139, pp. 10347–10357.
- Bhattacharya, D.; Sharma, N.K.; Dutta, D.; Chetupalli, S.R.; Mote, P.; Ganapathy, S.; Chandrakiran, C.; Nori, S.; Suhail, K.K.; Gonuguntla, S.; et al. Coswara: A respiratory sounds and symptoms dataset for remote screening of SARS-CoV-2 infection. Sci. Data 2023, 10, 397.
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2019, arXiv:1711.05101.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
Body Sound | Min (s) | Max (s) | Median (s) | Mean (s)
---|---|---|---|---
Heavy cough | 1.58 | 30.04 | 6.06 | 6.27
Deep breath | 2.65 | 30.04 | 16.30 | 17.08
Normal counting | 1.62 | 29.95 | 14.34 | 14.58
Fast counting | 1.86 | 29.95 | 7.94 | 8.00
Phoneme /a/ | 1.19 | 29.95 | 10.03 | 10.53
Phoneme /e/ | 1.28 | 29.95 | 10.96 | 11.73
Phoneme /o/ | 1.37 | 29.95 | 10.41 | 11.19
No. | Representation | Architecture | Body Sound Fusion | No. Models
---|---|---|---|---
BA1 | Waveform | wav2vec | None | 7
BA2 | Spectrogram | DeiT-S/16 | None | 7
BE1 | Waveform | wav2vec | Attention | 5
BE2 | Spectrogram | DeiT-S/16 | Attention | 5
BE3 | Spectrogram & Waveform | DeiT-S/16 & wav2vec | Attention | 5
Subset | Label | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Trial 5
---|---|---|---|---|---|---
Train | Negative | 761 | 756 | 751 | 760 | 752
Train | Positive | 146 | 151 | 155 | 146 | 154
Validation | Negative | 184 | 189 | 194 | 185 | 193
Validation | Positive | 42 | 37 | 33 | 42 | 34
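The five trials above hold out roughly 20% of the 1133 subjects (945 negative, 188 positive) for validation while preserving the class ratio. A sketch of such a stratified split is shown below; the 20% validation fraction and subject-level stratification are assumptions inferred from the table, and the exact per-trial counts will differ from the paper's splits.

```python
# Sketch of five stratified train/validation splits (assumptions:
# ~20% held out per trial, stratified at subject level).
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

labels = np.array([0] * 945 + [1] * 188)   # 1133 subjects, ~17% positive
X = np.zeros((labels.size, 1))             # placeholder features
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

for trial, (train_idx, val_idx) in enumerate(splitter.split(X, labels), 1):
    pos = labels[val_idx].sum()
    print(f"Trial {trial}: train={len(train_idx)}, "
          f"val={len(val_idx)} ({pos} positive)")
```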
Architecture | wav2vec | wav2vec | DeiT-S/16 | DeiT-S/16 | FAIR
---|---|---|---|---|---
Body sound fusion | None | Attention | None | Attention | Attention
Optimizer | AdamW | AdamW | AdamW | AdamW | AdamW
Base learning rate | | | | |
Weight decay | | | | |
Optimizer momentum | (0.9, 0.99) | (0.9, 0.99) | (0.9, 0.99) | (0.9, 0.99) | (0.9, 0.99)
Batch size | 32 | 32 | 32 | 32 | 32
Training epochs | 30 | 30 | 30 | 30 | 30
Learning rate scheduler | cosine | cosine | cosine | cosine | cosine
Warmup epochs | 10 | 10 | 10 | 10 | 10
Loss function | BCE | BCE | BCE | BCE | BCE
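The table above translates into a standard PyTorch training configuration, sketched below. The base learning rate and weight decay cells are empty in the table, so the `1e-4` and `1e-2` values here are placeholders rather than the paper's settings, and composing warmup with cosine annealing via `SequentialLR` is one reasonable reading of "cosine scheduler with 10 warmup epochs".

```python
# Training setup from the hyperparameter table: AdamW with betas
# (0.9, 0.99), batch size 32, 30 epochs, cosine schedule with 10 warmup
# epochs, BCE loss. lr=1e-4 and weight_decay=1e-2 are placeholders.
import torch
import torch.nn as nn

model = nn.Linear(256, 1)  # stand-in for any of the benchmarked models
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.99), weight_decay=1e-2)

warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, total_iters=10)            # 10 warmup epochs
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[10])

criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on raw logits
for epoch in range(30):
    # ... one pass over the training loader (batch size 32) ...
    scheduler.step()
```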
Input Body Sound | wav2vec (BA1) | DeiT-S/16 (BA2)
---|---|---
Cough-heavy | 0.4574 ± 0.0093 | 0.7782 ± 0.0132
Breath-deep | 0.6597 ± 0.0222 | 0.7552 ± 0.0254
Counting-fast | 0.7090 ± 0.0136 | 0.7291 ± 0.0196
Counting-normal | 0.6285 ± 0.0155 | 0.7943 ± 0.0326
Phoneme /a/ | 0.6484 ± 0.0150 | 0.7418 ± 0.0399
Phoneme /e/ | 0.6209 ± 0.0197 | 0.7399 ± 0.0318
Phoneme /o/ | 0.5649 ± 0.0293 | 0.7457 ± 0.0288
Average | 0.6127 ± 0.0751 | 0.7549 ± 0.0215
Input Body Sounds | wav2vec (BE1) | DeiT-S/16 (BE2) | FAIR (BE3) | p-Value (BE2 vs. BE3)
---|---|---|---|---
Speech | 0.7562 ± 0.0152 | 0.8081 ± 0.0239 | 0.8434 ± 0.0290 | <0.001
Cough + Breath | 0.6739 ± 0.0435 | 0.7685 ± 0.0183 | 0.7585 ± 0.0174 | 0.5000
Cough + Speech | 0.7644 ± 0.0088 | 0.8315 ± 0.0306 | 0.8584 ± 0.0308 | 0.2460
Breath + Speech | 0.7682 ± 0.0149 | 0.8122 ± 0.0125 | 0.8319 ± 0.0187 | 0.0137
Cough + Breath + Speech | 0.7717 ± 0.0128 | 0.8241 ± 0.0266 | 0.8658 ± 0.0115 | 0.0019
Average | 0.7469 ± 0.0369 | 0.8089 ± 0.0218 | 0.8316 ± 0.0384 |
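For the last column, a paired comparison of BE2 and BE3 across the cross-validation trials can be sketched as below. The exact statistical test is not recoverable from this excerpt; a one-sided paired t-test over the five per-trial AUCs is shown as one plausible choice, with illustrative (not the paper's) AUC values.

```python
# Sketch of a BE2 vs. BE3 comparison (assumption: the paper's exact
# test is unknown; a one-sided paired t-test is one plausible choice).
from scipy import stats

auc_be2 = [0.79, 0.81, 0.83, 0.80, 0.84]  # illustrative per-trial AUCs
auc_be3 = [0.85, 0.86, 0.88, 0.86, 0.87]

t, p = stats.ttest_rel(auc_be3, auc_be2, alternative="greater")
print(f"one-sided p = {p:.4f}")
```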
Input Body Sound | DeiT-S/16 & wav2vec
---|---
Cough-heavy | 0.7426 ± 0.0268
Breath-deep | 0.7661 ± 0.0113
Counting-fast | 0.7698 ± 0.0204
Counting-normal | 0.7581 ± 0.0938
Phoneme /a/ | 0.7577 ± 0.0213
Phoneme /e/ | 0.7299 ± 0.0174
Phoneme /o/ | 0.7394 ± 0.0168
Average | 0.7519 ± 0.0137
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).