
Incremental Data Drifting: Evaluation Metrics, Data Generation, and Approach Comparison

Published: 25 July 2024

Abstract

Incremental data drifting is a common problem when deploying a machine-learning model in industrial applications: the underlying data distribution evolves gradually, e.g., users change their buying preferences on an e-commerce website over time. The drift must be addressed to maintain high performance. Existing studies of incremental data drifting suffer from several issues. First, there is a lack of clearly defined incremental drift datasets for examination. Prior efforts rely on either real collected datasets or synthetic datasets, and each has an obvious limitation: for real datasets, when drifts occur and of which type they are is unknown, while simple synthetic datasets cannot reflect the complex representations we normally face in the real world. Second, there is no well-defined protocol for evaluating a learner's knowledge-transfer capability on an incremental drift dataset. To provide a holistic discussion of these issues, we create approaches that generate datasets with specific drift types and define a novel protocol for evaluation. In addition, we investigate recent advances in transfer learning, including Domain Adaptation and Lifelong Learning, and examine how they perform in the presence of incremental data drifting. The results reveal the relationships among drift types, knowledge preservation, and learning approaches.
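To make the setting concrete, below is a minimal Python sketch of the kind of scenario the abstract describes. It is an illustrative assumption on our part, not the paper's actual data generator or evaluation protocol: it synthesizes incremental covariate drift by sliding one class of a two-class Gaussian toy stream across time segments, then contrasts a model trained once on the first segment with a simple test-then-train (prequential) learner that adapts segment by segment. The names make_segment, drift_per_step, and the segment count T are hypothetical.

```python
# Illustrative sketch only (assumed toy generator + scikit-learn baselines;
# NOT the paper's data-generation method or evaluation protocol).
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier

rng = np.random.default_rng(0)

def make_segment(t, n=2000, drift_per_step=0.5):
    """One time segment: class 0's mean drifts toward class 1 as t grows,
    emulating a gradually evolving data distribution."""
    shift = t * drift_per_step
    X0 = rng.normal(loc=[0.0 + shift, 0.0 + shift], scale=1.0, size=(n // 2, 2))
    X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(n // 2, 2))
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])
    return X, y

T = 5
segments = [make_segment(t) for t in range(T)]

# Static baseline: fit once on the first segment and never update.
static = LogisticRegression().fit(*segments[0])

# Incremental baseline: start from the first segment, then adapt
# segment by segment with partial_fit.
inc = SGDClassifier(loss="log_loss", random_state=0)
inc.partial_fit(*segments[0], classes=np.array([0.0, 1.0]))

# Test-then-train (prequential) evaluation: score each new segment
# before the incremental learner is allowed to train on it.
for t in range(1, T):
    X, y = segments[t]
    print(f"t={t}  static acc={static.score(X, y):.3f}  "
          f"incremental acc={inc.score(X, y):.3f}")
    inc.partial_fit(X, y)
```

On this toy stream, the statically trained model's accuracy decays as the drifting class crosses its frozen decision boundary, while the prequential learner tracks the shift at the cost of one segment of lag, illustrating the kind of trade-off between knowledge preservation and adaptation that the article examines.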



      Published In

      ACM Transactions on Intelligent Systems and Technology, Volume 15, Issue 4
      August 2024, 563 pages
      EISSN: 2157-6912
      DOI: 10.1145/3613644
      Editor: Huan Liu

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 25 July 2024
      Online AM: 24 May 2024
      Accepted: 28 February 2024
      Revised: 23 January 2024
      Received: 18 April 2023
      Published in TIST Volume 15, Issue 4


      Author Tags

      1. Concept drift
      2. incremental data drift
      3. data generation

      Qualifiers

      • Research-article

      Funding Sources

      • National Science and Technology Council (NSTC) of Taiwan

