Machine learning for data streams (MLDS) attempts to extract knowledge from a stream of non-IID data. It has been a significant research area since the late 1990s, with increasing adoption in industry over the past few years due to the emergence of Industry 4.0, where more industrial processes are monitored online. Practitioners face challenges such as detecting and adapting to concept drift, continuously evolving models, and learning from partially labeled and unlabeled data.
Despite commendable efforts in open-source libraries, a gap persists between pioneering research and accessible tools, which makes it hard for practitioners, including experienced data scientists, to implement and evaluate methods in this complex domain. Our tutorial addresses this gap with a dual focus. We discuss advanced research topics, such as partially delayed labeled streams, while providing practical demonstrations of their implementation and assessment using Python. By catering to both researchers and practitioners, this tutorial aims to empower users to design and conduct experiments and to extend existing methodologies.
In this tutorial, our objective is to familiarize attendees with applying diverse machine-learning tasks to streaming data. Beyond an introductory overview, where we delineate the learning cycle of typical supervised learning tasks, we focus on pertinent challenges such as: prediction intervals for regression tasks; concept drift detection, visualization, and evaluation; modelling and addressing partially and delayed labeled data streams using semi-supervised learning; and the idiosyncrasies of applying clustering to a data stream.
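As a rough illustration of the supervised learning cycle on a stream, the sketch below uses the common test-then-train (prequential) scheme: each instance is first used to test the model and only then to train it. It is plain Python and is not taken from the tutorial notebooks. The synthetic stream, `WindowedCentroidClassifier`, and `prequential_accuracy` are hypothetical names introduced only for this example, and the abrupt drift at instance 1000 is an assumption chosen to show how a windowed learner can recover.

```python
import random
from collections import deque


def synthetic_stream(n_instances, drift_at, seed=42):
    """Yield (x, y) pairs; the labelling rule is inverted at `drift_at`."""
    rng = random.Random(seed)
    for i in range(n_instances):
        x = rng.random()
        y = int(x > 0.5)
        if i >= drift_at:  # abrupt concept drift (assumed for illustration)
            y = 1 - y
        yield x, y


class WindowedCentroidClassifier:
    """Hypothetical learner: nearest class centroid over a sliding window."""

    def __init__(self, window_size=100):
        # Only the most recent labelled instances are kept, so old
        # concepts are forgotten as new data arrives.
        self.window = deque(maxlen=window_size)

    def predict(self, x):
        centroids = {}
        for label in (0, 1):
            values = [xi for xi, yi in self.window if yi == label]
            if values:
                centroids[label] = sum(values) / len(values)
        if not centroids:
            return 0  # nothing seen yet: fall back to a default class
        return min(centroids, key=lambda label: abs(x - centroids[label]))

    def train(self, x, y):
        self.window.append((x, y))


def prequential_accuracy(stream, learner, report_every=500):
    """Test-then-train loop returning (instances seen, accuracy, checkpoints)."""
    correct = 0
    seen = 0
    checkpoints = []
    for x, y in stream:
        correct += int(learner.predict(x) == y)  # 1) test on the new instance
        learner.train(x, y)                      # 2) then train on it
        seen += 1
        if seen % report_every == 0:
            checkpoints.append((seen, correct / seen))
    accuracy = correct / seen if seen else 0.0
    return seen, accuracy, checkpoints


seen, accuracy, checkpoints = prequential_accuracy(
    synthetic_stream(n_instances=2000, drift_at=1000),
    WindowedCentroidClassifier(window_size=100),
)
for n, acc in checkpoints:
    print(f"instances={n} cumulative accuracy={acc:.3f}")
```

Because every instance is tested before it is learned from, no separate hold-out set is needed, and the cumulative accuracy at each checkpoint reflects how the learner coped with the stream up to that point, including the period right after the drift.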
- IJCAI_2024_introduction.ipynb
- IJCAI_2024_drift.ipynb
- IJCAI_2024_supervised.ipynb
- IJCAI_2024_prediction_intervals.ipynb
- IJCAI_2024_advanced.ipynb
Heitor is a senior lecturer at the Victoria University of Wellington (VuW) in New Zealand. Before joining VuW, Heitor was a senior research fellow and co-director of the AI Institute at the University of Waikato, where he taught the "Data Stream Mining" (COMPX523) course from 2020 to 2022. Heitor's main research area is the application of machine learning for data streams in a variety of tasks. In this field, he has contributed to ensemble learning for both regression and classification tasks and worked on unsupervised drift detection, and in 2023 he was awarded a grant to conduct research on developing novel theories and algorithms for partially delayed labeled streams. Besides serving as a PC member for many conferences (KDD, IJCAI, ECML, PAKDD, ...), Heitor is also an active contributor to open-source projects such as MOA (Massive Online Analysis), StreamDM (a real-time analytics open-source software library built on top of Spark Streaming), and river (where he has supervised students and postdocs since the inception of the project).
Nuwan earned his PhD in "Advanced Adaptive Classifier Methods for Data Streams" from the University of Waikato, and he currently works at the AI Institute of the same university. His research interests primarily revolve around Stream Learning, Online Continual Learning, and Online Streaming Continual Learning. He has delivered a guest lecture and talks at the University of Waikato's Data Stream Mining (COMPX523 Masters) course and Cardiff University's Machine Learning Seminar.
Nuwan has contributed to this field by working on streaming gradient boosted trees for classification and regression, developing neural network-based methods for data streams, and exploring the intersection between Stream Learning and Online Continual Learning.
Nuwan's research has appeared in venues such as IJCAI, Springer Machine Learning, and IJCNN. He also served as a PC member for the IJCAI 2024 Survey Track. Nuwan actively contributes to the MOA (Massive Online Analysis) and CapyMOA stream learning platforms.
Professor Albert Bifet is the Director of the Te Ipu o te Mahara AI Institute at the University of Waikato and Co-chair of the Artificial Intelligence Researchers Association (AIRA). His research focuses on Artificial Intelligence, Big Data Science, and Machine Learning for Data Streams. He is leading the TAIAO Environmental Data Science project and co-leading the open-source projects MOA (Massive Online Analysis), StreamDM for Spark Streaming, and SAMOA (Scalable Advanced Massive Online Analysis). He is the co-author of a book on Machine Learning from Data Streams published by MIT Press. He is one of the winners of the best paper award at the ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) 2023, and he will be the general co-chair of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD) 2024.
Bernhard Pfahringer received his PhD degree from the University of Technology in Vienna, Austria, in 1995. He is a Professor in the Department of Computer Science and a co-director of the AI Institute at the University of Waikato in New Zealand. His interests span a range of data mining and machine learning sub-fields, with a focus on streaming, randomization, and complex data. Bernhard is the co-author of the book "Machine Learning from Data Streams", published by MIT Press.
Short list of tutorials:
- Bifet A., Pfahringer B.: Hands-on Tutorial on Massive Online Analytics. KDD 2017.
- Weka: A Tool for Exploratory Data Mining. IEEE Symposium Series on Computational Intelligence 2007.
- Witten I.H., Frank E., Pfahringer B., Hall M.: Inside WEKA – and Beyond the Book, Tutorial at ICML 2002.