research-article

Open access

Pasta: A Cost-Based Optimizer for Generating Pipelining Schedules for Dataflow DAGs

Authors:

Sadeem Alsudais,

Chen LiAuthors Info & Claims

Proceedings of the ACM on Management of Data, Volume 2, Issue 6

Article No.: 248, Pages 1 - 26

https://doi.org/10.1145/3698832

Published: 20 December 2024 Publication History

Abstract

Data analytics tasks are often formulated as data workflows represented as directed acyclic graphs (DAGs) of operators. The recent trend of adopting machine learning (ML) techniques in workflows results in increasingly complicated DAGs with many operators and edges. Compared to the operator-at-a-time execution paradigm, pipelined execution has benefits of reducing the materialization cost of intermediate results and allowing operators to produce results early, which are critical in iterative analysis on large data volumes. Correctly scheduling a workflow DAG for pipelined execution is non-trivial due to the richer semantics of operators and the increasing complexity of DAGs. Several existing data systems adopt simple heuristics to solve the problem without considering costs such as materialization sizes. In this paper, we systematically study the problem of scheduling a workflow DAG for pipelined execution, and develop a novel cost-based optimizer called Pasta for generating a high-quality schedule. The Pasta optimizer is not only general and applicable to a wide variety of cost functions, but also capable of utilizing properties inherent in a broad class of cost functions to improve its performance significantly. We conducted a thorough evaluation of developed techniques on real-world workflows and show the efficiency and efficacy of these solutions.

References

[1]

Alteryx 2024. AI Analytics Platform - Alteryx, https://www.alteryx.com/.

[2]

Apache Flink 2024. Apache Flink® - Stateful Computations over Data Streams | Apache Flink, https://flink.apache.org.

[3]

Vinayak R. Borkar, Michael J. Carey, Raman Grover, Nicola Onose, and Rares Vernica. 2011. Hyracks: A flexible and extensible foundation for data-intensive computing. In Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11--16, 2011, Hannover, Germany, Serge Abiteboul, Klemens Böhm, Christoph Koch, and Kian-Lee Tan (Eds.). IEEE Computer Society, 1151--1162. https://doi.org/10.1109/ICDE.2011.5767921

Digital Library

[4]

Luc Bouganim, Daniela Florescu, and Patrick Valduriez. 1996. Dynamic Load Balancing in Hierarchical Parallel Database Systems. In VLDB'96, Proceedings of 22th International Conference on Very Large Data Bases, September 3--6, 1996, Mumbai (Bombay), India, T. M. Vijayaraman, Alejandro P. Buchmann, C. Mohan, and Nandlal L. Sarda (Eds.). Morgan Kaufmann, 436--447. http://www.vldb.org/conf/1996/P436.PDF

Digital Library

[5]

Nilesh N. Dalvi, Sumit K. Sanghai, Prasan Roy, and S. Sudarshan. 2001. Pipelining in Multi-Query Optimization. In Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 21--23, 2001, Santa Barbara, California, USA, Peter Buneman (Ed.). ACM. https://doi.org/10.1145/375551.375561

Digital Library

[6]

Behrouz Derakhshan, Alireza Rezaei Mahdiraji, Zoi Kaoudi, Tilmann Rabl, and Volker Markl. 2022. Materialization and Reuse Optimizations for Production Data Science Pipelines. In SIGMOD '22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022. ACM, 1962--1976. https://doi.org/10.1145/3514221.3526186

Digital Library

[7]

Docker Swarm 2024. Swarm mode | Docker Docs, https://docs.docker.com/engine/swarm/.

[8]

Yannis Foufoulas and Alkis Simitsis. 2023. User-Defined Functions in Modern Data Engines. In 39th IEEE International Conference on Data Engineering, ICDE 2023, Anaheim, CA, USA, April 3--7, 2023. IEEE, 3593--3598. https://doi.org/10.1109/ICDE55515.2023.00276

[9]

Gábor E. Gévay, Tilmann Rabl, Sebastian Breß, Lorand Madai-Tahy, Jorge-Arnulfo Quiané-Ruiz, and Volker Markl. 2021. Efficient Control Flow in Dataflow Systems: When Ease-of-Use Meets High Performance. In 37th IEEE International Conference on Data Engineering, ICDE 2021, Chania, Greece, April 19--22, 2021. IEEE, 1428--1439. https://doi.org/10.1109/ICDE51399.2021.00127

[10]

Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, and Janardhan Kulkarni. 2016. GRAPHENE: Packing and Dependency-Aware Scheduling for Data-Parallel Clusters. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2--4, 2016, Kimberly Keeton and Timothy Roscoe (Eds.). USENIX Association, 81--97. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/grandl_graphene

[11]

Waqar Hasan and Rajeev Motwani. 1994. Optimization Algorithms for Exploiting the Parallelism-Communication Tradeoff in Pipelined Parallelism. In VLDB'94, Proceedings of 20th International Conference on Very Large Data Bases, September 12--15, 1994, Santiago de Chile, Chile, Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo (Eds.). Morgan Kaufmann, 36--47. http://www.vldb.org/conf/1994/P036.PDF

Digital Library

[12]

Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy H. Katz, Scott Shenker, and Ion Stoica. 2011. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2011, Boston, MA, USA, March 30 - April 1, 2011, David G. Andersen and Sylvia Ratnasamy (Eds.). USENIX Association. https://www.usenix.org/conference/nsdi11/mesos-platform-fine-grained-resource-sharing-data-center

[13]

KNIME 2024. Open for Innovation | KNIME, https://www.knime.com/.

[14]

KNIME Community Workflows 2024. Workflows | KNIME Community Hub, https://hub.knime.com/search?type=Workflow.

[15]

Kubernetes 2024. Kubernetes, https://kubernetes.io/.

[16]

Avinash Kumar, Zuozhi Wang, Shengquan Ni, and Chen Li. 2020. Amber: A Debuggable Dataflow System Based on the Actor Model. Proc. VLDB Endow. 13, 5 (2020), 740--753. https://doi.org/10.14778/3377369.3377381

Digital Library

[17]

V. S. Anil Kumar, Madhav V. Marathe, Srinivasan Parthasarathy, and Aravind Srinivasan. 2009. Scheduling on Unrelated Machines under Tree-Like Precedence Constraints. Algorithmica 55, 1 (2009), 205--226. https://doi.org/10.1007/S00453-007--9004-Y

Digital Library

[18]

Xiaozhen Liu, Zuozhi Wang, Shengquan Ni, Sadeem Alsudais, Yicong Huang, Avinash Kumar, and Chen Li. 2022. Demonstration of Collaborative and Interactive Workflow-Based Data Analytics in Texera. Proc. VLDB Endow. 15, 12 (2022), 3738--3741. https://www.vldb.org/pvldb/vol15/p3738-liu.pdf

Digital Library

[19]

Ming-Ling Lo, Ming-Syan Chen, Chinya V. Ravishankar, and Philip S. Yu. 1993. On Optimal Processor Allocation to Support Pipelined Hash Joins. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC, USA, May 26--28, 1993, Peter Buneman and Sushil Jajodia (Eds.). ACM Press, 69--78. https://doi.org/10.1145/170035.170053

Digital Library

[20]

Pipelined Regions in Apache Flink 2020. Improvements in task scheduling for batch workloads in Apache Flink 1.12, https://flink.apache.org/2020/12/02/improvements-in-task-scheduling-for-batch-workloads-in-apache-flink-1.12/.

[21]

Raghu Ramakrishnan and Johannes Gehrke. 2002. Database Management Systems. WCB/McGraw-Hill.

Digital Library

[22]

RapidMiner 2024. Data Analytics and AI Platform | Altair RapidMiner, https://rapidminer.com/.

[23]

Vladislav Shkapenyuk, Ryan Williams, Stavros Harizopoulos, and Anastassia Ailamaki. 2005. Deadlock resolution in pipelined query graphs. Carnegie Mellon University Technical Report.

[24]

Evan R. Sparks, Shivaram Venkataraman, Tomer Kaftan, Michael J. Franklin, and Benjamin Recht. 2017. KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics. In 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19--22, 2017. IEEE Computer Society, 535--546. https://doi.org/10.1109/ICDE.2017.109

[25]

Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin C. Reed, and Eric Baldeschwieler. 2013. Apache Hadoop YARN: yet another resource negotiator. In ACM Symposium on Cloud Computing, SOCC '13, Santa Clara, CA, USA, October 1--3, 2013, Guy M. Lohman (Ed.). ACM, 5:1--5:16. https://doi.org/10.1145/2523616.2523633

Digital Library

[26]

Zuozhi Wang, Yicong Huang, Shengquan Ni, Avinash Kumar, Sadeem Alsudais, Xiaozhen Liu, Xinyuan Lin, Yunyan Ding, and Chen Li. 2024. Texera: A System for Collaborative and Interactive Data Analytics Using Workflows. Proc. VLDB Endow. 17, 11 (2024), 3580--3588. https://www.vldb.org/pvldb/vol17/p3580-wang.pdf

Digital Library

[27]

Zuozhi Wang and Chen Li. 2023. Building a Collaborative Data Analytics System: Opportunities and Challenges. Proc.VLDB Endow. 16, 12 (2023), 3898--3901. https://doi.org/10.14778/3611540.3611580

Digital Library

[28]

Yinggen Xu, Liu Liu, and Zhijun Ding. 2020. DAG-Aware Joint Task Scheduling and Cache Management in Spark Clusters. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), New Orleans, LA, USA, May 18--22, 2020. IEEE, 378--387. https://doi.org/10.1109/IPDPS47924.2020.00047

[29]

Zhihui Yang, Zuozhi Wang, Yicong Huang, Yao Lu, Chen Li, and X. Sean Wang. 2022. Optimizing Machine Learning Inference Queries with Correlative Proxy Models. Proc. VLDB Endow. 15, 10 (2022), 2032--2044. https://www.vldb.org/pvldb/vol15/p2032-yang.pdf

Digital Library

[30]

Shaoyi Yin, Abdelkader Hameurlain, and Franck Morvan. 2015. Robust Query Optimization Methods With Respect to Estimation Errors: A Survey. SIGMOD Rec. 44, 3 (2015), 25--36. https://doi.org/10.1145/2854006.2854012

Digital Library

[31]

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012, San Jose, CA, USA, April 25--27, 2012. 15--28.

Digital Library

Cited By

Wu XGong QChen JLiu QPodhorszki NLiang XKlasky S(2024)Error-controlled Progressive Retrieval of Scientific Data under Derivable Quantities of InterestProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00092(1-16)Online publication date: 17-Nov-2024
https://dl.acm.org/doi/10.1109/SC41406.2024.00092

Index Terms

Pasta: A Cost-Based Optimizer for Generating Pipelining Schedules for Dataflow DAGs
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
        Query optimization
  2. Information systems applications
    1. Computing platforms
    2. Decision support systems
      1. Data analytics
2. Software and its engineering
  1. Software organization and properties
    1. Software system structures
      1. Software architectures
        Data flow architectures

Recommendations

Area-maximizing schedules for series-parallel DAGs
Euro-Par'10: Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II

Earlier work introduced a new optimization goal for DAG schedules: the "AREA" of the schedule. AREA-maximizing schedules are intended for computational environments--such as Internet-based computing and massively multicore computers--that benefit from ...
Cost-Time Efficient Scheduling Plan for Executing Workflows in the Cloud

The emergence of Cloud Computing as a model of service provisioning in distributed systems instigated researchers to explore its pros and cons on executing different large scale scientific applications, i.e., Workflows. One of the most challenging ...
Preemptive Hadoop Jobs Scheduling under a Deadline
SKG '12: Proceedings of the 2012 Eighth International Conference on Semantics, Knowledge and Grids

MapReduce has become the dominant programming model in a cloud-based data processing environment, such as Hadoop. First In First Out (FIFO) is the default job scheduling policy of Hadoop, but it cannot guarantee that the job will be completed by a ...

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Management of Data

Proceedings of the ACM on Management of Data Volume 2, Issue 6

SIGMOD

December 2024

792 pages

EISSN:2836-6573

DOI:10.1145/3709598

Editor:
Divyakant Agrawal
University of California, Santa Barbara, United States

Issue’s Table of Contents

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 December 2024

Published in PACMMOD Volume 2, Issue 6

Author Tags

Qualifiers

Research-article

Funding Sources

NSF
NIH

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
148
Total Downloads

Downloads (Last 12 months)148
Downloads (Last 6 weeks)148

Reflects downloads up to 15 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Wu XGong QChen JLiu QPodhorszki NLiang XKlasky S(2024)Error-controlled Progressive Retrieval of Scientific Data under Derivable Quantities of InterestProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00092(1-16)Online publication date: 17-Nov-2024
https://dl.acm.org/doi/10.1109/SC41406.2024.00092

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents