skip to main content
research-article
Open access

Pasta: A Cost-Based Optimizer for Generating Pipelining Schedules for Dataflow DAGs

Published: 20 December 2024 Publication History

Abstract

Data analytics tasks are often formulated as data workflows represented as directed acyclic graphs (DAGs) of operators. The recent trend of adopting machine learning (ML) techniques in workflows results in increasingly complicated DAGs with many operators and edges. Compared to the operator-at-a-time execution paradigm, pipelined execution has benefits of reducing the materialization cost of intermediate results and allowing operators to produce results early, which are critical in iterative analysis on large data volumes. Correctly scheduling a workflow DAG for pipelined execution is non-trivial due to the richer semantics of operators and the increasing complexity of DAGs. Several existing data systems adopt simple heuristics to solve the problem without considering costs such as materialization sizes. In this paper, we systematically study the problem of scheduling a workflow DAG for pipelined execution, and develop a novel cost-based optimizer called Pasta for generating a high-quality schedule. The Pasta optimizer is not only general and applicable to a wide variety of cost functions, but also capable of utilizing properties inherent in a broad class of cost functions to improve its performance significantly. We conducted a thorough evaluation of developed techniques on real-world workflows and show the efficiency and efficacy of these solutions.

References

[1]
Alteryx 2024. AI Analytics Platform - Alteryx, https://www.alteryx.com/.
[2]
Apache Flink 2024. Apache Flink® - Stateful Computations over Data Streams | Apache Flink, https://flink.apache.org.
[3]
Vinayak R. Borkar, Michael J. Carey, Raman Grover, Nicola Onose, and Rares Vernica. 2011. Hyracks: A flexible and extensible foundation for data-intensive computing. In Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11--16, 2011, Hannover, Germany, Serge Abiteboul, Klemens Böhm, Christoph Koch, and Kian-Lee Tan (Eds.). IEEE Computer Society, 1151--1162. https://doi.org/10.1109/ICDE.2011.5767921
[4]
Luc Bouganim, Daniela Florescu, and Patrick Valduriez. 1996. Dynamic Load Balancing in Hierarchical Parallel Database Systems. In VLDB'96, Proceedings of 22th International Conference on Very Large Data Bases, September 3--6, 1996, Mumbai (Bombay), India, T. M. Vijayaraman, Alejandro P. Buchmann, C. Mohan, and Nandlal L. Sarda (Eds.). Morgan Kaufmann, 436--447. http://www.vldb.org/conf/1996/P436.PDF
[5]
Nilesh N. Dalvi, Sumit K. Sanghai, Prasan Roy, and S. Sudarshan. 2001. Pipelining in Multi-Query Optimization. In Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 21--23, 2001, Santa Barbara, California, USA, Peter Buneman (Ed.). ACM. https://doi.org/10.1145/375551.375561
[6]
Behrouz Derakhshan, Alireza Rezaei Mahdiraji, Zoi Kaoudi, Tilmann Rabl, and Volker Markl. 2022. Materialization and Reuse Optimizations for Production Data Science Pipelines. In SIGMOD '22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022. ACM, 1962--1976. https://doi.org/10.1145/3514221.3526186
[7]
Docker Swarm 2024. Swarm mode | Docker Docs, https://docs.docker.com/engine/swarm/.
[8]
Yannis Foufoulas and Alkis Simitsis. 2023. User-Defined Functions in Modern Data Engines. In 39th IEEE International Conference on Data Engineering, ICDE 2023, Anaheim, CA, USA, April 3--7, 2023. IEEE, 3593--3598. https://doi.org/10.1109/ICDE55515.2023.00276
[9]
Gábor E. Gévay, Tilmann Rabl, Sebastian Breß, Lorand Madai-Tahy, Jorge-Arnulfo Quiané-Ruiz, and Volker Markl. 2021. Efficient Control Flow in Dataflow Systems: When Ease-of-Use Meets High Performance. In 37th IEEE International Conference on Data Engineering, ICDE 2021, Chania, Greece, April 19--22, 2021. IEEE, 1428--1439. https://doi.org/10.1109/ICDE51399.2021.00127
[10]
Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, and Janardhan Kulkarni. 2016. GRAPHENE: Packing and Dependency-Aware Scheduling for Data-Parallel Clusters. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2--4, 2016, Kimberly Keeton and Timothy Roscoe (Eds.). USENIX Association, 81--97. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/grandl_graphene
[11]
Waqar Hasan and Rajeev Motwani. 1994. Optimization Algorithms for Exploiting the Parallelism-Communication Tradeoff in Pipelined Parallelism. In VLDB'94, Proceedings of 20th International Conference on Very Large Data Bases, September 12--15, 1994, Santiago de Chile, Chile, Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo (Eds.). Morgan Kaufmann, 36--47. http://www.vldb.org/conf/1994/P036.PDF
[12]
Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy H. Katz, Scott Shenker, and Ion Stoica. 2011. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2011, Boston, MA, USA, March 30 - April 1, 2011, David G. Andersen and Sylvia Ratnasamy (Eds.). USENIX Association. https://www.usenix.org/conference/nsdi11/mesos-platform-fine-grained-resource-sharing-data-center
[13]
KNIME 2024. Open for Innovation | KNIME, https://www.knime.com/.
[14]
KNIME Community Workflows 2024. Workflows | KNIME Community Hub, https://hub.knime.com/search?type=Workflow.
[15]
Kubernetes 2024. Kubernetes, https://kubernetes.io/.
[16]
Avinash Kumar, Zuozhi Wang, Shengquan Ni, and Chen Li. 2020. Amber: A Debuggable Dataflow System Based on the Actor Model. Proc. VLDB Endow. 13, 5 (2020), 740--753. https://doi.org/10.14778/3377369.3377381
[17]
V. S. Anil Kumar, Madhav V. Marathe, Srinivasan Parthasarathy, and Aravind Srinivasan. 2009. Scheduling on Unrelated Machines under Tree-Like Precedence Constraints. Algorithmica 55, 1 (2009), 205--226. https://doi.org/10.1007/S00453-007--9004-Y
[18]
Xiaozhen Liu, Zuozhi Wang, Shengquan Ni, Sadeem Alsudais, Yicong Huang, Avinash Kumar, and Chen Li. 2022. Demonstration of Collaborative and Interactive Workflow-Based Data Analytics in Texera. Proc. VLDB Endow. 15, 12 (2022), 3738--3741. https://www.vldb.org/pvldb/vol15/p3738-liu.pdf
[19]
Ming-Ling Lo, Ming-Syan Chen, Chinya V. Ravishankar, and Philip S. Yu. 1993. On Optimal Processor Allocation to Support Pipelined Hash Joins. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC, USA, May 26--28, 1993, Peter Buneman and Sushil Jajodia (Eds.). ACM Press, 69--78. https://doi.org/10.1145/170035.170053
[20]
Pipelined Regions in Apache Flink 2020. Improvements in task scheduling for batch workloads in Apache Flink 1.12, https://flink.apache.org/2020/12/02/improvements-in-task-scheduling-for-batch-workloads-in-apache-flink-1.12/.
[21]
Raghu Ramakrishnan and Johannes Gehrke. 2002. Database Management Systems. WCB/McGraw-Hill.
[22]
RapidMiner 2024. Data Analytics and AI Platform | Altair RapidMiner, https://rapidminer.com/.
[23]
Vladislav Shkapenyuk, Ryan Williams, Stavros Harizopoulos, and Anastassia Ailamaki. 2005. Deadlock resolution in pipelined query graphs. Carnegie Mellon University Technical Report.
[24]
Evan R. Sparks, Shivaram Venkataraman, Tomer Kaftan, Michael J. Franklin, and Benjamin Recht. 2017. KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics. In 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19--22, 2017. IEEE Computer Society, 535--546. https://doi.org/10.1109/ICDE.2017.109
[25]
Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin C. Reed, and Eric Baldeschwieler. 2013. Apache Hadoop YARN: yet another resource negotiator. In ACM Symposium on Cloud Computing, SOCC '13, Santa Clara, CA, USA, October 1--3, 2013, Guy M. Lohman (Ed.). ACM, 5:1--5:16. https://doi.org/10.1145/2523616.2523633
[26]
Zuozhi Wang, Yicong Huang, Shengquan Ni, Avinash Kumar, Sadeem Alsudais, Xiaozhen Liu, Xinyuan Lin, Yunyan Ding, and Chen Li. 2024. Texera: A System for Collaborative and Interactive Data Analytics Using Workflows. Proc. VLDB Endow. 17, 11 (2024), 3580--3588. https://www.vldb.org/pvldb/vol17/p3580-wang.pdf
[27]
Zuozhi Wang and Chen Li. 2023. Building a Collaborative Data Analytics System: Opportunities and Challenges. Proc.VLDB Endow. 16, 12 (2023), 3898--3901. https://doi.org/10.14778/3611540.3611580
[28]
Yinggen Xu, Liu Liu, and Zhijun Ding. 2020. DAG-Aware Joint Task Scheduling and Cache Management in Spark Clusters. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), New Orleans, LA, USA, May 18--22, 2020. IEEE, 378--387. https://doi.org/10.1109/IPDPS47924.2020.00047
[29]
Zhihui Yang, Zuozhi Wang, Yicong Huang, Yao Lu, Chen Li, and X. Sean Wang. 2022. Optimizing Machine Learning Inference Queries with Correlative Proxy Models. Proc. VLDB Endow. 15, 10 (2022), 2032--2044. https://www.vldb.org/pvldb/vol15/p2032-yang.pdf
[30]
Shaoyi Yin, Abdelkader Hameurlain, and Franck Morvan. 2015. Robust Query Optimization Methods With Respect to Estimation Errors: A Survey. SIGMOD Rec. 44, 3 (2015), 25--36. https://doi.org/10.1145/2854006.2854012
[31]
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012, San Jose, CA, USA, April 25--27, 2012. 15--28.

Cited By

View all
  • (2024)Error-controlled Progressive Retrieval of Scientific Data under Derivable Quantities of InterestProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00092(1-16)Online publication date: 17-Nov-2024

Recommendations

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Management of Data
Proceedings of the ACM on Management of Data  Volume 2, Issue 6
SIGMOD
December 2024
792 pages
EISSN:2836-6573
DOI:10.1145/3709598
Issue’s Table of Contents
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 December 2024
Published in PACMMOD Volume 2, Issue 6

Author Tags

  1. data engine
  2. physical plan
  3. pipelined execution
  4. scheduler
  5. workflow

Qualifiers

  • Research-article

Funding Sources

  • NSF
  • NIH

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)148
  • Downloads (Last 6 weeks)148
Reflects downloads up to 15 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Error-controlled Progressive Retrieval of Scientific Data under Derivable Quantities of InterestProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00092(1-16)Online publication date: 17-Nov-2024

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media