CS854-Fall24: Advanced Topics in Computer Systems --- Model Serving Systems for GenAI

Administrivia

Lectures: DC2568, Friday: 1:30 PM – 4:20 PM

Instructor

Email: honzhang at uwaterloo dot ca (to submit presentation slides and paper summaries)

Office: DC3530 (office hours by appointment)

Course Description

Generative AI (GenAI) applications are revolutionizing the world. The latest GenAI models such as GPT-4 have achieved unprecedented performance in various tasks such as code generation, text classification, and problem reasoning. However, serving GenAI applications, i.e., deploying trained GenAI models on a compute cluster and conducting model inference for incoming user requests, presents challenges in systems design.

This seminar-based course will introduce you to the key concepts and the state of the art in model serving systems for GenAI, and encourage you to think about either building new tools or applying an existing one in your own research. The course will cover important topics in serving systems for GenAI, including efficient batching, memory/cache management, request scheduling and load balancing, and compound AI systems such as Retrieval-Augmented Generation. Note that this course is NOT focused on AI methods. Instead, we will focus on how to build efficient serving systems for existing AI methods.

Prerequisites

Students are expected to have strong programming skills and have completed at least one undergraduate-level systems-related course, such as Operating Systems, Databases, Distributed Systems, or Computer Networks. While an undergraduate course in Machine Learning or Artificial Intelligence would be beneficial, it is not a requirement.

Textbook and Exams

This course has no textbooks or exams. We will read recent papers to understand trends and important topics in serving systems for GenAI.

Tentative Schedule and Reading List

This is an evolving list and the schedule is subject to changes.

Date	Readings	Presenter	Reviewer
Sept 6	Introduction	Hong
	How to Read a Paper
	How to Give a Bad Talk
	Writing Reviews for Systems Conferences
	LLM Inference Serving: Survey of Recent Advances and Opportunities
	Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
	The Datacenter as a Computer (Chapters 1 and 2, Optional)
	The Llama 3 Herd of Models (Optional)
Sept 13	Serving Systems for GenAI vs. Serving Systems for Traditional DNNs
	The Illustrated Transformer (Required)	Xiaodian
	The Illustrated GPT2 (Optional)
	Attention is All You Need (Optional)
	Serving DNNs like Clockwork: Performance Predictability from the Bottom Up (Required)	Hong
	Orca: A Distributed Serving System for Transformer-Based Generative Models (Required)	Hong
Sept 20	Memory Management
	Efficient Memory Management for Large Language Model Serving with PagedAttention (Required)	Prashanth	Sairaj
	vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention (Required)	Wenhao	Xiaodian
	FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (Required)	Hongzhou
	LLM in a flash: Efficient Large Language Model Inference with Limited Memory (Optional)
Sept 27	Prefill vs. Decode
	DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (Required)	Gaurav	Khushee
	Splitwise: Efficient generative LLM inference using phase splitting (Optional)
	SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills (Required)	Ronak	Eric
	Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve (Required)	Michael
	Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (Optional)
Oct 4	Parallelism
	AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (Required)	Kerem	Amir
	Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference (Required)	Dongfu	Bishwajit
	LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism (Required)	Benjamin	Wenhao
Oct 11	Scheduling
	Fast Distributed Inference Serving for Large Language Models (Required)	Khushee	Ronak, Yixuan Wang
	Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline (Optional)
	Llumnix: Dynamic Scheduling for Large Language Model Serving (Required)	Eric	Michael
	Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services (Required)	Rui Felipe	Gaurav
	ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference (Optional)
	Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving (Optional)
Oct 25	Faster Decoding + Project Proposal
	SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification (Required)	Sairaj	Hongzhou
	Break the Sequential Dependency of LLM Inference Using Lookahead Decoding (Optional)
Nov 1	Compound AI Systems
	Parrot: Efficient Serving of LLM-based Applications with Semantic Variable (Required)	Amir	Rui Felipe
	Teola: Towards End-to-End Optimization of LLM-based Applications (Required)	Hongzhou	Dongfu
	ALTO: An Efficient Network Orchestrator for Compound AI Systems + Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution (Optional)
	INFERCEPT: Efficient Intercept Support for Augmented Large Language Model Inference (Required)	Ronak	Benjamin
	The Shift from Models to Compound AI Systems (Background)
Nov 8	Invited talks
	Punica: Multi-Tenant LoRA Serving (Invited Talk)	Lequn Chen
	dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving (Optional)
	S-LoRA: Serving Thousands of Concurrent LoRA Adapters (Optional)
	LoRA: Low-Rank Adaptation of Large Language Models (Optional)
	Serving and evaluating LLMs in the wild (Invited Talk)	Hao Zhang
Nov 15	Serving with Retrieval-Augmented Generation
	Prompt Cache: Modular Attention Reuse for Low-Latency Inference (Required)	Xiaodian	Prashanth
	RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation (Required)	Khushee	Kerem
	CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion (Required)	Prashanth	Michael, Rui Felipe
Nov 22	Serving in the Wild
	SpotServe: Serving Generative Large Language Models on Preemptible Instances (Optional)
	ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models (Required)	Eric	Gaurav, Amir
	Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity (Optional)
	Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs (Optional)
	Stateful Large Language Model Serving with Pensieve (Optional)
	Other Important Topics (Cache management, Multi-tenancy, MoE, Fairness, Serving Simulator Design, etc.)
	InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (Required)	Dongfu	Benjamin, Sairaj
	CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving (Optional)
	Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention (Optional)
	Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache (Optional)
	Fairness in Serving Large Language Models (Required)	Wenhao	Kerem
	DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale (Optional)
	MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving (Optional)
	Mixture of LoRA Experts (Optional)
	Vidur: A Large-Scale Simulation Framework For LLM Inference (Optional)
	LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale (Optional)
Nov 29	Final Project Presentations

Policies

Participation

Before Each Lecture: Each lecture has three required readings, which everyone must read and which will be presented in class. Some lectures also list background readings that provide necessary knowledge for the topic, as well as related readings under the same theme; these related readings are optional.

During Lectures: Active participation is crucial both for your own understanding and for the overall quality of the course. You are expected to attend all lectures and, more importantly, to participate in class discussions. Not everyone needs to add something every day, but everyone is expected to have something to share over the semester.

After Lectures: Participation also involves contributing to discussions on Piazza. The group responsible for the summary should initiate the (remaining) discussion, and the rest of the members are encouraged to participate.

Student Lectures

The course will be structured as a seminar. Each student will be assigned at least one paper to present over the semester. Only the assigned students will present in each class. Since there are three required readings, three student presenters will be assigned each day. Each presentation should take around 35-45 minutes of speaking time, not counting interruptions; presenters should expect questions and interruptions throughout. The presenters are free to prepare separate presentations or to work together and merge their presentations. In either case, your presentation must achieve the following goals:

- Provide necessary background and motivate the problem. Note that, in principle, each lecture has a “theme”, so the papers are connected in some way. For instance, they may be trying to solve the same problem using different approaches, or one may build on top of the other. Your presentation should try to make this connection.
- Clearly describe the problem the paper solves and the corresponding challenges.
- Present the high-level idea, approach, and/or insight (using examples, whenever appropriate) in the required reading.
- Discuss technical details so that the audience can understand the key points without carefully reading the paper.
- Explain how the paper differs from related work.
- Identify strengths and weaknesses of the required reading and propose directions for future research.

Note: The slides for a presentation must be emailed to the instructor at least 24 hours before the corresponding class (in *.pptx format). Do not simply reuse slides provided by the paper authors. You may borrow figures and animations with attribution, but your slides should be created independently.

Post-Presentation Panel Discussion

To foster a deeper understanding of the papers and to encourage critical thinking, each lecture (from Lecture 3) will be followed by a panel discussion. This discussion will feature three distinct roles played by different student groups, simulating an interactive and dynamic scholarly exchange.

Roles and Responsibilities

The Authors

The students who present will play the role of the authors.
Responsibility: As authors, you are expected to defend your paper against critiques, answer questions, and discuss how you might improve or extend your research in the future, akin to writing a rebuttal during the peer-review process.

The Reviewers

Additional students will be assigned to be the “reviewers” for the required papers in each class.
Responsibility: Each reviewer will write a detailed review for the paper assigned. The review must be 3-4 pages long. Reviewers critically assess the paper, posing challenging questions and highlighting potential weaknesses or areas for further investigation. Your goal is to engage in a constructive critique of the paper, simulating a peer review scenario.
You must use the review template (review_template.txt in this repository) for your review. Submit your review before the lecture; late submissions will not be accepted.

Rest of the Class

Responsibility:
- During the panel discussions, feel free to actively ask questions and engage in the dialogue.

Project

You will have to complete substantive work on an instructor-approved problem and have original contributions. Surveys are not permitted as projects; instead, each project must contain a survey of background and related work.

You must meet the following milestones (unless otherwise specified in future announcements) to ensure a high-quality project at the end of the semester:

Form a group of 2-4 members by Oct 1. After this date, we will form groups from the remaining students.
Turn in a 2-page draft proposal (including references) by Oct 23. Remember to include the names and email addresses of the group members.
Each group must present the proposal during class hours on Oct 25.
Each group must turn in a 6- to 8-page final report via email on or before Dec 9. The report must be submitted as a PDF file, with formatting similar to that of the papers you've read in the class.

Tentative Grading

Component	Weight
Paper Presentation	25%
Paper Review	15%
Participation	10%
Project Proposal	5%
Project Presentations	15%
Project Report	30%
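
As a rough illustration of how the weights above combine, the following sketch computes a final grade as a weighted average. It assumes each component is scored on a 0-100 scale and that the components are combined linearly; the syllabus does not state either of these, and the `final_grade` function and the example scores are hypothetical.

```python
# Weights copied from the Tentative Grading table (percent of the final grade).
WEIGHTS = {
    "Paper Presentation": 25,
    "Paper Review": 15,
    "Participation": 10,
    "Project Proposal": 5,
    "Project Presentations": 15,
    "Project Report": 30,
}


def final_grade(scores):
    """Return the weighted average of per-component scores.

    Assumption (not stated in the syllabus): every component is scored
    on a 0-100 scale and the components are combined linearly.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example_scores = {
    "Paper Presentation": 90,
    "Paper Review": 80,
    "Participation": 100,
    "Project Proposal": 70,
    "Project Presentations": 85,
    "Project Report": 88,
}

if __name__ == "__main__":
    print(f"Final grade: {final_grade(example_scores):.2f}")
```

Because the weights are tentative, any change to the table would need to be mirrored in `WEIGHTS` so that the values still sum to 100.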

Acknowledgement

The course is heavily inspired by CSE 585: Advanced Scalable Systems for Generative AI (UMich), CS 598: Systems for Generative AI (UIUC), CS8803: Systems for AI - LLMs (Georgia Tech), and 294-162 Machine Learning Systems (UC Berkeley).
