Gang Liao
Mountain View, California, United States
1K followers
500+ connections
About
I build the systems that make ML work at scale — from silicon to software. At Meta…
Activity
1K followers
Experience
Education
-
University of Maryland
Advisor: Daniel Abadi (https://en.wikipedia.org/wiki/Daniel_Abadi)
Ph.D. Thesis: The Evolution of Cloud Data Architectures: Storage, Compute, and Migration
https://drum.lib.umd.edu/handle/1903/29199 -
-
Publications
-
KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta
ISCA'26
Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges - model architecture diversity, kernel primitive diversity, and hardware generation and architecture heterogeneity. This paper presents KernelEvolve-an agentic kernel coding framework-to tackle heterogeneity at-scale for DLRM. KernelEvolve is designed to take kernel specifications as input and automate the process of kernel generation and…
Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges - model architecture diversity, kernel primitive diversity, and hardware generation and architecture heterogeneity. This paper presents KernelEvolve-an agentic kernel coding framework-to tackle heterogeneity at-scale for DLRM. KernelEvolve is designed to take kernel specifications as input and automate the process of kernel generation and optimization for recommendation model across heterogeneous hardware architectures. KernelEvolve does so by operating at multiple programming abstractions, from Triton and CuTe DSL to low-level hardware agnostic languages, spanning the full hardware-software optimization stack. The kernel optimization process is described as graph-based search with selection policy, universal operator, fitness function, and termination rule, dynamically adapts to runtime execution context through retrieval-augmented prompt synthesis. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta's AI accelerators. We validate KernelEvolve on the publicly-available KernelBench suite, achieving 100% pass rate on all 250 problems across three difficulty levels, and 160 PyTorch ATen operators across three heterogeneous hardware platforms, demonstrating 100% correctness. KernelEvolve reduces development time from weeks to hours and achieves substantial performance improvements over PyTorch baselines across diverse production use cases and for heterogeneous AI systems at-scale. Beyond performance efficiency improvements, KernelEvolve significantly mitigates the programmability barrier for new AI hardware by enabling automated kernel generation for in-house developed AI hardware.
Other authorsSee publication -
SFVInt: Simple, Fast and Generic Variable-Length Integer Decoding using Bit Manipulation Instructions
20th International Workshop on Data Management on New Hardware (DaMoN)
The ubiquity of variable-length integers in data storage and communication necessitates efficient decoding techniques. In this paper, we present SFVInt, a simple and fast approach to decode the prevalent Little Endian Base-128 (LEB128) varints. Our approach, distilled into a mere 500 lines of code, effectively utilizes the Bit Manipulation Instruction Set 2 (BMI2) in modern Intel and AMD processors, achieving significant performance improvement while maintaining simplicity and avoiding…
The ubiquity of variable-length integers in data storage and communication necessitates efficient decoding techniques. In this paper, we present SFVInt, a simple and fast approach to decode the prevalent Little Endian Base-128 (LEB128) varints. Our approach, distilled into a mere 500 lines of code, effectively utilizes the Bit Manipulation Instruction Set 2 (BMI2) in modern Intel and AMD processors, achieving significant performance improvement while maintaining simplicity and avoiding overengineering. SFVInt, with its generic design, effectively processes both 32-bit and 64-bit unsigned integers using a unified code template, marking a significant leap forward in varint decoding efficiency. We thoroughly evaluate SFVInt's performance across various datasets and scenarios, demonstrating that it achieves up to a 2x increase in decoding speed when compared to varint decoding methods used in established frameworks like Facebook Folly and Google Protobuf.
Other authorsSee publication -
Bullion: A Column Store for Machine Learning
15th Conference on Innovative Data Systems Research (CIDR), 2025
The past two decades have witnessed columnar storage revolutionizing data warehousing and analytics. However, the rapid growth of machine learning poses new challenges to this domain. This paper presents Bullion, a columnar storage system tailored for machine learning workloads. Bullion addresses the complexities of data compliance, optimizes the encoding of long sequence sparse features, efficiently manages wide-table projections, and introduces feature quantization in storage. By aligning…
The past two decades have witnessed columnar storage revolutionizing data warehousing and analytics. However, the rapid growth of machine learning poses new challenges to this domain. This paper presents Bullion, a columnar storage system tailored for machine learning workloads. Bullion addresses the complexities of data compliance, optimizes the encoding of long sequence sparse features, efficiently manages wide-table projections, and introduces feature quantization in storage. By aligning with the evolving requirements of ML applications, Bullion extends columnar storage to various scenarios, from advertising and recommendation systems to the expanding realm of Generative AI.
Preliminary experimental results and theoretical analysis demonstrate Bullion’s superior performance in handling the unique demands of machine learning workloads compared to existing columnar storage solutions. Bullion significantly reduces I/O costs for deletion compliance, achieves substantial storage savings with its optimized encoding scheme for sparse features, and drastically improves metadata parsing speed for wide-table projections. These advancements position Bullion as a critical component in the future of machine learning infrastructure, enabling organizations to efficiently manage and process the massive volumes of data required for training and inference in modern AI applications.Other authorsSee publication -
Flock: A Low-Cost Streaming Query Engine on FaaS Platforms
In this paper, we present Flock, a cloud-native streaming query engine that leverages the on-demand elasticity of Function-as-a-Service (FaaS) platforms to perform real-time data analytics. Traditional server-centric deployments often suffer from resource under- or over-provisioning, leading to resource wastage or performance degradation. Flock addresses these issues by providing more fine-grained elasticity that can dynamically match the per-query basis with continuous scaling, and its billing…
In this paper, we present Flock, a cloud-native streaming query engine that leverages the on-demand elasticity of Function-as-a-Service (FaaS) platforms to perform real-time data analytics. Traditional server-centric deployments often suffer from resource under- or over-provisioning, leading to resource wastage or performance degradation. Flock addresses these issues by providing more fine-grained elasticity that can dynamically match the per-query basis with continuous scaling, and its billing methods are more fine-grained with millisecond granularity, making it a low-cost solution for stream processing. Our approach, payload invocation, eliminates the need for external storage services and eliminates the requirement for a query coordinator in the data architecture. Our evaluation shows that Flock significantly outperforms state-of-the-art systems in terms of cost, especially on ARM processors, making it a promising solution for real-time data analytics on FaaS platforms.
Other authorsSee publication -
FileScale: Fast and Elastic Metadata Management for Distributed File Systems
Proceedings of the 14th ACM Symposium on Cloud Computing (SoCC)
Recent work has shown that distributed database systems are a promising solution for scaling metadata management in scalable file systems. This work has shown that systems that store metadata on a single machine, or over a shared-disk abstraction, struggle to scale performance to deployments including billions of files. In contrast, leveraging a scalable, shared-nothing, distributed system for metadata storage can achieve much higher levels of scalabil- ity, without giving up high availability…
Recent work has shown that distributed database systems are a promising solution for scaling metadata management in scalable file systems. This work has shown that systems that store metadata on a single machine, or over a shared-disk abstraction, struggle to scale performance to deployments including billions of files. In contrast, leveraging a scalable, shared-nothing, distributed system for metadata storage can achieve much higher levels of scalabil- ity, without giving up high availability guarantees. However, for low-scale deployments – where metadata can fit in memory on a single machine – these systems that store metadata in a distributed database typically perform an order of magnitude worse than systems that store metadata in memory on a single machine. This has limited the impact of these distributed database approaches, since they are only currently applicable to file systems of extreme scale.
This paper describes FileScale, a three-tier architecture that incorporates a distributed database system as part of a comprehen- sive approach to metadata management in distributed file systems. In contrast to previous approaches, the architecture described in the paper performs comparably to the single-machine architecture at a small scale, while enabling linear scalability as the file system metadata increases.Other authorsSee publication -
BullFrog: Online Schema Evolution via Lazy Evaluation
Proceedings of the 2021 International Conference on Management of Data (SIGMOD 2021)
BullFrog is a relational DBMS that supports single-step schema migrations --- even those that are backwards incompatible --- without downtime, and without need for advanced warning. When a schema migration is submitted, BullFrog initiates a logical switch to the new schema, but physically migrates affected data lazily, as it is accessed by incoming transactions. BullFrog's internal concurrency control algorithms and data structures enable concurrent processing of schema migration operations…
BullFrog is a relational DBMS that supports single-step schema migrations --- even those that are backwards incompatible --- without downtime, and without need for advanced warning. When a schema migration is submitted, BullFrog initiates a logical switch to the new schema, but physically migrates affected data lazily, as it is accessed by incoming transactions. BullFrog's internal concurrency control algorithms and data structures enable concurrent processing of schema migration operations with post-migration transactions while ensuring exactly-once migration of all old data into the physical layout required by the new schema. BullFrog is implemented as an open source extension to PostgreSQL. Experiments using this prototype over a TPC-C based workload (supplemented to include schema migrations) show that BullFrog can achieve zero-downtime migration to non-trivial new schemas with near-invisible impact on transaction throughput and latency.
Other authorsSee publication
Other similar profiles
Explore top content on LinkedIn
Find curated posts and insights for relevant topics all in one place.
View top contentOthers named Gang Liao
188 others named Gang Liao are on LinkedIn
See others named Gang Liao