[{"content":" ","date":"14 December 2025","externalUrl":null,"permalink":"/blog/","section":"Blog","summary":"","title":"Blog","type":"blog"},{"content":"This post gives an overview of our recent paper preprint, Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems, which I authored along with Luka Grbcic, Samuel Williams, and Costin Iancu.\nWe introduce a logical framework for comparing multi-agent PyTorch optimization systems, along with our implementations within it, which we collectively call PyTorch Inference Kernel Evolution (PIKE). The PIKE code has been made open-source on GitHub. We explore the configuration space with the help of OpenEvolve, and we manage to outperform PyTorch\u0026rsquo;s eager execution mode by up to 2.88×!\nGPU Optimization Problem # New generations of AI datacenter GPUs are now being rolled out on an annual basis, forcing software support to play a constant game of catch-up. To make the problem worse, new AI/ML model techniques are being proposed constantly. This leads to a set of workloads that library/compiler engineers are unlikely to optimize for, unless an idea gains a lot of traction from the community.\nWithout excellent library/compiler support, demonstrating good performance for a new idea could mean tons of manual GPU programming. Thus, it\u0026rsquo;s getting harder for AI/ML researchers to challenge conventional wisdom.\nTo name one example, in December 2022, H3 showed the viability of replacing the standard Transformer architecture in language modeling with a hybrid architecture that integrates state space model (SSM) layers. However, achieving competitive performance in their paper required the authors to develop complex, novel GPU kernels. 
Adoption of the idea into modern LLM inference engines took 3 years, mainly because of GPU memory management challenges [vLLM announcement, SGLang announcement].\nCan we find a way to eliminate manual GPU performance engineering from the equation using LLMs, and what would such a system look like?\nOur Solution # Our target GPU was an NVIDIA H100. We used a modified version of KernelBench, a benchmark suite covering a range of machine learning architectures in PyTorch. We targeted these key levels from the suite:\nLevel 3 (Curated blocks from older models): RNNs, Attention, Convolutions\nLevel 5 (Frontier workloads from SOTA 2024 models): DeepSeek-V3, Llama 3, Mamba-2, Hunyuan Video\nMany prior works have shown effective LLM-based optimization systems that target KernelBench tasks, but the dynamics of multi-agent systems for this performance engineering problem remain unexplored. We developed a logical framework to analyze these systems and fill the gap!\nFigure 1: Simplified visual of the problem and our setup\nWe built a robust, performant evaluator that takes PyTorch/CUDA/Triton code, checks it for correctness, runs performance tests, then returns errors and metrics. The idea is that we can plug in an LLM-based system that iteratively improves the performance of the original PyTorch code by querying the evaluator in a loop, then eventually returns the fastest valid solution.\nUsing this setup, we developed a logical framework where evolutionary, LLM-based multi-agent systems can operate. 
Then, we identified some key hyperparameters of these evolutionary strategies that impact the optimization process, such as the explore/exploit ratio, islands, mutation/crossover, and LLM-based error fixing.\nPIKE-B (Branching Search) # We initially developed PIKE-B, a hand-written, exploit-heavy, evolutionary strategy that operates in optimization rounds.\nFigure 2: PIKE-B branching search strategy diagram\nPIKE-B spends a limited number of LLM queries on fixing errors using an error fixing agent (EFA). After a cutoff point, the solutions from the round are ranked by runtime, and the top-k are used as seeds for the next set of LLM queries.\nWe call the strategy \u0026ldquo;exploit-heavy\u0026rdquo; because it concentrates effort on the best existing solutions. As it turns out, this approach works surprisingly well in our results. Prior works have proposed similar approaches, but we are especially interested in why it works so well.\nPIKE-O (OpenEvolve) # To explore the hyperparameter space and understand what makes PIKE-B an effective strategy, we used OpenEvolve, an open-source framework for LLM-based code evolution inspired by AlphaEvolve. OpenEvolve made it simple to tune hyperparameters of the optimization process, and it fits cleanly into our logical framework.\nImportantly, OpenEvolve contains:\nisland-based evolution\nexplore/exploit ratio settings\nsolution library configuration settings\nWe modified OpenEvolve to incorporate LLM-based error fixing. We then applied a range of OpenEvolve configurations to KernelBench to better understand agent framework behavior.\nResults # We measured the performance of PIKE solutions against the original models (with PyTorch eager). We measured cost in two ways: LLM query count per task, and dollar cost of LLM queries per task.\nSpeedup Trajectories # For each PIKE implementation, we used the best solution generated within the current budget per task. 
We did this up to a budget of 300 queries per task, or around $25/task on Level 3. Gemini 2.5 Pro was used everywhere, except for the cheap error fixing agent (EFA), where we used Gemini 2.5 Flash. Keep in mind, EFA uses a portion of the LLM budget too.\nFigure 3: Geomean speedups over PyTorch eager for our filtered Level 3, varying budget per task\nThe default PIKE-O approach is quite explore-heavy. We ran a series of ablations to make it functionally equivalent to PIKE-B, shown in PIKE-O (mut,npar,1isl,EO,SL).\nInterestingly, approaches without EFA do poorly, relative to those with EFA. Cheap EFA is cost effective here, and exploit-heavy strategies offer the best performance gains.\nOverall Speedups and Ablations # We ran ablations to dig deeper into multi-agent behavior for the task at hand. Note: we don\u0026rsquo;t evaluate all combinatorial versions, since end-to-end runs are expensive and time-consuming.\nFigure 4: Speedups over PyTorch eager for our filtered Level 3 are shown above. The full bar shows a budget of 300 LLM queries per task, and the dashed lines show $25/task.\nThe series of PIKE-O ablations shifts PIKE-O from being an explore-heavy strategy towards being an exploit-heavy strategy, with PIKE-O (mut,npar,1isl,EO,SL) being virtually equivalent to PIKE-B without IBA (initial brainstorming agent). As we should expect, this PIKE-O variant displays very similar performance to PIKE-B and PIKE-B (no IBA).\nThe \u0026ldquo;1isl\u0026rdquo; parameter changes PIKE-O from having 3 islands to just 1 island, and a notable change in performance shows up at that ablation step. Reducing the evolutionary exploration to one island makes this an exploit-focused process, and we can see this leads to a big performance boost. Clearly, worrying about early convergence is not worth it in our budget range!\nFigure 5: Speedups over PyTorch eager for Level 5 are shown above for a limited set of our implementations. 
The full bar shows a budget of 300 LLM queries per task, and the dashed lines show $50/task.\nSimilar to our Level 3 results, PIKE implementations perform much better than PyTorch eager and other competitors, including ~2× over torch.compile! The difference between PIKE-O variants is more subtle on these challenging tasks.\nKey Takeaways # Based on results shown here and in the paper:\nit is worth it to spend some of your budget on error fixing agents\nusing a cheap error fixing LLM can be cost effective, alongside a more powerful model for actual code optimization\nexploit-heavy strategies (e.g. reducing island count to 1) are favorable under the budget we explored\nperformance correlates with the granularity of optimization steps (see the paper for more details)\nWe\u0026rsquo;ve demonstrated that LLM-based multi-agent systems show great promise in mitigating the GPU optimization dilemma. More importantly, we\u0026rsquo;ve learned a bit about how to characterize multi-agent systems, and why certain configurations perform better than others! If you found our work useful, consider citing the paper:\n@misc{nagaitsev2025pike, title={Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems}, author={Kirill Nagaitsev and Luka Grbcic and Samuel Williams and Costin Iancu}, year={2025}, eprint={2511.16964}, archivePrefix={arXiv}, primaryClass={cs.MA}, url={https://arxiv.org/abs/2511.16964}, }\nNote: a similar version of this article has been cross-posted on the OpenEvolve blog\n","date":"14 December 2025","externalUrl":null,"permalink":"/blog/llm-agents-outperform-pytorch-compiler/","section":"Blog","summary":"","title":"How LLM-Based Agents Outperform the PyTorch Compiler by 2×","type":"blog"},{"content":"I am a computer science Ph.D. candidate and DOE CSGF fellow at Northwestern University, advised by Peter Dinda as part of the Prescience Lab. My research interests mainly lie in compiler support for parallel computing and machine learning. 
Recently, my work has focused on understanding and improving LLM-based systems for performance engineering tasks (e.g. PyTorch inference optimization).\nI received my undergraduate degree in computer science from University of Chicago, where I was advised by Ian Foster and Kyle Chard as part of Globus Labs. While there, I was involved in FaaS (Function as a Service) research, and I helped develop the Globus Compute platform (previously known as funcX).\nI\u0026rsquo;m a big advocate of open source software, with substantial contributions to the widely used webpack-dev-server (10M+ weekly downloads on npm). I started contributing as part of the webpack GSoC (Google Summer of Code) program in 2019, and later worked as a freelance webpack-dev-server maintainer.\nIn my free time, I enjoy game development, and have published a number of popular multiplayer games [Oldwest.io, Cavegame.io, MineRoyale.io]. I\u0026rsquo;m also generally interested in operating systems, networking, and system security outside of my normal research. You can read more about my work and other miscellaneous interests on my blog!\nPublications # Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems Kirill Nagaitsev, Luka Grbcic, Samuel Williams, Costin Iancu 2025-11-21 Preprint Eliminating Hardware Interrupts with Dispersed Interrupt Polling Kirill Nagaitsev, Kevin McAfee, Kevin Hayes, Justin Dong, Nadharm Dhiantravan, Peter Dinda 2025-09 Tech Report Village: From High Level Parallelism to High Performance Kirill Nagaitsev, Griffin Dube, Karl Hallsby, Peizhi Liu, Qinze Jiang, Lucas Myers, David Krasowska, Alexander Butler, Ruiqi Xu, Peter Dinda 2025-08 Tech Report funcX: Federated Function as a Service for Science Zhuozhao Li, Ryan Chard, Yadu Babuji, Ben Galewsky, Tyler Skluzacek, Kirill Nagaitsev, Anna Woodard, Ben Blaiszik, Josh Bryan, Daniel S. 
Katz, Ian Foster, Kyle Chard 2022-09-22 IEEE TPDS ","date":"14 December 2025","externalUrl":null,"permalink":"/","section":"Kirill Nagaitsev","summary":"","title":"Kirill Nagaitsev","type":"page"},{"content":"","date":"1 October 2023","externalUrl":null,"permalink":"/projects/","section":"","summary":"","title":"","type":"projects"},{"content":"","date":"1 October 2023","externalUrl":"https://oldwest.io","permalink":"/projects/oldwest-io/","section":"","summary":"","title":"Oldwest.io","type":"projects"},{"content":"","externalUrl":null,"permalink":"/authors/","section":"Authors","summary":"","title":"Authors","type":"authors"},{"content":"","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"},{"content":"","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"}]