BentoML (Acquired by Modular)

Software Development

San Francisco, California · 10,672 followers

🍱 Inference Platform built for speed and control. Acquired by Modular.

About us

BentoML is an enterprise-grade inference platform for deploying and managing AI models at scale. It offers full control without the complexity, allowing teams to serve any model, including LLMs, embeddings, and agentic pipelines, across VPC, on-prem, or hybrid environments with tailored optimization, advanced orchestration, and fine-grained performance tuning. From prototype to production, BentoML covers the full inference lifecycle with instant model deployments, elastic autoscaling, built-in observability, compliance-ready features, and mission-critical reliability, freeing your team to deliver AI that drives real business outcomes faster.

Website
https://www.bentoml.com
Industry
Software Development
Company size
51-200 employees
Headquarters
San Francisco, California
Type
Privately Held
Founded
2019
Specialties
Model Serving, Model Inference, Inference Platform, Compound AI Systems, Multimodality, AI Inference, LLM Inference, LLM Applications, MLOps, LLMOps, and InferenceOps

Locations

  • Primary

    650 California St

    6 fl

    San Francisco, California 94108, US

Updates

  • Gemma 4 is already live on Modular Cloud ⚡ Day-zero support, fastest performance across both NVIDIA and AMD, powered by MAX. One unified system, from kernel to cloud. No waiting on upstream kernels, no fragmentation, just consistent performance and portability. If you’re evaluating Gemma 4 for production, this is worth a look.

    Modular · 26,667 followers

    Gemma 4 is live on Modular Cloud, day zero, with the fastest performance on both NVIDIA and AMD. Our MAX inference framework delivers 15% higher throughput vs. vLLM on B200, and we’re the only inference provider to ship Google DeepMind's Gemma 4 on a framework we built ourselves. Both flagship models available now: → Gemma 4 31B: dense, 256K context, built for deep reasoning → Gemma 4 26B A4B: MoE, 26B params, 4B active per forward pass Both natively multimodal: text, images, and video. Modular Cloud runs on MAX, our inference framework that unifies GPU kernels, graph compilation, and high-performance serving in a single hardware-agnostic stack. When a new architecture drops, we're not waiting on upstream support or porting hand-tuned kernels. We went from new weights to SOTA performance on two hardware platforms in days. No other inference provider is shipping Google DeepMind's Gemma 4 on a framework they built themselves, and we're the only team serving it across multiple GPU stacks. NVIDIA B200 or AMD MI355X. Same stack, same API. Pick the price-performance point that fits your workload on Modular Cloud. Try Google's Gemma 4: https://lnkd.in/gxGVP4MA What model are you trying first? #Gemma4

  • Image generation becomes real-time and radically cheaper. MAX now brings FLUX.2 image generation directly into the same stack used for text and audio. ✅ ~4× faster than torch.compile (Diffusers) ✅ Sub-second generation at production quality ✅ Up to 5.5× TCO advantage on AMD MI355X ✅ 99% cheaper per image than Nano Banana Pro This isn’t just a speedup. It fundamentally changes what’s possible: real-time image workflows, interactive UX, and cost structures that make large-scale generation viable.

    Generate images in less than 1 second. 99% cheaper than Nano Banana. 🚀 😱 Super excited by the latest 26.2 release that ships FLUX.2 image generation with a 4.1x speedup over torch.compile on NVIDIA Blackwell - translating to a 5.5x TCO advantage with AMD MI355X. Also, an updated website and Cloud are coming 🔥 🤫 We also dropped AI coding skills that plug directly into Claude, Cursor, and Codex for writing portable GPU kernels across any hardware target and a lot more! Check out the comments ⬇️

  • BentoML (Acquired by Modular) reposted this

    We just acquired BentoML 🚀 When we started Modular, our thesis was simple: the AI infrastructure stack is broken. We built Mojo and MAX to fix the bottom of the stack - hardware-aware optimization that's portable across NVIDIA, AMD, and beyond. But optimization without production serving is like building a Ferrari engine and shipping it without a chassis. We've been working with BentoML for a while, and we were so impressed and excited about what Chaoyu Yang, Sean Sheng, and the entire team have achieved. They share our vision for a hypervisor for AI compute - and they've proven they can roll production serving at scale. BentoML has: 🔹 10,000+ organizations that trust it to deploy models at scale 🔹 50+ Fortune 500 companies running production inference through it 🔹 An Apache 2.0 open source project - battle-tested in real infrastructure, not just demos Here's why this combination is different: 🔸 Full-stack control. When you own optimization + runtime + serving, you can make architectural decisions no point solution can. This isn't an integration - it's a new category. 🔸 Real portability. Deploy across NVIDIA and AMD without rebuilding your serving stack. No more hardware-specific deployment code. 🔸 Enterprise BYOC done right. Your cloud. Your VPC. Your security posture. Our optimization. For existing BentoML users - nothing breaks. The open source project continues under Apache 2.0. Same docs, same community, same contribution process. We're hiring aggressively to build Modular Cloud. If building at-scale AI infrastructure is your thing, come talk to us. Announcement in the comments 👇🏼 #AI #MLOps #Infrastructure #OpenSource #Modular #BentoML #AIInfrastructure

  • 🚀 BentoML is joining Modular. Together, we’re making high-performance inference easier to run in production. It’s fast, portable, and free from hardware lock-in. What this partnership enables: ✅ Optimize and serve models in a single workflow ✅ Run across NVIDIA, AMD, and future accelerators without rewrites ✅ Deploy in your own infrastructure with modern performance ✅ Tailor performance to your use case faster with one unified stack Learn more 👉 https://lnkd.in/gGDbwSkJ

  • 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗲 𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲. 𝗗𝗼𝗻’𝘁 𝗷𝘂𝘀𝘁 𝘀𝗰𝗮𝗹𝗲 𝗚𝗣𝗨𝘀. As workloads scale, inference becomes the dominant bottleneck. TTFT creeps up. Decode slows on long contexts. KV cache pressure caps concurrency. GPU spend rises faster than throughput. Adding more hardware isn’t the way out. The real leverage comes from targeted inference optimization. We just published a guide breaking down 6 production-tested strategies teams are using to stabilize latency, increase throughput, and cut cost per token: - Smarter batching (static, dynamic, and continuous) - Speculative decoding & PD disaggregation - KV cache optimizations like prefix caching and cache-aware routing - Attention and memory optimizations to reduce fragmentation - Data, tensor, pipeline, expert, and hybrid parallelism - Offline batch inference for non-interactive workloads Most production systems don’t have a single bottleneck. TTFT, decode speed, cache pressure, and parallelism interact and the highest-impact optimizations depend on where your system is actually breaking. Learn more: https://lnkd.in/gU5XjWDJ
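To make "KV cache pressure caps concurrency" concrete, here is a rough back-of-the-envelope sketch. It uses the standard per-token KV cache size (keys plus values, for every layer and KV head). The model dimensions and memory budget in the usage note are hypothetical, and the estimate ignores weights, activations, fragmentation, and paged allocation, so treat it as an upper bound rather than a sizing tool.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The factor of 2 accounts for storing both a key and a value vector.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_sequences(cache_budget_bytes: int,
                             bytes_per_token: int,
                             context_len: int) -> int:
    """Upper bound on sequences that fit if each one fills `context_len` tokens."""
    bytes_per_sequence = bytes_per_token * context_len
    return cache_budget_bytes // bytes_per_sequence
```

For a hypothetical model with 32 layers, 8 KV heads, a head dimension of 128, and 2-byte values, each token costs 131,072 bytes (128 KiB). With a hypothetical 40 GiB left over for cache and 8,192-token contexts, at most 40 full-length sequences fit at once, which is why longer contexts cut concurrency so quickly.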

  • At scale, LLM inference is a 𝗺𝘂𝗹𝘁𝗶-𝗼𝗯𝗷𝗲𝗰𝘁𝗶𝘃𝗲 𝗼𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗽𝗿𝗼𝗯𝗹𝗲𝗺, 𝗻𝗼𝘁 𝗮 𝘀𝗶𝗻𝗴𝗹𝗲-𝗻𝘂𝗺𝗯𝗲𝗿 𝗰𝗼𝗻𝘁𝗲𝘀𝘁. However, many teams still evaluate LLMs using two headline numbers: throughput and cost per million tokens. They’re simple, comparable, and almost always misleading. - A model that looks fast in a benchmark can stall under real concurrency. - A setup that looks cheap can drive 2–3× overspend once latency SLOs matter. - A configuration tuned for synthetic prompts can quietly degrade quality in real workflows. What actually helps determine success in production: - TTFT, ITL and P99 latency, not just throughput - Concurrency behavior and scheduling, not batch-optimized demos - KV cache pressure and memory layout, not warm-cache benchmarks - Precision trade-offs, where speed gains can quietly erode reasoning quality - End-to-end pipeline behavior, not isolated model tests We just published a deep dive on how enterprise teams should evaluate LLM inference and systematically balance speed, cost, and quality 👉 https://lnkd.in/g5W7WaR7
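As an illustration of the latency metrics named above, the sketch below derives TTFT, mean ITL, and a nearest-rank percentile (for example P99) from per-request token timestamps. The `RequestTrace` type and its field names are hypothetical; real serving stacks expose these measurements in their own formats.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical record of one streamed request (times in seconds)."""
    sent_at: float            # when the request was sent
    token_times: List[float]  # arrival time of each generated token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between sending and the first token."""
    return trace.token_times[0] - trace.sent_at


def mean_itl(trace: RequestTrace) -> float:
    """Mean inter-token latency: average gap between consecutive tokens."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=99 for P99."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Collecting `ttft(t)` over many requests and calling `percentile(samples, 99)` gives a tail-latency view that a single throughput number hides.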

  • 𝗔𝗜 𝗶𝗻𝗳𝗿𝗮𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲 𝗶𝘀 𝘀𝗵𝗶𝗳𝘁𝗶𝗻𝗴 𝗳𝗮𝘀𝘁 𝗮𝗻𝗱 𝗺𝗼𝘀𝘁 𝗲𝗻𝘁𝗲𝗿𝗽𝗿𝗶𝘀𝗲 𝘀𝘁𝗮𝗰𝗸𝘀 𝗮𝗿𝗲𝗻’𝘁 𝗸𝗲𝗲𝗽𝗶𝗻𝗴 𝘂𝗽. Modern inference means more than just putting a model behind an endpoint. It’s also about how well your infrastructure handles flexibility, distribution, and constant change. Enterprise teams are feeling three structural pressures at the same time: - Compute flexibility: GPU supply is volatile and single-cloud/provider strategies mean fragility - Distributed inference: multi-model pipelines, long contexts, KV cache growth, and agentic workflows break traditional autoscaling - Speed of change: new models and inference techniques arrive faster than rigid stacks can absorb These pressures don’t fail loudly. They compound quietly, showing up as rising costs, brittle deployments, slow iteration, and missed launches. Leading teams are responding with clear infrastructure shifts: - Multi-cloud and hybrid orchestration as a baseline, not an edge case - Intelligent scheduling based on workload shape, not static provisioning - Distributed inference patterns like prefill–decode disaggregation and KV-aware routing - InferenceOps as a first-class discipline, with reproducible builds and unified observability - Unified serving foundations across LLMs, CV, RAG, and multimodal systems The takeaway is simple: the goal isn’t to chase every new optimization, it’s to build infrastructure that can absorb change without constant rewrites. Learn more 👉 https://lnkd.in/g-yWc74H

  • Most teams are still guessing their way through LLM inference. Default configs. Trial-and-error batching. Overprovisioned GPUs. Missed SLAs. This blog post from Josh Longenecker and Mohammad Tahsin shows what production-grade inference should actually look like: A closed-loop, data-driven workflow that systematically finds the optimal inference configuration instead of relying on guesswork. Using llm-optimizer, they demonstrate how to: - Model theoretical performance limits with roofline analysis - Systematically benchmark real configurations - Visualize Pareto frontiers across latency & throughput - Deploy the winning config directly into production endpoints The result is up to 41% lower latency and 2.4× higher throughput 🚀 If you’re running serious LLM workloads, this is a must read. Learn more about llm-optimizer here: https://lnkd.in/ga4QBpch

    🚀 New Blog: Stop leaving LLM inference performance on the table I’m excited to share my latest post on the AWS Machine Learning Blog: “Optimizing LLM Inference on Amazon SageMaker AI with BentoML LLM Optimizer” Here’s what drives me about this topic: your workload is unique, and your infrastructure should work for YOU. Too many teams deploy LLMs with default configs and accept whatever performance they get. But: ∙ Your traffic patterns aren’t generic ∙ Your latency requirements aren’t average ∙ Your hardware capabilities aren’t being fully utilized In this post, I walk through how to systematically optimize inference configurations with the LLM-optimizer tool to match YOUR specific needs - achieving up to 41% latency reduction and 2.4x throughput improvements. Understand your workload, tune your configs, and make your endpoints work as hard as you do. Read the full post: https://lnkd.in/e_KQutf4 What inference optimization challenges are you facing? Special thanks to my awesome co-author Mohammad Tahsin and to Felipe Lopez for the review and collaboration! #MachineLearning #LLM #AI #AWS #SageMaker #MLOps
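The "Pareto frontier across latency and throughput" idea from the post above can be sketched in a few lines. This is a generic illustration, not llm-optimizer's API: the `BenchResult` type, its fields, and the sample numbers in the usage note are all hypothetical. A configuration stays on the frontier if no other configuration is at least as good on both axes and strictly better on one.

```python
from typing import List, NamedTuple


class BenchResult(NamedTuple):
    """Hypothetical benchmark result for one inference configuration."""
    name: str
    latency_ms: float      # lower is better
    throughput_tps: float  # higher is better


def dominates(a: BenchResult, b: BenchResult) -> bool:
    """True if `a` is no worse than `b` on both axes and better on at least one."""
    no_worse = (a.latency_ms <= b.latency_ms
                and a.throughput_tps >= b.throughput_tps)
    strictly_better = (a.latency_ms < b.latency_ms
                       or a.throughput_tps > b.throughput_tps)
    return no_worse and strictly_better


def pareto_frontier(results: List[BenchResult]) -> List[BenchResult]:
    """Return the non-dominated results, sorted by latency."""
    frontier = [r for r in results
                if not any(dominates(other, r) for other in results)]
    return sorted(frontier, key=lambda r: r.latency_ms)
```

Given hypothetical results A (100 ms, 1000 tok/s), B (120 ms, 900 tok/s), C (200 ms, 2000 tok/s), and D (50 ms, 400 tok/s), B drops out because A beats it on both axes, while D, A, and C remain as genuine trade-offs to choose between based on the latency SLO.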

  • 🎄 𝗢𝗽𝗲𝗻-𝘀𝗼𝘂𝗿𝗰𝗲 𝗟𝗟𝗠𝘀 𝗷𝘂𝘀𝘁 𝗵𝗮𝗱 𝗮𝗻 𝗶𝗻𝗰𝗿𝗲𝗱𝗶𝗯𝗹𝗲 𝘆𝗲𝗮𝗿. They learned to reason, code like caffeinated wizards, run agents, and do it all for less than your monthly oat-milk latte ☕💸 Meanwhile, proprietary models kept getting smarter and also gently reminded us how much convenience, limited control & customization are worth on the invoice 😏. We updated our 2026 Open-Source LLM Guide with everything that actually matters now: - the top open-source LLMs today - how big the open vs closed gap really is - which models to use for agents, coding, reasoning, and chat - how to run and optimize them efficiently in production - how to build competitive LLM applications If you’re building LLM applications in 2026, this is your cheat code 👉 https://lnkd.in/gsxQupTf 🎅 Merry Christmas and happy self-hosting!
