MLflow

MLflow · 2026-05-27T15:06:40.068Z

Vibe-checking works until it doesn’t. Change one prompt, break three behaviors, and you can’t tell if you moved forward or backward.

In this blog, Jules Damji maps eval-driven development in MLflow: trace → evaluate → version prompts → monitor in prod.

Phase 1️⃣ (Trace): mlflow.openai.autolog() plus manual @mlflow.trace spans, capturing latency, tokens, and cost on every LLM/tool/retrieval step

Phase 2️⃣ (Evaluate + Prompts):
🔹 SME feedback + mlflow.genai.evaluate() with RelevanceToQuery, ToolCallRelevance, Guidelines, and make_judge() for domain rules
🔹 Prompt Registry ties prompts to traces and scores; optimize_prompts (e.g. GepaPromptOptimizer) runs against your eval dataset

Phase 3️⃣ (Prod): Agent dashboards for cost/latency/quality; same judges on live traces as offline

Trace everything. Evaluate in layers. Version your prompts.

Learn more 👉 https://lnkd.in/eJwEDX_8

#MLflow #GenAI #LLMOps #AIEvaluation #Observability #AgenticAI
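The failure mode this post opens with (a prompt change that fixes one behavior while breaking another) can be made concrete with a small sketch. It is plain Python, not MLflow code: `EVAL_SET`, `contains_scorer`, `run_eval`, `compare`, and the two stubbed app versions are all hypothetical, and the deterministic scorer only stands in for the LLM judges the post names.

```python
from statistics import mean

# Hypothetical eval dataset: each case pairs a question with a phrase
# the answer is expected to contain.
EVAL_SET = [
    {"question": "refund window?", "must_contain": "30 days"},
    {"question": "shipping cost?", "must_contain": "free"},
    {"question": "support hours?", "must_contain": "9am"},
]


def contains_scorer(answer, case):
    """Toy deterministic scorer; a real setup would use LLM judges here."""
    return 1.0 if case["must_contain"] in answer else 0.0


def run_eval(predict_fn, dataset, scorer):
    """Return one score per case for a given version of the app."""
    return [scorer(predict_fn(case["question"]), case) for case in dataset]


def compare(baseline_scores, candidate_scores):
    """Report the mean of each run and the indices of cases that got worse."""
    regressed = [
        i
        for i, (base, cand) in enumerate(zip(baseline_scores, candidate_scores))
        if cand < base
    ]
    return {
        "baseline_mean": mean(baseline_scores),
        "candidate_mean": mean(candidate_scores),
        "regressed_cases": regressed,
    }


# Two hypothetical prompt versions, stubbed as canned answers.
def app_v1(question):
    return {
        "refund window?": "Refunds within 30 days.",
        "shipping cost?": "Shipping is free.",
        "support hours?": "We are open all day.",
    }[question]


def app_v2(question):
    return {
        "refund window?": "Refunds are possible.",
        "shipping cost?": "Shipping is free.",
        "support hours?": "Open from 9am to 5pm.",
    }[question]


report = compare(
    run_eval(app_v1, EVAL_SET, contains_scorer),
    run_eval(app_v2, EVAL_SET, contains_scorer),
)
```

In this toy run both versions have the same mean score, yet case 0 regressed. Only a per-case comparison against a fixed dataset shows that the change moved one behavior forward and another backward.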

Software Development

San Francisco, CA 77,742 followers

The open source AI engineering platform for building production-grade agents, LLM applications, and ML models.

About us

MLflow is an open-source platform for managing the complete machine learning lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow supports both traditional ML and generative AI workflows:

- MLflow Tracking: Record and query experiments, including code, data, config, and results. Now with integrated tracing for GenAI workflows across multiple frameworks.
- MLflow Models: Deploy machine learning models in diverse serving environments, with GenAI support for ChatModels and streaming interfaces.
- Model Registry: Store, annotate, discover, and manage models in a central repository.
- MLflow Evaluation: Evaluate model performance using customizable metrics, including LLM-as-judge frameworks and GenAI-specific benchmarks.
- MLflow Deployments: Simplify model deployment and serving across various platforms, with expanded capabilities for hosting large language models.

Subscribe to our luma calendar for updates about meetups, office hours, and other events: https://lu.ma/mlflow

View code on GitHub here: https://github.com/mlflow/mlflow/

To discuss or get help, please join our mailing list: mlflow-users@googlegroups.com

Website: https://mlflow.org/
Industry: Software Development
Company size: 2-10 employees
Headquarters: San Francisco, CA
Type: Nonprofit
Founded: 2018

Locations

Primary

San Francisco, CA, US

Updates

MLflow

2d
You cannot evaluate at scale if conversations are not durable artifacts the team can replay and inspect. This cookbook pulls conversations from a Genie space and logs each as an MLflow trace for inspection and evaluation.

🔹 Conversation ingest: Genie chats land as structured traces
🔹 Collaboration: shared artifacts for QA and data teams
🔹 Foundation for eval: judges and dashboards sit on the same trace store

Implement ingest and tracing first, then layer eval on top.

👉 Read the cookbook: https://lnkd.in/eiUdt_da

#MLflow #Genie #Observability #GenAI #LLMOps
MLflow

3d
In production, there's a trust gap: LLMs can leak PII, produce harmful content, or violate content policies. In this clip, Tomu Hirata (Databricks) explains why application-level filters don't scale, especially as agents make many LLM calls, and why guardrails belong in the AI Gateway instead.

🔹 App-level guardrails are hard to apply consistently across every LLM invocation
🔹 A bug in application code can skip policies entirely
🔹 AI Gateway guards enforce policies in one centralized place, consistently

🎥 Full video: https://lnkd.in/eTdjvdrv

#MLflow #opensource #LLM #AIGateway
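The argument for centralizing guardrails can be illustrated with a short sketch: if every LLM call passes through one wrapper, a policy is applied once and cannot be skipped by an individual call site. This is plain Python and does not use the MLflow AI Gateway API; the `Gateway` class, `redact_pii`, `fake_llm`, and the deliberately simple email pattern are all hypothetical.

```python
import re

# Deliberately simple email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_pii(text):
    """Example guard: mask anything that looks like an email address."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


class Gateway:
    """Hypothetical single entry point that every LLM call goes through."""

    def __init__(self, llm, guards):
        self._llm = llm
        self._guards = guards

    def complete(self, prompt):
        response = self._llm(prompt)
        # Every response passes through every guard, regardless of caller.
        for guard in self._guards:
            response = guard(response)
        return response


def fake_llm(prompt):
    """Stand-in for a real model call; it always leaks an address."""
    return f"Contact alice@example.com about: {prompt}"


gateway = Gateway(fake_llm, guards=[redact_pii])
```

Calling `gateway.complete("billing")` returns the response with the address masked. Application code never sees the raw model output, so it has no opportunity to forget the filter.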

MLflow

1w
A failed evaluation only helps if it becomes an edit someone can ship. This cookbook takes traces that failed evaluation, combines them with Genie space configuration, and uses an LLM to generate copy-paste-ready fixes.

🔹 Failed-trace input: ground suggestions in what actually broke
🔹 Space config context: reduce generic advice that does not apply
🔹 Shorter fix cycle: from red score to concrete patch proposal

Run it on your worst-eval traces this week and review diffs like normal code.

📕 Read the cookbook: https://lnkd.in/eFeFNemt

#MLflow #Genie #GenAI #LLM
MLflow

1w
In this clip from our MLflow 3.12 deep dive, Yuki Watanabe walks through why coding agents need tracing, and what shows up in the trace when you turn it on:

🔹 Every turn, tool call (Read, Bash, Edit), and sub-agent step, not just the final output
🔹 Token usage and latency per span, including cache breakdown
🔹 Full sessions grouped together, so long conversations stay debuggable
🔹 MLflow 3.12 tracing for Claude Code, Codex, Gemini CLI, OpenCode, Qwen Code, and OpenHands

🎥 Full webinar: https://lnkd.in/ezeCAFCQ

#MLflow #CodingAgents #AIObservability #GenAI #LLMOps
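To show why per-span token and latency data grouped by session is useful, here is a minimal aggregation sketch. It does not read from MLflow: the `spans` records and their field names (`session`, `name`, `tokens`, `latency_ms`) are hypothetical and are not MLflow's trace schema.

```python
from collections import defaultdict

# Hypothetical flattened span records; field names are illustrative only.
spans = [
    {"session": "s1", "name": "Read", "tokens": 120, "latency_ms": 40},
    {"session": "s1", "name": "Bash", "tokens": 300, "latency_ms": 900},
    {"session": "s2", "name": "Edit", "tokens": 80, "latency_ms": 55},
]


def summarize_sessions(span_records):
    """Total tokens and latency per session, keeping the order of steps."""
    totals = defaultdict(lambda: {"tokens": 0, "latency_ms": 0, "steps": []})
    for span in span_records:
        session = totals[span["session"]]
        session["tokens"] += span["tokens"]
        session["latency_ms"] += span["latency_ms"]
        session["steps"].append(span["name"])
    return dict(totals)


summary = summarize_sessions(spans)
```

With the steps kept in order per session, a slow or expensive conversation can be traced back to the specific tool call responsible, here the `Bash` step in session `s1`.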

MLflow

1w
Sharing a self-hosted MLflow server across a team used to mean granting permissions one resource at a time, with no central place to manage them. MLflow 3.13.0 adds a full Role-Based Access Control (RBAC) system and a new Admin UI.

🔐 RBAC: Define roles as reusable bundles of permissions, assign them to users, and use workspace-level grants for membership and admin authority. Effective access is the union of a user's roles.
📦 Coverage: Experiments, models, prompts, scorers, and AI Gateway endpoints are all included.
🖥️ Admin UI: Manage users, roles, and grants without touching REST endpoints. Self-service /account page for viewing roles and changing your password.

Run mlflow server with authentication enabled to get started.

✅ Release highlights: https://lnkd.in/g6f4MEHZ

#MLflow #MLOps #LLMOps #OpenSource #AIEngineering
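The rule that effective access is the union of a user's roles can be sketched in a few lines. This is an illustration of the idea only, not MLflow's implementation or API: the role names, the `(resource, action)` permission tuples, and the users are all hypothetical.

```python
# Hypothetical roles, each a reusable bundle of (resource, action) permissions.
ROLES = {
    "viewer": {("experiment", "read"), ("prompt", "read")},
    "prompt_editor": {("prompt", "read"), ("prompt", "edit")},
}

# Hypothetical role assignments.
USER_ROLES = {
    "dana": ["viewer", "prompt_editor"],
    "lee": ["viewer"],
}


def effective_permissions(user):
    """Union of the permissions granted by every role the user holds."""
    permissions = set()
    for role in USER_ROLES.get(user, []):
        permissions |= ROLES[role]
    return permissions


def can(user, resource, action):
    """True if any of the user's roles grants the permission."""
    return (resource, action) in effective_permissions(user)
```

Because the result is a set union, overlapping roles do not conflict: `dana` gets `("prompt", "read")` from both roles but holds it once, and a user with no roles ends up with an empty set.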
MLflow

1w
MLflow 3.13.0 is a major update for running AI observability at scale, focusing on access control, the lifecycle of your trace data, and richer support for agents. Here are the highlights of the release:

🔒 Role-Based Access Control & Admin UI: A full RBAC system with reusable roles and workspace-scoped grants, plus a new web Admin UI for managing users, roles, and permissions on self-hosted MLflow.
💾 Trace Retention & Auto Archival: Automatically move aged trace span data out of your SQL backend into object storage (e.g. S3) while keeping every trace fully readable in the UI and APIs.
🔗 One-click observability & governance for coding agents: Onboard Claude Code, OpenAI Codex, or Gemini CLI to the AI Gateway in one click for tracing, usage tracking, budgets, and guardrails.
🚂 New engines for MLflow Assistant: Run OSS MLflow Assistant on a local Ollama model, the OpenAI Codex CLI, or any MLflow AI Gateway endpoint, in addition to Claude Code.
📊 Helm chart for Kubernetes: An official, production-ready Helm chart for deploying the MLflow tracking server to any Kubernetes cluster.
🤖 Hermes Agent support: Route the Hermes Agent runtime through the AI Gateway and capture its end-to-end traces in MLflow over OpenTelemetry.
📜 Span log levels: Python-logging-style severity levels on spans, with a "Minimum log level" filter in the trace UI to hide low-level noise.

For visual demos of the above features, check out the release post on our website! For a complete list of updates and bug fixes, check out the full changelog.

Share Your Feedback: We'd love to hear about your experience with these new features
👉 GitHub Issues: Report bugs or request features
👉 MLflow Roadmap: See what's coming next and share your ideas
🌟 Star us on GitHub: Show your support for the project

Read the release post here for a quick tour of the features: https://lnkd.in/g6f4MEHZ
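The "Minimum log level" filter described for span log levels behaves like a threshold on Python's standard logging severities. The sketch below shows that idea using the numeric constants from the standard `logging` module. It is not MLflow code: the `spans` records, the `level` field, and `visible_spans` are hypothetical.

```python
import logging

# Hypothetical span records tagged with standard logging severities.
spans = [
    {"name": "tokenize", "level": logging.DEBUG},
    {"name": "call_llm", "level": logging.INFO},
    {"name": "tool_timeout", "level": logging.WARNING},
]


def visible_spans(span_records, minimum_level):
    """Keep only spans whose severity is at or above the chosen minimum."""
    return [span for span in span_records if span["level"] >= minimum_level]


# With the minimum set to INFO, the DEBUG span is hidden.
info_and_above = [s["name"] for s in visible_spans(spans, logging.INFO)]
```

Raising the threshold to `logging.WARNING` would leave only `tool_timeout`, which is the effect of hiding low-level noise while keeping the spans that signal a problem.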
MLflow

1w
Bad SQL is worse when it looks tidy and reads confident. This cookbook scores Genie traces with built-in and custom LLM judges to find quality issues in responses and SQL generation.

🔹 Trace-level judging: score Genie runs at volume, not one-off review
🔹 Built-in judges: baseline coverage on common failure classes
🔹 Custom judges: encode domain-specific SQL and semantics checks

Point judges at your highest-risk traces first.

👉 Read the cookbook: https://lnkd.in/eR5EvTAS

#MLflow #Genie #SQL #GenAI #Evaluation
MLflow

1w
Observability captures every trace, but it does not tell you which conversations went wrong or frustrated users. Automatic Issue Detection in MLflow turns hours of manual review into structured findings in three clicks.

🔹 CLEARS: Pick quality dimensions (Correctness, Latency, Execution, Adherence, Relevance, Safety).
🔹 Detect: LLM samples traces, clusters failures, annotates evidence, summarizes severity and next steps.
🔹 Triage: Pending, Resolved, or Rejected in the MLflow UI (3.11.1+).

🔗 Learn more: https://lnkd.in/gMEf9RbJ

#MLflow #LLMOps #AIObservability #GenAI #MLOps

MLflow

2w
Analytics agents need the same rigor as coding agents when answers depend on SQL and business semantics. This cookbook is a full pipeline for tracing, evaluating, and improving a Databricks Genie space with MLflow.

🔹 Genie tracing: durable record of Genie behavior end to end
🔹 Targeted evaluation: score where SQL or semantics mistakes cost most
🔹 Improvement loop: turn usage into a measured iteration cycle

Pilot on one Genie space before rolling the pattern org-wide.

👉 Read the cookbook: https://lnkd.in/ea5fhARt

#MLflow #Genie #GenAI #DataAnalytics #LLMOps