Excited to share that our paper on Implicit Intelligence was accepted to ICML (International Conference on Machine Learning) 2026.

Do AI agents understand us, or just follow instructions? Real requests aren't fully specified. Your phone already holds the context: a dinner reservation, the address in Maps, a text that someone's late, and Do Not Disturb still on. When you ask an agent to "schedule a car pickup", you naturally want it to weave all of this context together, knowing whether it should contact the restaurant to modify the dinner reservation, or when and where to schedule your car.

Today's agents usually don't. We built Implicit Intelligence to measure this gap across 205 real-world scenarios involving accessibility, privacy, and safety. Top models still struggle: Opus 4.6 reaches just 53.2% and GPT 5.2 Pro 48.3%, showing a sizable gap between how these models perform and what users expect on requests grounded in messy, real-world context.

Curious how other models compare, or want to learn more about the methodology behind the benchmark? Check out our leaderboard and paper (link in the comments). Ved Sirdeshmukh Marc Wetter
Labelbox
Software Development
San Francisco, California 36,818 followers
The data factory for leading AI teams
About us
Labelbox builds and operates reinforcement learning data factories for the world’s leading AI labs and enterprises, powering the next generation of frontier models and AI applications.
- Website: https://labelbox.com/
- Industry: Software Development
- Company size: 51-200 employees
- Headquarters: San Francisco, California
- Type: Privately Held
- Founded: 2018
Locations
- Primary: 510 Treat Ave, San Francisco, California 94110, US
Updates
Hiring AI builders: Most AI progress today isn't bottlenecked by models. It's bottlenecked by data and environments. That layer shapes what models learn, how they behave, and whether they improve. It's also the part people sometimes overlook, yet it raises the ceiling on performance, and it's what Labelbox is focused on building.

We're hiring a small group to work at the frontier:
- Forward Deployed Engineers (RL environments)
- Forward Deployed Research Scientists
- FDE Manager

You'll design environments, shape feedback loops, and work alongside leading AI teams pushing what these systems can do. If you want to build the infrastructure that actually determines how state-of-the-art AI systems learn and improve, reach out (JD + links in the comments 👇)
This week, we had the pleasure of hosting 50+ researchers and builders from leading AI companies to meet, talk and socialize (MTS 😎) at Labelbox HQ. Huge thanks to Dwarkesh Patel, Sholto Douglas (Anthropic), Mo Bavarian (OpenAI), and Melvin Johnson (DeepMind) for leading our fireside chat on scaling RL and the pursuit of AGI.
Labelbox reposted this
Keeping customer and workforce data secure is the highest priority at Labelbox, which is why we use a trust platform that is thorough, continuous, and adaptive. https://lnkd.in/gZUpfD2s
🏆 Forbes’ 2026 list of America’s Best Startup Employers is out, and we’re proud to see Labelbox on the list. We’re committed to enabling the next generation of AI by powering the data and evaluation for the world’s most advanced teams. Recognition like this reflects the people building that mission every day. See the full list: https://bit.ly/4u8CumB
Voice agents are evolving from rigid turn-based designs toward continuous, natural conversation, enabling streaming comprehension and generation at the same time. However, most existing benchmarks are either turn-based or latency-focused and do not directly test whether models can maintain reasoning when users interrupt or update objectives mid-utterance. We introduce EchoChain 🔊, a novel benchmark for evaluating reasoning under pressure in full-duplex dialogue.

Key findings:
- Full-duplex models often fail to properly integrate interruption information, in some cases ignoring the interruption entirely.
- A major weakness in today's most advanced models is that they struggle to stay consistent when new input arrives while they're still responding.
- In many cases, a model performs well when it can respond without interruption, but struggles once it's interrupted mid-response.

Check out the full analysis in our blog post. Stay tuned for the arXiv paper as well, which will be released in the coming days. https://lnkd.in/g3QkNZdb
Model safety is often judged by refusal rates on AI safety benchmarks. But what if our evaluations are flagging overtly negative or sensitive language rather than detecting genuine adversarial behavior? In our latest research, we show that when this language is removed, frontier models previously labeled as safe frequently fail, exposing a gap between how model safety is evaluated on benchmarks and how adversarial behavior occurs in the real world.

Key findings:
- AI safety benchmarks are over-reliant on explicit triggering language, provoking model refusals unrealistically.
- Removing these cues significantly degrades safety performance, challenging prior assumptions about the robustness of safety evaluations.
- We found evidence that both internal safety evaluations and safety alignment techniques rely on similar language patterns, further calling the robustness of these evaluations into question.
- Our novel "intent laundering" framework serves as a strong diagnostic and red-teaming tool, exposing where model safety succeeds and where it fails.

Read the full blog post for the complete analysis. https://lnkd.in/g84dywcR
Today, Dario (CEO of Anthropic) x Dwarkesh unpacked where AI is headed, from exponential scaling to what he calls a "country of geniuses in a data center".

A few key takeaways:
- RL is about generalization, not specialization: Like early pretraining, the goal isn't mastering one task, but building rich environments and broad data so models generalize across domains.
- 1–3 years to a "country of geniuses": Dario estimates ~50/50 odds that AI systems collectively match the output of an entire nation of top experts within a few years. Not a single superintelligence, but millions of genius-level systems working in parallel.
- Context as the next unlock: With context windows in the tens of millions of tokens, models could absorb months of workflow in one pass. The goal: steerable, human-aligned systems, as opposed to unchecked autonomous actors.
- Software engineering goes end to end: Models are moving from writing code to executing full engineering cycles: setup, debugging, iteration. Bottlenecks now shift from syntax to judgment.
- Diffusion will lag capability, briefly: Enterprise adoption lags even amid rapid capability growth, but AI can onboard itself via docs, Slack threads, and codebases. By compressing the adoption curve, trillions in AI-driven revenue by 2030 becomes realistic.

Excited to be featured in this conversation, showcasing how we help leading AI teams build high-fidelity RL environments and tighten the iteration loop so models learn from the most informative experiences.
We're excited to share that we've acquired Upcraft to bring AI agents to the heart of how we scale human expertise for frontier AI. Upcraft's AI-powered automation strengthens Alignerr by helping us recruit, engage, and empower a global network of domain experts who train and evaluate the world's most advanced models. As leading AI teams invest billions into post-training and reinforcement learning, expert-generated data has become the true bottleneck for injecting models with the taste and judgment that only deep human expertise can provide. A big welcome to Greg Caplan and the Upcraft team; we look forward to building together. https://lnkd.in/g4rjRNeA
Elon x Dwarkesh x John Collison from Stripe just went live. Their almost three-hour chat (over some Guinness 🍻) dives into what actually limits the next phase of AI and how Elon plans to break through.

A few takeaways from this must-watch episode:
- Space as the next data center: Solar power in orbit is roughly five times more effective than on Earth. Within thirty to thirty-six months, Musk believes space could become the most economically viable location for AI compute, with Starship launching massive power and compute capacity into orbit.
- Humanoid robots as the economic unlock: Optimus could be the ultimate productivity multiplier, potentially expanding the global economy by orders of magnitude. The hardest problem is hands. The endgame is robots that eventually build robots.
- Power as the next bottleneck: Electricity production outside China is flat while compute demand is exploding. Musk says the true scaling wall for AI on Earth is utilities, not just models.
- Debuggability as a safety requirement: Tools that show where a model's reasoning went wrong, trace the origin of errors, or detect potential deception will be essential as AI grows more capable.
- Efficiency as an existential issue: Interest on the national debt now exceeds the military budget. Musk argues that massive productivity gains from AI and robotics are not optional; they are existential.

We're excited to be featured in the conversation, helping leading AI teams scale high-quality robotics and reinforcement learning data so their models learn from the right experiences and reach their full potential.