Epoch AI

Research Services

San Francisco, California · 6,789 followers

Research institute investigating the trajectory of AI

About us

Epoch AI is a multidisciplinary research institute investigating the trajectory of Artificial Intelligence (AI). We scrutinize the driving forces behind AI and forecast its ramifications for the economy and society. We emphasize making our research accessible through our reports, models and visualizations to help ground the discussion of AI on a solid empirical footing. Our goal is to create a healthy scientific environment where claims about AI are discussed with the rigor they merit.

Website
https://epoch.ai
Industry
Research Services
Company size
11-50 employees
Headquarters
San Francisco, California
Type
Nonprofit
Founded
2022
Specialties
AI Governance and AI Forecasting

Locations

  • Primary

    166 Geary St

    STE 1500 #1917

    San Francisco, California 94108, US


Updates

  • We looked at OSWorld, a popular evaluation of AI computer use capabilities. Our findings: tasks are simple, many don't require GUIs, and success often hinges on interpreting ambiguous instructions. The benchmark is also not stable over time.

    OSWorld consists of 361 tasks sourced from forums and tutorials. Models get an Ubuntu VM and task instructions, and write code to interact with the mouse and keyboard. See the second image for one task's starting state. Instructions: "Make a duplicate of the last two slides for me, please."

    Most tasks are realistic but relatively simple, requiring fewer than ten steps (clicks, text inputs, etc.). These tasks take humans only a few minutes to complete.

    Contrary to standard practice, OSWorld is updated continuously. A major update in July changed most tasks, and 10% of task instructions have been updated since then. Furthermore, 10% of tasks rely on live data from the web. This makes it hard to compare results over time.

    OSWorld is about computer use, but many tasks require little use of graphical user interfaces. About 15% can be solved with only the terminal, and a further 30% can rely heavily on Python scripts. We even found cases of models downloading packages to manipulate spreadsheets.

    We also noticed ambiguity in several task instructions. While this *is* realistic, it makes interpreting progress on OSWorld harder. If a model improves, is that because it got better at computer use, or because it got better at guessing the intent of the tasks?

    Finally, we found about 10% of tasks to have serious errors (though this isn't an unusual rate). Some tasks have incorrect answer keys, some evaluation functions are too strict or too lax, and some instructions are fatally ambiguous.

    Read the full report here: https://lnkd.in/dpyVfGEv

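    As an illustration of the GUI-vs-terminal distinction above, here is a minimal sketch of the two styles of solution a model might write. The post doesn't specify OSWorld's actual action interface, so the pyautogui/openpyxl calls, coordinates, and file path below are illustrative assumptions.

        # GUI-style solution: drive the mouse and keyboard directly (coordinates made up).
        import pyautogui
        pyautogui.click(640, 400)            # select the last slide in the panel
        pyautogui.hotkey("ctrl", "c")        # copy it
        pyautogui.hotkey("ctrl", "v")        # paste a duplicate

        # Terminal-style solution: skip the GUI entirely and edit the file with a library.
        import openpyxl
        wb = openpyxl.load_workbook("/home/user/budget.xlsx")   # hypothetical task file
        wb.active["B2"] = 42                                     # apply the requested edit directly
        wb.save("/home/user/budget.xlsx")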
  • We found a bug in our benchmarking code: calls to GPT-5 with "high" reasoning were silently being set to "medium".

    Corrected results: GPT-5 (high) scores slightly higher than GPT-5 (medium) on the benchmarks we run. The two are now tied on the Epoch Capabilities Index (ECI).

    The error was caused by our use of an old version of the Inspect evaluations library. This version ignored "reasoning effort" for OpenAI models unless the model's name began with "o", like o3. The fix was to update our Inspect version. We noticed this in an unrelated investigation.

    Why is GPT-5 (high) tied with GPT-5 (medium) on ECI, and not ahead? ECI uses scores for benchmarks beyond the ones we run. On several of those, GPT-5 (medium) scores well, whereas GPT-5 (high) hasn't been evaluated. This boosts GPT-5 (medium)'s score relative to GPT-5 (high).

    We'll always share mistakes and correct the record. We have updated all of our GPT-5 benchmarking data on our benchmarking hub, as well as all analysis using that data. https://lnkd.in/dSXWFqdR

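    The post doesn't include the offending code, so the following is only a hypothetical sketch of the failure mode it describes; the function and field names are made up, not Inspect's actual internals.

        # Hypothetical sketch: reasoning effort is only forwarded for models whose
        # name starts with "o" (o1, o3, ...), so for "gpt-5" the "high" setting is
        # silently dropped and the provider default ("medium") is used instead.
        def build_request(model: str, reasoning_effort: str | None) -> dict:
            request = {"model": model}
            if reasoning_effort is not None and model.startswith("o"):   # the buggy gate
                request["reasoning_effort"] = reasoning_effort
            return request

        assert "reasoning_effort" not in build_request("gpt-5", "high")  # effort silently lost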
  • We used our new capabilities index, the ECI, to measure the gap between open- and closed-weight models. The result? This gap is smaller than previously estimated.

    On average, it takes 3.5 months for an open-weight model to catch up with the closed-weight SOTA. Put another way, the average gap since 2023 is 7 points on the ECI scale. That's similar to the gap between many closed-weight flagship models and their "lite" versions, like GPT-5 vs GPT-5 mini, or Gemini 2.5 Pro vs Gemini 2.5 Flash.

    You can find more details on our website! https://lnkd.in/d2T9VRyH

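    To make a catch-up figure like "3.5 months" concrete, here is a minimal sketch of one way such a lag can be computed; the dates, scores, and simple matching rule are made-up assumptions, not Epoch's actual method.

        # For each closed-weight SOTA score, find the first later open-weight model
        # that matches or beats it, then average the delays. All data are made up.
        from datetime import date

        closed = [(date(2024, 1, 15), 120.0), (date(2024, 6, 1), 131.0)]   # (release, ECI score)
        open_w = [(date(2024, 4, 20), 121.0), (date(2024, 9, 10), 132.0)]

        lags = []
        for c_date, c_score in closed:
            catch_up = min((o_date for o_date, o_score in open_w
                            if o_date > c_date and o_score >= c_score), default=None)
            if catch_up is not None:
                lags.append((catch_up - c_date).days)

        print(sum(lags) / len(lags) / 30.4, "months average lag")   # ~3.2 for this toy data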
  • Conventional wisdom in AI is that large-scale pretraining needs to happen in massive contiguous datacenter campuses. But is this true?

    Using underutilized generation as a proxy for power availability, we identify a 4,800 km network of 23 sites in the U.S. that could theoretically support a 10 GW distributed AI cluster, helping to alleviate power bottlenecks.

    To use such a distributed cluster for training, we consider fully synchronous data parallelism. In this setup, each site processes a part of each training batch, then all sites synchronize to update model weights in unison. NVIDIA has already tested this strategy, training a Nemotron-4 340B model across two datacenters 1,000 km apart.

    Such a setup is only practical if the time spent synchronizing is sufficiently small compared to the time it takes to process each batch. The batch processing time is in turn determined by the critical batch size, which has been shown to scale with the dataset size. For synchronization, we consider a bidirectional ring all-reduce algorithm, which lets us complete a synchronization with a single round trip around the network. The synchronization time is then determined by the point-to-point network bandwidth and bounded by the network latency.

    Achieving sufficient bandwidth to keep synchronization time under control sounds daunting. We estimate that to train 72T-parameter models we would need over 25x the bandwidth of the MAREA transoceanic fiber cable, the highest-capacity internet cable crossing the Atlantic. However, the main cost of fiber deployment is installation, so increasing bandwidth is cheap compared to the overall cost of datacenter construction.

    The bottom line: conducting large decentralized training runs is feasible without a large increase in either training time or budget. However, distributed clusters have many downsides:

    - more complex permitting processes
    - additional engineering to manage a long-range network connection and reliability
    - constraints on communication-heavy paradigms

    We expect that AI companies will prefer to scale AI campuses as much as they can, and only resort to distributed clusters to go beyond the scale that utilities are willing to provide through the grid. Evidence for this is Microsoft’s Fairwater datacenter in Wisconsin, a planned multi-GW site that will allegedly become “part of a global network of Azure AI datacenters”, meant to “enable large-scale distributed training across multiple, geographically diverse Azure regions”.

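    As a rough check on the synchronization argument above, here is a back-of-the-envelope sketch of ring all-reduce timing. The per-link bandwidth, bf16 precision, and fiber signal speed are assumptions (5 Pbit/s roughly corresponds to the "25x MAREA" figure mentioned above), not numbers from the report.

        # Rough timing model for one synchronization of a ring all-reduce over N sites.
        n_sites    = 23
        grad_bytes = 72e12 * 2              # 72T parameters in bf16 (assumed precision)
        link_bw    = 5e15 / 8               # assumed ~5 Pbit/s per link, in bytes/s
        ring_m     = 4_800e3                # total ring length from the post, in meters
        latency_s  = ring_m / 2e8           # one trip around the ring at ~2/3 c in fiber

        # A standard ring all-reduce moves ~2*(N-1)/N of the payload over each link.
        transfer_s = 2 * (n_sites - 1) / n_sites * grad_bytes / link_bw
        sync_s     = transfer_s + latency_s

        print(f"~{sync_s:.2f} s per synchronization")   # must stay small vs. batch compute time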
  • We've launched a new tool to track AI progress! The tool addresses one of the field's biggest challenges: benchmark saturation. It's called the Epoch Capabilities Index (ECI), and here's what makes it different.

    Individual AI benchmarks saturate quickly, sometimes within months. This makes it hard to track long-term trends. However, by combining scores from different benchmarks, we created a single scale that captures the full range of model performance over time.

    The new index is based on Item Response Theory, a standard statistical framework that allows us to combine benchmarks of varying difficulty and quality. We can even incorporate benchmarks of older models that are no longer evaluated.

    ECI is a relative measure, somewhat akin to Elo scores, that jointly rates model capabilities and benchmark difficulty. Models are more capable if they beat benchmarks, especially difficult ones. Benchmarks are difficult if they stump models, especially capable ones.

    We think ECI is a better indicator of holistic AI capability than any single benchmark. It currently covers models from 2023 on, and it allows us to track trends in capabilities as they emerge.

    Note that the full range of a model's capabilities can't be captured by a single number. ECI tracks how capable a model is across many benchmarks. Specialized models may perform well on individual benchmarks but nevertheless get a low ECI.

    We'll be updating ECI with new models and benchmarks. Our methodology is open source, and we welcome feedback from the research community.

    Check out the ECI on our Benchmarking Hub for interactive visualizations, methodology details, and data downloads: https://lnkd.in/dSXWFqdR

    The Epoch Capabilities Index builds on research done with support and collaboration from @GoogleDeepMind. Keep an eye out for our forthcoming paper!

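    To show the Item Response Theory idea in miniature, here is a sketch of a one-parameter (Rasch-style) fit on a made-up score matrix; the real ECI methodology is more elaborate than this, so treat the update rule and data as assumptions.

        # One-parameter IRT sketch: predicted score of model i on benchmark j is
        # sigmoid(theta_i - b_j). Jointly fit abilities (theta) and difficulties (b)
        # by gradient ascent on a Bernoulli log-likelihood. Scores are made up.
        import numpy as np

        scores = np.array([[0.90, 0.60, 0.20],    # rows: models, cols: benchmarks
                           [0.80, 0.40, 0.10],    # entries: fraction of tasks solved
                           [0.95, 0.80, 0.50]])

        theta = np.zeros(scores.shape[0])         # model abilities
        b     = np.zeros(scores.shape[1])         # benchmark difficulties
        lr = 0.5
        for _ in range(2000):
            p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))   # predicted score matrix
            resid = scores - p                                     # observed minus predicted
            theta += lr * resid.mean(axis=1)
            b     -= lr * resid.mean(axis=0)

        print(np.round(theta, 2), np.round(b, 2))   # higher theta = more capable model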
  • Large language models can imitate reasoning steps and even verify formal proofs. But mathematical physicist Svetlana Jitomirskaya argues they lack folklore knowledge: the implicit priors mathematicians build from experience. Humans can form intuition from a handful of examples; models still need far more data to see the same patterns. Link to video in comments!

  • We're hiring for a new Researcher to join our Data & Trends team. Applications are rolling, so apply soon or refer your friends! The Researcher will help us gather data, conduct research, and write reports. Ideal candidates might have prior research experience in GPU manufacturing and design, data centers, AI infrastructure, geospatial data analysis, or related fields. We’re considering remote candidates, although we’ll have a preference for folks able to work in-person with our Head of Data and Trends in our Berkeley office. https://lnkd.in/dRpvgmVz

    • No alternative text description for this image
  • Stanford mathematician Ravi Vakil, president of the American Mathematical Society, expects AI’s impact on mathematics to come as a phase change, not a slow climb. Every major shift in math has caught experts off guard, he says. This one will be no different, except that all our predictions will be even more wrong. Link to video in comments!

  • We evaluated Claude Haiku 4.5 on several benchmarks. Even with reasoning disabled, Haiku 4.5 performs similarly to or better than early lightweight reasoning models such as o1-mini.

    o1, and its compute-light variant o1-mini, were among the first widely available models explicitly marketed as “reasoning” models. Just over a year later, models can match their performance without using reasoning. This can contribute to faster runtimes: for example, Haiku’s Mock AIME run was ~5x faster than o1-mini's in our setup.

    See more evaluations and trends in AI capabilities in the Epoch AI benchmarking hub! https://lnkd.in/dSXWFqdR

