Ought

Ought has spun off Elicit

Owain Evans, Paul Christiano, Owen Cotton-Barratt — Mon, 25 Sep 2023 00:00:00 GMT

Ought has spun off Elicit as a public benefit corporation, selling its IP and transferring most of its staff. The decision to sell the IP was made by an independent board of directors (Owain Evans, Paul Christiano, and Owen Cotton-Barratt) with no conflicts of interest in the new entity. The board was advised by nonprofit lawyers and aided by an independent valuation of the IP.

We believe that this decision is the best way to pursue Ought’s mission, by allowing Elicit to raise funding and scale up its efforts. We also believe that, considered narrowly from Ought’s side, it is a prudent financial decision, given the value of the equity stake Ought now holds in Elicit.

Elicit will share Ought’s mission: ensuring that continuing progress in ML can be used to help humans arrive at good answers to challenging questions. We believe that it may be important to keep this capability as advanced as technologically feasible at any moment in time.

Elicit has now raised $9 million in funding, and most former Ought staff are working at the new organization. To learn more about the organization’s plans you can follow the Elicit blog. We expect operations at Ought to be minimal for the next few years. If Elicit performs well and returns profits to its investors then Ought may resume activities.

By the Ought Board of Directors

AI Safety Needs Great Product Builders

James Brady — Wed, 02 Nov 2022 00:00:00 GMT

In his AI Safety Needs Great Engineers post, Andy Jones explains how software engineers can reduce the risks of unfriendly artificial intelligence. Even without deep ML knowledge, these developers can work effectively on the challenges involved in building and understanding large language models.

I would broaden the claim: AI safety doesn’t need only great engineers – it needs great product builders.

This post will describe why, list some concrete projects for a few different roles, and show how they contribute to AI going better for everyone.

Audience

This post is aimed at anyone who has been involved with building software products: web developers, product managers, designers, founders, devops, generalist software engineers, … I’ll call these product builders.

Non-technical roles (e.g. operations, HR, finance) do exist in many organisations focussed on AI safety, but this post isn’t aimed at them.

But I thought I would need a PhD!

In the past, most technical AI safety work was done in academia or in research labs. This is changing because – among other things – we now have concrete ideas for how to construct AI in a safer manner.

However, it’s not enough for us to merely have ideas of what to build. We need teams of people to partner with these researchers and build real systems, in order to:

Test whether they work in the real world.
Demonstrate that they have the nice safety features we’re looking for.
Gather empirical data for future research.

This strand of AI safety work looks much more like product development, which is why you – as a product builder – can have a direct impact today.

Example projects, and why they’re important

To prove there are tangible ways that product builders can contribute to AI safety, I’ll give some current examples of work we’re doing at Ought.

For software engineers

In addition to working on our user-facing app, Elicit, we recently open-sourced our Interactive Composition Explorer (ICE).

ICE is a tool to help us and others better understand Factored Cognition. It consists of a software framework and an interactive visualiser:

On the back-end, we’re looking for better ways to instrument the cognition “recipes” such that our framework stays out of the user’s way as much as possible, while still giving a useful trace of the reasoning process. We’re using some meta-programming, and having good CS fundamentals would be helpful, but there’s no ML experience required. Plus working on open-source projects is super fun!
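As a rough illustration of what instrumenting a recipe can mean, here is a minimal sketch of a decorator that records nested calls as a tree. It is not ICE's implementation or API: `traced`, `TraceNode`, `answer`, and `answer_subquestion` are hypothetical names, and the language model call is replaced by a stub.

```python
import functools
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class TraceNode:
    """One recorded call: function name, inputs, result, and nested sub-calls."""
    name: str
    args: tuple
    result: Any = None
    children: List["TraceNode"] = field(default_factory=list)


_stack: List[TraceNode] = []  # calls currently in progress; last entry is the active parent
roots: List[TraceNode] = []   # top-level calls, in the order they started


def traced(fn: Callable) -> Callable:
    """Wrap `fn` so that each call is attached to the trace tree."""
    @functools.wraps(fn)
    def wrapper(*args):
        node = TraceNode(name=fn.__name__, args=args)
        (_stack[-1].children if _stack else roots).append(node)
        _stack.append(node)
        try:
            node.result = fn(*args)
            return node.result
        finally:
            _stack.pop()
    return wrapper


@traced
def answer_subquestion(question: str) -> str:
    # Stand-in for a language model call (assumption, not a real API).
    return f"answer to {question!r}"


@traced
def answer(question: str) -> str:
    parts = [answer_subquestion(f"{question} (part {i})") for i in (1, 2)]
    return " / ".join(parts)


def render(node: TraceNode, depth: int = 0) -> List[str]:
    """Flatten the tree into indented lines, one per call."""
    lines = [f"{'  ' * depth}{node.name}{node.args}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines
```

The recipe functions themselves stay ordinary Python; only the decorator knows about tracing, which is one way for a framework to stay out of the user's way.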

If you are more of a front-end developer, you’ll appreciate that representing a complex deductive process is a UX challenge as much as anything else. These execution graphs can be very large, cyclic, oddly and unpredictably shaped, and each node can contain masses of information. How can we present this in a useful UI which captures the macro structure and still allows the user to dive into the minutiae?

This work is important for safety because AI systems that have a legible decision-making process are easier to reason about and more trustworthy. On a more technical level, Factored Cognition looks like it will be a linchpin of Iterated Distillation and Amplification – one of the few concrete suggestions for a safer way to build AI.

For product managers

At first, it might not be obvious how big of an impact product managers can have on AI safety (the same goes for designers). However, interface design is an alignment problem – and it’s even more neglected than other areas of safety research.

You don’t need a super technical background, or to already be steeped in ML. The competing priorities we face every day in our product decisions will be fairly familiar to experienced PMs. Here are some example trade-offs that we regularly navigate for our app, Elicit:

Deliver terse, valuable insights to users

Simple answers and compact summaries make our product feel magical and save users’ time.

vs.

Expose the inner-workings of the system

Revealing what is happening under the covers makes our product more trustworthy.

Follow familiar product paradigms

Users get more immediate value from interfaces which feel familiar.

vs.

Imagine radical new workflows

Language models offer opportunities for novel interaction styles which might be more valuable in the medium-term.

Get users quick answers to their questions

Quick answers let our users power through well-defined tasks.

vs.

Help users carefully navigate a complex task

Perhaps we offer more lasting value by acting as a reasoning assistant – it certainly better fits our overall mission.

Compensate for limitations in today’s language models

Language models have various known limitations, and our current product is valuable to users when we work around those limitations.

vs.

Build functionality which scales to powerful future models

Limitations are going to retreat and change with every improvement to language models. We want our work to be super-charged by change, rather than deprecated by it.

In my opinion, the main thing which makes product management at Ought different from other places is that we are working on the frontier of new technology. We regularly dream up features which turn out to just not be possible – even using today’s most powerful models. Our product direction isn’t only informed by our strategy and our users, it’s also influenced by what can be realised on the cutting edge of machine learning.

Great product managers can have an enormous impact on AI safety by helping us find the right balance between:

Proving our app is useful, and
Proving our approach is safer

If we lean too much towards adding crowd-pleasing widgets to our product, we won’t make enough progress on our mission to prove out process-based systems. On the other hand, if we lean too much towards researching process-based systems, we will – at best – prove they’re theoretically interesting rather than actually useful.

For infrastructure engineers

Because of the nature of the research we’re doing and the design of the product we’re building, we’re hitting a bunch of different ML APIs to do different jobs in different places.

At the moment, we don’t have the infrastructure to record all of these interactions with 3rd party and internal models. We’d like to insert an abstraction layer between our code and the ML models to achieve a few things:

The ability to fall back to alternatives when a model goes down or has increased latency (this is still quite common with 3rd party models).
Build up datasets of task/result pairs, which we can then use to train our own models in the future. These training data would, by their nature, be representative of the queries our users are interested in. We could even enrich the dataset with user feedback from the app: if someone marked one of our answers as “not good”, we can use that to improve the models on the next iteration.
Enable us to split traffic to a couple of different back-end models, so that we can compare them head-to-head. We’d like to understand how operational metrics like latency, throughput, and error rate compare between different APIs, using real traffic.
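To make the three goals above concrete, here is a minimal sketch of such an abstraction layer. It is an illustration under stated assumptions, not Ought's design: `ModelRouter`, `CallRecord`, and the idea that a model is a plain callable from task string to result string are all hypothetical.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Model = Callable[[str], str]  # assumption: a model maps a task string to a result string


@dataclass
class CallRecord:
    model: str
    task: str
    result: Optional[str]
    latency_s: float
    error: Optional[str] = None


@dataclass
class ModelRouter:
    models: Dict[str, Model]   # model name -> callable
    weights: Dict[str, float]  # share of traffic each model should receive first
    log: List[CallRecord] = field(default_factory=list)
    rng: random.Random = field(default_factory=random.Random)

    def _order(self) -> List[str]:
        """Pick a primary model by traffic weight; the others become fallbacks."""
        names = list(self.models)
        primary = self.rng.choices(names, weights=[self.weights[n] for n in names])[0]
        return [primary] + [n for n in names if n != primary]

    def complete(self, task: str) -> str:
        """Try models in order, recording every attempt, until one succeeds."""
        for name in self._order():
            start = time.perf_counter()
            try:
                result = self.models[name](task)
            except Exception as exc:
                elapsed = time.perf_counter() - start
                self.log.append(CallRecord(name, task, None, elapsed, repr(exc)))
                continue
            self.log.append(CallRecord(name, task, result, time.perf_counter() - start))
            return result
        raise RuntimeError("all models failed")

    def dataset(self) -> List[Tuple[str, str]]:
        """Successful task/result pairs, usable later as training data."""
        return [(r.task, r.result) for r in self.log if r.error is None]
```

Because every attempt is logged with its model name, latency, and error, the same log can feed both the head-to-head comparison of operational metrics and the task/result dataset.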

Implementing this abstraction layer would be challenging work. Obviously, when running in our production environment, performance and reliability are critical. In contrast, for internal experiments we’d like the system to be highly flexible and easy to modify. There are some tools for tracking datasets and versioning, but it’s a relatively immature space compared to conventional infrastructure: there’s not an enormous toolchain for you to learn!

This is a key component of our AI safety work because building up these datasets of task/result pairs is one way to do the “Distillation” part of Iterated Distillation and Amplification. For example, imagine we have a complex and expensive Factored Cognition process running in our infrastructure. After 1,000, 10,000, or 100,000 examples we could hot-swap in a new model trained on the real data captured at your abstraction layer.
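The hot-swap idea can be sketched in a few lines. This is only a toy under loud assumptions: `DistillingEndpoint` and `memorise` are hypothetical names, and the "training" step is a lookup table standing in for fitting a real model on the captured pairs.

```python
from typing import Callable, List, Optional, Tuple

Model = Callable[[str], str]
Trainer = Callable[[List[Tuple[str, str]]], Model]


class DistillingEndpoint:
    """Serve an expensive process until enough examples exist, then swap in a cheaper model."""

    def __init__(self, expensive: Model, train: Trainer, threshold: int):
        self.expensive = expensive
        self.train = train
        self.threshold = threshold
        self.examples: List[Tuple[str, str]] = []
        self.distilled: Optional[Model] = None

    def __call__(self, task: str) -> str:
        if self.distilled is not None:
            return self.distilled(task)
        result = self.expensive(task)
        self.examples.append((task, result))  # capture the task/result pair
        if len(self.examples) >= self.threshold:
            self.distilled = self.train(self.examples)  # hot-swap from the next call on
        return result


def memorise(examples: List[Tuple[str, str]]) -> Model:
    """Toy trainer: a lookup table. A real trainer would fit a model that generalises."""
    table = dict(examples)
    return lambda task: table.get(task, "unknown")
```

Callers keep using the same endpoint object before and after the swap, which is what makes the change invisible to the rest of the system.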

Why would you do this?

Hopefully the examples above show that there is a wide range of product building work which can help with AI safety. But there are lots of exciting opportunities out there! What makes this area so special?

Here are some of the things that excite me the most about working on AI safety:

Impactful work

80,000 Hours rates AI safety as one of its highest priority causes because the risks are astronomical, there are tangible things we can do to help, and talent is currently the main constraint.

The people are fantastic

Because it’s a nascent field with an altruistic bent, your colleagues and peers will generally have strong prosocial motivations. They will be gifted, dedicated, interesting teammates who aren’t just doing it for a paycheck.

Talking of paychecks…

Many of the organisations working on AI safety (including Ought!) are well-funded. You can work on something important without needing to sacrifice your creature comforts, or priorities like Earning To Give.

Interesting, challenging work every day

Nowadays, a lot of product development work is highly commoditised. We have enough frameworks, methodologies, tools, and services to make software projects feel like building a Lego set, rather than a truly novel challenge. This is not yet true with AI. New models, architectures, and techniques are being developed all the time, and there’s a tight feedback loop between academia and industry.

In fact, if you fit the profile of the audience of this post, and the points I just made above resonate with you, you might find that AI safety is your Ikigai – I did!

What to do now?

If this post has resonated with you and you’d like to know more about Ought:

Try our app, Elicit, to better understand what we’re building.
Go through our Factored Cognition Primer, through which you’ll develop hands-on experience with the approach we’re taking.
Read Supervise Process, not Outcomes to learn more about the framework within which our work sits.
Take a look at our open roles!

If you’re interested in working in AI safety more generally:

Preventing an AI-related catastrophe is a comprehensive and up-to-date overview of the cause area.
80,000 Hours also offers 1-1 advice to people looking to move to a high-impact career.
We’re starting a reading group aimed at the same sorts of people as this post. You can register your interest here!

My thanks to Jungwon Byun, Andreas Stuhlmüller, Odette Brady, Eric Arellano, Jess Smith, and Maggie Appleton for their contributions to this post.

A Library and Tutorial for Factored Cognition with Language Models

Andreas Stuhlmüller — Thu, 06 Oct 2022 00:00:00 GMT

We want to advance process-based supervision for language models. To make it easier for others to contribute to that goal, we've released code for writing compositional language model programs and a tutorial that explains how to get started:

The Interactive Composition Explorer (ICE) is a library for writing and debugging compositional language model programs.
The Factored Cognition Primer is a tutorial using examples to explain how to write such programs.

We've been using ICE as part of our work on Elicit and have found it useful in practice.

Interactive Composition Explorer (ICE)

ICE is an open-source Python library for writing, debugging, and visualizing compositional language model programs. ICE makes it easy to:

Run language model recipes in different modes: humans, human+LM, LM
Inspect the execution traces in your browser for debugging
Define and use new language model agents, e.g. chain-of-thought agents
Run recipes quickly by parallelizing language model calls
Reuse component recipes such as question-answering, ranking, and verification
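To illustrate how one recipe can run in human, human+LM, or LM mode, here is a minimal sketch that separates the recipe from the agent answering its prompts. It does not use ICE's actual API: `Agent`, `HumanAgent`, `StubModelAgent`, `HumanReviewedAgent`, and `recipe` are hypothetical names, and the model is a stub.

```python
import asyncio
from typing import Callable, Protocol


class Agent(Protocol):
    async def complete(self, prompt: str) -> str: ...


class HumanAgent:
    """Asks a person. `ask` defaults to input() and can be replaced in tests."""

    def __init__(self, ask: Callable[[str], str] = input):
        self.ask = ask

    async def complete(self, prompt: str) -> str:
        return self.ask(prompt + "\n> ")


class StubModelAgent:
    """Stand-in for a language model call (assumption, not a real API)."""

    async def complete(self, prompt: str) -> str:
        return f"[model answer to: {prompt}]"


class HumanReviewedAgent:
    """Human+LM mode: the model drafts, the human accepts (empty reply) or overrides."""

    def __init__(self, model: Agent, human: Agent):
        self.model = model
        self.human = human

    async def complete(self, prompt: str) -> str:
        draft = await self.model.complete(prompt)
        reply = await self.human.complete(
            f"{prompt}\nDraft: {draft}\nPress Enter to accept or type a replacement"
        )
        return reply or draft


async def recipe(agent: Agent, question: str) -> str:
    """The recipe never needs to know which kind of agent is answering."""
    return await agent.complete(f"Question: {question}\nAnswer:")
```

Switching modes then means passing a different agent, for example `asyncio.run(recipe(StubModelAgent(), "..."))`, without touching the recipe.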

ICE looks like this:

Factored Cognition Primer

The Factored Cognition Primer is a tutorial that explains (among other things) how to:

Implement basic versions of amplification and debate using ICE
Reason about long texts by combining search and generation
Run decompositions quickly by parallelizing language model calls
Use verification of answers and reasoning steps to improve responses
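As a taste of the first and third items, here is a minimal sketch of one-step amplification with parallel sub-calls. It is written against plain `asyncio` rather than ICE, and the prompts, `amplify`, and `stub_model` are illustrative assumptions, not code from the Primer.

```python
import asyncio
from typing import Awaitable, Callable

Model = Callable[[str], Awaitable[str]]


async def stub_model(prompt: str) -> str:
    """Stand-in for a language model: waits briefly and returns canned text."""
    await asyncio.sleep(0.01)
    if prompt.startswith("List subquestions"):
        return "What is known?\nWhat is uncertain?"
    return f"<{prompt.splitlines()[-1]}>"


async def amplify(question: str, model: Model, depth: int = 1) -> str:
    """Answer `question` by first answering subquestions, then combining them."""
    if depth == 0:
        return await model(f"Answer directly:\n{question}")
    listing = await model(f"List subquestions, one per line:\n{question}")
    subquestions = [q for q in listing.splitlines() if q.strip()]
    # Sub-answers do not depend on each other, so they can run concurrently.
    sub_answers = await asyncio.gather(
        *(amplify(q, model, depth - 1) for q in subquestions)
    )
    context = "\n".join(f"{q} {a}" for q, a in zip(subquestions, sub_answers))
    return await model(f"Given:\n{context}\nAnswer:\n{question}")
```

With a real model behind `model`, the `asyncio.gather` call is where most of the speed-up would come from, since the sub-calls overlap instead of running one after another.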

The Primer looks like this:

If you end up using either, consider joining our Slack. We think that factored cognition research parallelizes unusually well and would like to collaborate with others who are working on recipes for cognitive tasks.

To learn more about how we've been using ICE, watch our recent Factored Cognition lab meeting.

How to use Elicit responsibly

Jungwon Byun and Andreas Stuhlmüller — Mon, 25 Apr 2022 00:00:00 GMT

Elicit has gotten exciting coverage on Twitter the last few days, leading to an influx of new users [1, 2, 3]. Welcome! We’re so excited to have you and grateful for your interest.

Alongside the overwhelmingly positive response, some people wisely pointed out the need for more transparency about who is building Elicit, how it works, and where it doesn’t. We’ll start the conversation with this note, but we expect this will be an ongoing dialogue.

Most important takeaways

Elicit helps with, but does not automate, literature reviews

Elicit is an early product using early technology, attempting to help with complex topics. You are the researcher and the expert, not Elicit. Elicit results should be taken as a starting point for your further review and evaluation.

Elicit is only as good as the research it uses

While we think researchers are a very careful and rigorous group on average, there is research with questionable methodology and even fraud. Elicit does not yet know how to evaluate whether one paper is more trustworthy than another, except by giving you some imperfect heuristics like citation count, journal, critiques from other researchers who cited the paper, and certain methodological details (sample size, study type, etc.). We’re actively researching how best to help with quality evaluation but, today, Elicit summarizes the findings of a bad study just like it summarizes the findings of a good study.

Similarly, when you are impressed by Elicit results, it’s in large part because some researchers poured their blood, sweat, and tears into doing the actual research and presenting it. They also deserve your admiration :)

Double-check Elicit's work

Confirm that Elicit’s summaries and extracted information are correct by clicking each row and reviewing the abstract or full text of the paper. Search for both sides of your question to minimize confirmation bias.

The team building Elicit

Elicit is built by Ought, a non-profit machine learning research lab with a team of eight people distributed across the Bay Area, Austin, New York, and Oristà. Our team brings experiences from academia, mature tech, and startups. Ought is funded by grants from organizations and individuals like Open Philanthropy, Jaan Tallinn, Future of Life Institute, and other individuals identifying with the effective altruism and longtermism communities. Our funders and team are primarily motivated by making sure that artificial intelligence goes well for the world, in part by being useful for high-quality work like research. Elicit is the only project that Ought currently works on.

How Elicit works

Elicit is an early-stage product, with updates and improvements every week (as documented on our mailing list). As of April 25, 2022, the literature review workflow is implemented as follows (in the interest of sharing a lot of information quickly, we’ve unfortunately had to assume a lot of technical context):

You enter a question.
We search for relevant papers, shown one per row in the results table:
We convert your question to keywords by leaving out stop words like “the”, “a”, and “and.”
We retrieve the title and abstract for the top 1,000 results from the Semantic Scholar API for the keywords you provide, applying additional filters if you select them (keywords, dates, or study type).

If you star papers and click “show more like starred”, we instead retrieve paper candidates by expanding the citation network of the starred papers forward and backward, again using the Semantic Scholar API.
We rank the retrieved papers based on relevance to your query, using the GPT-3 Babbage search endpoint for the first ranking step, then a finetuned T5 model for the second ranking step. The ranking step takes titles and abstracts as input, and computes how relevant each (title, abstract) pair is to your question.
For each paper, we also retrieve title, abstract, authors, citations, and a few other bits of metadata from Semantic Scholar, and the PDF link from Unpaywall.
We return the top eight papers to the next step.
We generate additional information for the top eight papers, mostly shown in the supplementary columns you can add to the results table.
We generate what the paper’s abstract implies about your question using a GPT-3 Davinci model finetuned on roughly 2,000 examples of questions, abstracts, and takeaways.

This column was previously called “Answer to your question” but that sounded too strong and it was unclear where the answer was coming from. We’ve renamed it to “Takeaway from abstract.”

We use the prompt-based GPT-3 Davinci Instruct model for some of the supplementary columns (e.g., number of participants, number of studies), and a fine-tuned Curie model for others (e.g., intervention, dose). For the prompt-based Instruct model, the prompt looks like this, with “...” replaced with the query and paper details:

Answer the question "..." based on the extract from a research paper.
Try to answer, but say "... not mentioned in the paper" if you really don't know how to answer.
Include everything that the paper excerpt has to say about the answer.
Make sure everything you say is supported by the extract.
Answer in one phrase or sentence.

Paper title: ...

Paper excerpt: ...

Question: ...

Answer:

We use a finetuned GPT-3 Davinci model to compute the “Takeaway suggests yes/no” column.
We use a bag-of-words SVM to classify which studies are randomized controlled trials.
If you open the paper detail modal, we surface the citations most likely to criticize methodology by first ranking citations from Semantic Scholar using the GPT-3 Ada search endpoint, then re-ranking them with a finetuned GPT-3 Curie model.
As we go through these steps, we stream information back to you as soon as it’s computed. For example, we return the titles of papers before we’ve computed the claims.
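The keyword and ranking steps above can be sketched roughly as follows. This is not Elicit's code: the stop-word list is illustrative, the two scorers are simple stand-ins for the GPT-3 Babbage and finetuned T5 rankers, and the `shortlist` cut-off between the two ranking steps is an assumption the description above does not specify.

```python
import re
from typing import Callable, List, Tuple

# Illustrative stop-word list, not the one Elicit uses.
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "does", "what", "on"}

Paper = Tuple[str, str]  # (title, abstract)
Scorer = Callable[[str, Paper], float]


def to_keywords(question: str) -> List[str]:
    """Lower-case the question and drop stop words."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return [w for w in words if w not in STOP_WORDS]


def overlap_score(question: str, paper: Paper) -> float:
    """Cheap stand-in scorer: question keywords found in title plus abstract."""
    text = set(to_keywords(paper[0] + " " + paper[1]))
    return float(len(text & set(to_keywords(question))))


def title_score(question: str, paper: Paper) -> float:
    """Stand-in for the second, more precise scorer: keyword hits in the title only."""
    return float(len(set(to_keywords(paper[0])) & set(to_keywords(question))))


def two_stage_rank(
    question: str,
    papers: List[Paper],
    cheap: Scorer,
    precise: Scorer,
    shortlist: int = 100,  # assumed cut-off between the two steps
    top_k: int = 8,
) -> List[Paper]:
    """Rank everything with the cheap scorer, then re-rank a shortlist precisely."""
    first = sorted(papers, key=lambda p: cheap(question, p), reverse=True)[:shortlist]
    return sorted(first, key=lambda p: precise(question, p), reverse=True)[:top_k]
```

Running the cheap scorer over all candidates and the expensive one only over a shortlist is a common way to keep a two-step ranker affordable; whether Elicit shortlists between the two steps is not stated here.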

Much of this workflow will change in the future based on user feedback and internal evaluations. We already know that we’re soon going to:

Replace the keyword search with a semantic search that uses a vector representation for the meaning of the original query (including stopwords), which allows us to retrieve relevant documents even when they are not exact keyword matches.
Support generating additional information for the top papers based on the full text of open-access papers, not just based on the abstract.
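To show why a vector representation can retrieve documents that share no keywords with the query, here is a toy sketch. The `embed` function is a hand-made stand-in for a learned embedding model (related words are simply assigned the same dimension), so it illustrates only the retrieval step, not how Elicit computes its vectors.

```python
import math
from typing import Dict, List, Sequence

# Toy stand-in for a learned embedding model: related words share a dimension.
# A real system would use a trained encoder instead of this hand-written table.
CONCEPTS: Dict[str, int] = {
    "heart": 0, "cardiac": 0, "cardiovascular": 0,
    "exercise": 1, "training": 1, "workout": 1,
    "sleep": 2, "insomnia": 2,
}
DIM = 3


def embed(text: str) -> List[float]:
    """Map text to a small vector by counting concept hits."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        idx = CONCEPTS.get(word.strip(".,?!"))
        if idx is not None:
            vec[idx] += 1.0
    return vec


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_search(query: str, documents: List[str], top_k: int = 1) -> List[str]:
    """Return the documents whose vectors are closest to the query vector."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:top_k]
```

In this toy, the query "Does exercise help the heart?" matches a document titled "Cardiac outcomes after endurance training" even though the two share no keywords, because both map to the same concept dimensions.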

Elicit's limitations

To help you calibrate how much you can rely on Elicit, we’ll share some of the limitations you should be aware of as you use Elicit:

Limitations specific to Elicit

Elicit uses language models, which have only been around for three years. While already useful, this early-stage technology is far from “Artificial general intelligence that takes away all of our jobs.”

The models aren’t explicitly trained to be faithful to a body of text. We’ve had to customize the models to make sure their summaries or extractions are actually what is said in the abstract, and not what the model thinks is likely to be the case in general (sometimes called "hallucination"). While we’ve made a lot of progress and try hard to err on the side of Elicit saying nothing rather than saying something wrong, in some cases Elicit can miss the nuance of a paper or misunderstand what a number refers to.
Elicit is a very early stage tool and we launch things uncomfortably beta to iterate quickly with user feedback. It’s more helpful to think of Elicit-generated content as around 80-90% accurate, definitely not 100% accurate.
Other people have also helpfully shared thoughts on limitations [1, 2].

Limitations that apply to research or search tools in general

As we discussed at the start, Elicit is only as good as the papers underlying it. There are some bad papers we have yet to filter out and there are some important papers not yet in our dataset.
In the same way that good research involves looking for evidence for and against various arguments, we recommend searching for papers presenting multiple sides of a position to avoid confirmation bias.
Elicit works better for some questions and domains than others. We eventually want to help with all domains and types of research but, to date, we’ve focused on empirical research (e.g. randomized controlled trials in social sciences or biomedicine) so that we can apply lessons from the systematic review discipline.

Other thoughts on limitations

This section is really way too short. We tried to share enough to make you not overrely on Elicit but this is not a comprehensive list of possible limitations.

Lastly, more users than we expected might mean that the app breaks. So far, our engineering team has done a phenomenal job keeping the site up. But if you encounter an error, please let us know at help@elicit.org and thanks for understanding.

Suggestions for how to relate to Elicit

Given these limitations, here are some ways to relate to Elicit that can be useful without leading to undue confidence in Elicit’s abilities.

Get a broader perspective than you could have otherwise

Elicit’s tabular interface with columns that highlight key information about studies aims to make it easier for you to review a study in the context of other studies and to get a preliminary understanding of more studies. This can’t replace digging into the studies and understanding them carefully, but Elicit may be more effective than other tools at showing varied or conflicting perspectives.

Find papers that you may not have found elsewhere

The same query may get you different results in different paper search databases. This can be because different search tools have more or different papers or because they rank papers differently. Elicit can supplement other search tools to help you discover different papers. The inverse is also true - search engines with different ranking algorithms may return different papers at the top even if they were to have the exact same data as Elicit does.

Figure out where to drill in

Generally, Elicit users today find it helpful as a starting point. They may have a question, but not the best keywords. Some papers might seem relevant, but they may not know whether digging into them would involve getting stuck at a local optimum. Overall, there are way more papers than any of us could ever read in an ideal world. Elicit can help with the prioritization decision by showing information about papers and letting you sort or filter by that information.

Overall reflections

There are short-term limitations that we expect to overcome as we continue working on Elicit, such as better coverage of papers, higher accuracy of Elicit-generated answers, etc.

But there are also fundamental questions that we will probably wrestle with for a very long time. These are questions like:

What is the right balance between “leave it to the experts” and “don’t be a gatekeeper”? Between “clarity and accessibility” and “comprehensiveness and nuance”?
What types of work are necessary for developing expertise, and what types are time-sinks resulting from imperfect tools that we’ve just come to accept?

It's unlikely that we’ll manage to always get this right. We chose to build a research assistant for many reasons, one of which is that researchers are a particularly rigorous and skeptical group. Researchers keep us honest and provide detailed feedback. There is already a wealth of knowledge about research methodologies and best practices that we have been learning from.

We’d really like Elicit to be a tool that we build together, a tool where you see the tangible impact of your feedback on every page. Many of you have already spent so much time with us sharing your screens, showing us your notes, sending us your papers, and finding ways Elicit can be better.

Seriously, thank you so much. It has not stopped blowing our minds that people are so encouraging and helpful.

The Plan for Elicit

Andreas Stuhlmüller and Jungwon Byun — Fri, 08 Apr 2022 00:00:00 GMT

Ought is an applied machine learning lab. We’re building Elicit, the AI research assistant. Our mission is to automate and scale open-ended reasoning. To get there, we train language models by supervising reasoning processes, not outcomes. This is better for reasoning capabilities in the short run and better for alignment in the long run.

In this post, we review the progress we’ve made over the last year and lay out our plan.

Progress in 2021:

We built Elicit to support researchers because high-quality research is a bottleneck to important progress and because researchers care about good reasoning processes.
We identified some building blocks of research (e.g. search, summarization, classification), operationalized them as language model tasks, and connected them in the Elicit literature review workflow.
On the infrastructure side, we built a streaming task execution engine for running compositions of language model tasks. This engine is supporting the literature review workflow in production.
About 1,500 people use Elicit every month.

Roadmap for 2022+:

We expand literature review to digest the full text of papers, extract evidence, judge methodological robustness, and help researchers do deeper evaluations by decomposing questions like “What are the assumptions behind this experimental result?”
After literature review, we add other research workflows, e.g. evaluating project directions, decomposing research questions, and augmented reading.
To support these workflows, we refine the primitive tasks through verifier models and human feedback, and expand our infrastructure for running complex task pipelines, quickly adding new tasks, and efficiently gathering human data.
Over time, Elicit becomes a general-purpose reasoning assistant, transforming any task involving evidence, arguments, plans and decisions.

How we think about success

Our mission is to automate and scale open-ended reasoning. If we can improve the world’s ability to reason, we’ll unlock positive impact across many domains including AI governance & alignment, psychological well-being, economic development, and climate change.

As AI advances, the raw cognitive capabilities of the world will increase. The goal of our work is to channel this growth toward good reasoning. We want AI to be more helpful for qualitative research, long-term forecasting, planning, and decision-making than for persuasion, keeping people engaged, and military robotics.

Good reasoning is as much about process as it is about outcomes. In fact, outcomes are unavailable if we’re reasoning about the long term. So we’re generally not training machine learning models end-to-end using outcome data, but building Elicit compositionally and inspired by human processes.

In the short term, supervising process is necessary for AI to help with tasks where it’s difficult to evaluate the work from results alone. In the long term, process-based systems can avoid alignment risks introduced by end-to-end training.

Success for us looks like this:

Elicit radically increases the amount of good reasoning in the world.
1. For experts, Elicit pushes the frontier forward.
2. For non-experts, Elicit makes good reasoning more affordable. People who don’t have the tools, expertise, time, or mental energy to make well-reasoned decisions on their own can do so with Elicit.
Elicit is a scalable ML system based on human-understandable task decompositions, with supervision of process, not outcomes. This expands our collective understanding of safe AGI architectures.

Because we’re betting on process-based architectures, these two success criteria are fundamentally intertwined.

Progress in 2021

Start with research

We’ve decided to start by supporting researchers for the following reasons:

Research matters: Impact in many domains is gated by high-quality research, and the world may get even more complex and difficult to reason about in the coming decades. With language models, we can scale best practices from research beyond human capacity. Language models can read and evaluate more research, evidence, and reasoning steps than humanly possible.
Researchers care about reasoning: Researchers have relatively high bars for what good reasoning entails, and more established practices for how to do it. We want to learn what they know.
Research is often process-based: Researchers often care as much about process as outcome, making this a good domain in which to develop process-based architectures. When we work on automating literature review, we don’t collect many examples of research questions and literature reviews and train a neural net end-to-end to produce legit-looking reviews. Instead, we study human experts, understand and decompose their reasoning, and build models that design, execute, and compose research steps like these experts.
Democratize high-quality reasoning: Researchers are able to do more expensively what many people can only afford to do “cheaply”. Researchers have more time, expertise, and tools to carefully study the best answer to different questions. People or organizations may need the same findings to make decisions, but often don’t have those resources. Language models can make best practices accessible to nonexperts.
Learn from the non-expert/expert gap: The gap between a domain expert and novice in research may help us understand the dynamics between transformative AI and humans in the years to come. If we can figure out how to help less informed humans reproduce the judgments of more informed ones, we may be able to learn lessons about how humans can supervise advanced AI systems (“sandwiching”).

We’re studying researchers and how they discover, evaluate, and generate knowledge. Within research, we chose an initial workflow (literature review, mostly for empirical research) and will expand to other workflows and question types. Eventually, we’ll surface the building blocks of many cognitive tasks so that users can automate their own reasoning processes.

Support broad literature reviews

Today, Elicit uses language models to automate parts of literature review, helping people answer questions with academic literature. Researchers use Elicit to find papers, ask questions about them, and summarize their findings.

We started with the literature review workflow for a few reasons:

Pain point: Researchers said that finding and processing literature was their greatest pain point. Before we launched literature review, around 30% of the researchers signing up for Elicit said that the thing they most need help with is literature review. This was by far the largest category of need.
Status quo before frontier: Literature review is a way to understand the current state of research. Understanding the status quo is a near-prerequisite for expanding the frontier.
Process-based: There is a rich discipline around processes to synthesize literature. The systematic review process is a vetted, explicit process designed to reproducibly identify and aggregate research. That process taught us how to evaluate literature and gave us benchmarks to compare against our progress.

The literature review workflow in Elicit composes about 10 subtasks, including:

Search: Given a search term, Elicit uses language models to rank which papers are most likely to answer a user’s question, even in cases where there is no overlap in keywords.
Summarization and rephrasing: Elicit reviews the abstracts of the most relevant papers and does its best to say for each abstract how it would answer the user’s question in one short sentence. This summary is typically more concise and relevant than any one sentence in the abstract.
Classification: When users ask yes/no questions, Elicit predicts whether the abstract’s answer is more likely to be “Yes” or “No”. Similarly, Elicit identifies which of the papers shown are randomized controlled trials, systematic reviews, meta-analyses, etc.
Extraction: Elicit automatically extracts key information from abstracts, such as sample population, study location, intervention tested, and outcome measured. Users can ask custom questions about the abstracts as well.
Critique: Elicit looks through all citations to a paper and surfaces the ones that are most likely to contain methodological critiques.

Outside of the literature review workflow, versions of some of these subtasks also exist independently on Elicit and researchers find them useful.

Establish a user base

Elicit is still early. We’ve spent about seven months building the literature workflow. Its impact on helping the world reason better, and on demonstrating a process-based ML architecture, is understandably small. Nevertheless, we’re excited about the reception so far and the potential to significantly scale its impact over the coming years.

Over 1,500 people use Elicit each month. Over 150 people use Elicit for more than 5 days each month (roughly once a week). 60% of users in a month are returning users: people who used Elicit in a previous month and found it worth using again. In our February feedback survey, 45% of respondents said they would be “very disappointed” if Elicit went away. (Tech companies try to get this to 40%.) Elicit has been growing by word of mouth, and we expect to continue growing organically while we focus on making Elicit useful.

Today’s users primarily use Elicit to find papers and define research questions at the start of their research projects. 40% of respondents to our February feedback survey shared that they most want Elicit to help them with these tasks, and that Elicit is more useful for these tasks (7.8 and 7.1 out of 10) than for the others we asked about.

Elicit users also want help understanding paper contents and conducting systematic reviews, but Elicit was less helpful there at the time. (Understanding paper content is now a Q2 priority.)

Some of our most engaged researchers report using Elicit to find initial leads for papers, answer questions, and get perfect scores on exams (via Elicit Slack). One researcher used a combination of Elicit literature review, rephrase, and summarization tasks to compile a literature review for publication. Our Twitter page shows more examples of researcher feedback and how people are using Elicit.

At least 8% of users are explicitly affiliated with rationality or effective altruism, based on how they heard about Elicit or where they work. We also worked closely with CSET, whose researchers cited Elicit in three publications (Harnessed Lightning, Wisdom of the Crowd as Arbiter of Expert Disagreement, Classifying AI Systems).

In sum, people are using Elicit regularly and recommending it to others. We take this as a sign that Elicit is creating value. We’re excited for the day when we can make stronger claims about the impact Elicit is having on people’s reasoning. We plan to experiment with different evaluations of Elicit’s impact. Some ideas we’ve had in this direction:

Have people of different levels of expertise use Elicit to generate a systematic review and see if they are indistinguishable from expert-generated systematic reviews (and ideally take less time).
Use Elicit to update, then replicate, then generate high-quality research like GiveWell intervention reports.
Run a randomized controlled trial with Elicit as the intervention for different research or reasoning tasks.

Build infrastructure for process-based ML

Because Elicit is a process-based architecture, we need to get good at running complex task pipelines and at making sure the individual tasks within the pipelines are reliable. We’ve made progress on both fronts over the past year.

Running complex task pipelines

We’ve built a task graph execution framework for efficiently running compositions of language model tasks. The framework is used to run literature review tasks and is likely one of the most compositional uses of language models in the world. Elicit engineers only need to specify how tasks depend on other tasks (e.g. claim extraction depends on ranking), and the scheduling and execution across compute nodes happen automatically.

The execution engine runs the graph of tasks in parallel as efficiently as allowed by the dependency structure of the workflow graph. While running, the executor streams back partial results to the Elicit frontend. Because language models are relatively slow (more than one second per query for the largest models), parallelism and sending partial results both matter for a good user experience.
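
The scheduling idea can be illustrated with a minimal sketch. This is not Elicit's framework: `run_graph`, the toy tasks, and the `on_result` callback are hypothetical names chosen for illustration, and plain `asyncio` stands in for execution across compute nodes. Tasks only declare what they depend on; each one starts as soon as its inputs are ready, and every finished result is reported immediately.

```python
import asyncio

async def run_graph(tasks, deps, on_result):
    """Run each task once its dependencies finish; independent tasks overlap.

    tasks: name -> async function taking a dict of dependency results
    deps: name -> list of dependency names (assumed acyclic)
    on_result: callback invoked with (name, result) as soon as a task finishes
    """
    futures = {}

    async def run(name):
        # Wait for all upstream results, then run this task.
        inputs = {d: await futures[d] for d in deps.get(name, [])}
        result = await tasks[name](inputs)
        on_result(name, result)  # stream the partial result to the caller
        return result

    # Schedule everything up front; the dependency awaits enforce the order.
    for name in tasks:
        futures[name] = asyncio.create_task(run(name))
    return {name: await fut for name, fut in futures.items()}

# Toy stand-ins for language model tasks (hypothetical).
async def search(inputs):
    return ["paper A", "paper B"]

async def rank(inputs):
    return sorted(inputs["search"], reverse=True)

async def extract_claims(inputs):
    return [f"claim from {p}" for p in inputs["rank"]]

TASKS = {"search": search, "rank": rank, "extract_claims": extract_claims}
DEPS = {"rank": ["search"], "extract_claims": ["rank"]}

streamed = []
results = asyncio.run(run_graph(TASKS, DEPS, lambda name, r: streamed.append(name)))
```

In this toy graph the tasks form a chain, so they finish in dependency order. With a wider graph, tasks that share no dependency path would overlap while they wait on model calls.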

Finetuning individual tasks

To get good overall answers, we also need individual primitive tasks to be robust. In a project in Q4 2021, we focused on generating one-sentence answers based on abstracts as a case study. When a researcher asks a question, Elicit finds relevant papers, reads the abstracts, then generates a one-sentence summary of the abstract that answers the researcher’s question. These summaries are often more relevant to the researcher’s specific question than any one of the sentences in the abstract.

With few-shot learning, we found that the claims were often irrelevant, hard to understand, and sometimes hallucinated, i.e. not supported by the abstract. This is a case of “capable but unaligned.” GPT-3 has the entire abstract, which contains all of the information it needs to generate a summary answer. We’re confident that GPT-3 is capable of generating such answers—it could even just pick the most relevant sentence and return it word for word. Nonetheless, it sometimes made things up.

As one of the first users of GPT-3 finetuning, we switched from few-shot learning to a finetuned claim generation model. This made the claims more relevant and easier to understand, but initially made hallucination worse. Through a sequence of finetuning runs on increasingly higher-quality datasets, we reduced hallucination without making claims less relevant. We still haven’t fully solved this problem. We expect that our upcoming work on verifier models, decomposition, and human feedback will help.

Roadmap for 2022+

This roadmap highlights the most important themes for Elicit over the next few years. A more fleshed-out roadmap is in this doc.

Evaluate papers in depth through decomposition

To date, we’ve focused on making Elicit useful for getting a broad overview of a research space, surveying many papers. Next, we will help researchers as they go deep into individual research papers and use those subtasks to support more complex reasoning.

Over the next few months, we’ll work on projects like:

Study bespoke evidence reviews and make Elicit more useful for those.
Help evaluate methodological quality and robustness of papers by breaking down this evaluation into multiple component factors. This can look like:
- Identify how people predict whether or not a paper will replicate and make those calculations easy in Elicit.
- Identify the components of risk of bias analysis and make those available in Elicit.
- Evaluate claims via belief propagation on weighted citation networks.
Enable question-answering across the full text of papers.
Let users ask follow-up questions to Elicit-generated answers, using Elicit-generated answers as additional context.
Prototype AI safety via debate in the context of scientific claims.

As we help users with more complex reasoning, we’ll need to get better at automatic decomposition, aggregating the results of subtasks, and understanding what users are really looking for. This will make Elicit more useful for more complex research (differential capabilities) and shed light on the feasibility of process-based architectures (alignment).

Here are two examples of how Elicit might automatically decompose complex tasks:

Elicit factors a question

Researcher asks a complex, underdefined question about a paper e.g. “What was the effect?”
Elicit automatically decomposes this question into subquestions like:
1. What are all of the population-intervention-outcome permutations studied in this paper?
  1. What are all of the populations studied?
  2. For each population, what are all of the interventions studied?
  3. For each population-intervention pair, what are all of the outcome measures?
2. What was the effect size for each permutation?
3. Was the effect size significant?
Elicit then summarizes these findings back for the user, or only presents the relevant subset.
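
One way to picture this decomposition is as a tree of questions. The sketch below is illustrative only: the `Question` class, the `answer` function, and the `stub_model` answerer are hypothetical, and a real system would generate the tree and the leaf answers with language models.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Question:
    text: str
    subquestions: List["Question"] = field(default_factory=list)

def answer(q, answer_leaf, depth=0):
    """Answer leaf questions directly; report a parent via its children's answers."""
    indent = "  " * depth
    if not q.subquestions:
        return [f"{indent}{q.text} -> {answer_leaf(q.text)}"]
    lines = [f"{indent}{q.text}"]
    for sub in q.subquestions:
        lines.extend(answer(sub, answer_leaf, depth + 1))
    return lines

tree = Question("What was the effect?", [
    Question("What are all of the population-intervention-outcome permutations?", [
        Question("What are all of the populations studied?"),
        Question("For each population, what are all of the interventions studied?"),
        Question("For each population-intervention pair, what are all of the outcome measures?"),
    ]),
    Question("What was the effect size for each permutation?"),
    Question("Was the effect size significant?"),
])

def stub_model(text):
    # Placeholder for a model call on a single short-horizon question.
    return "<model answer>"

report = answer(tree, stub_model)
```

Because every leaf is a short, independently meaningful question, each answer in `report` can be inspected or corrected on its own.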

Elicit factors a research process

Researcher asks a research question e.g. “How does caffeine affect longevity?”
Elicit generates a literature review process for the specific question e.g.:
1. Run a semantic search to find the top 1000 papers that are most likely to have information relevant to caffeine and longevity, even if only indirectly
2. Limit to studies published after 2015
3. Read the titles and abstracts of the post-2015 papers to find the most relevant randomized controlled trials related to longevity
4. Identify sample size in abstract and limit to studies with at least 100 participants
5. Identify sample population in abstract and limit to studies on human participants
Researcher can provide feedback on the proposed process. They might want to tweak the query, choose a different corpus of papers, adjust sample size thresholds, or add other steps.
Once Elicit completes the tasks, the researcher can take a look at the overall results, and zoom into individual steps if they wish.
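
The filtering steps of such a process can be sketched as an explicit, editable list. Everything here is made up for illustration: the paper records, the field names, and `run_process` are assumptions, and the semantic search step is omitted so the example stays self-contained.

```python
# Hypothetical records, as if already extracted from abstracts.
PAPERS = [
    {"title": "Trial A", "year": 2018, "design": "RCT", "sample_size": 250, "population": "human"},
    {"title": "Trial B", "year": 2012, "design": "RCT", "sample_size": 900, "population": "human"},
    {"title": "Cohort C", "year": 2020, "design": "observational", "sample_size": 5000, "population": "human"},
    {"title": "Trial D", "year": 2019, "design": "RCT", "sample_size": 40, "population": "human"},
    {"title": "Trial E", "year": 2021, "design": "RCT", "sample_size": 300, "population": "mouse"},
]

# Each step is a label plus a predicate, so a researcher can reorder,
# remove, or adjust steps before running the process.
STEPS = [
    ("published after 2015", lambda p: p["year"] > 2015),
    ("randomized controlled trial", lambda p: p["design"] == "RCT"),
    ("at least 100 participants", lambda p: p["sample_size"] >= 100),
    ("human participants", lambda p: p["population"] == "human"),
]

def run_process(papers, steps):
    """Apply each filter in turn and keep a per-step trace of surviving titles."""
    trace = []
    remaining = list(papers)
    for label, keep in steps:
        remaining = [p for p in remaining if keep(p)]
        trace.append((label, [p["title"] for p in remaining]))
    return remaining, trace

final, trace = run_process(PAPERS, STEPS)
```

The `trace` is what lets a researcher zoom into an individual step, for example to see which papers the sample size threshold removed.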

Support many research workflows

Right now, Elicit works best for questions about empirical research. Those tend to be questions of the style “What are the effects of X on Y?”, including questions about randomized controlled trials in biomedicine, social science, or economics.

Starting in late 2022, we want to move beyond literature review for empirical questions and let users automate custom workflows, initially within research. Elicit will become a workspace where users can invoke and combine tasks like search, classification, clustering, and brainstorming over datasets of their choice, with different models and interfaces.

For example, researchers might want to search over their own corpus from a reference manager, extract all of the outstanding research directions from the papers they’ve curated, rephrase them as questions, then search those questions over academic databases to see if any of them have been worked on.

They might connect their personal notetaking apps, classify all of the notes about papers, then train a model to watch the literature and notify them if new papers addressing any of their cruxes are published.

To ensure users have the tools they need to design their personal research assistants, we’ll work on projects like:

Expand to more information sources (Wikipedia, think tank publications, personal notes)
Make Elicit useful for other types of questions e.g.
1. What are the best examples of X?
2. What are the best arguments / evidence for / against X?
3. How does X work?
4. What is the future of X?
5. How much research has been done on X?
6. How do you do X?
Study research processes and build the UX and infrastructure improvements they require e.g.
1. Extract open research directions from the literature
2. Identify which research directions many people are interested in but where no good papers exist
3. Organize concepts & arguments in a space

Refine the primitive tasks

We’ll keep refining the core subtasks underlying many research workflows. This entails both task-specific work, such as building out search infrastructure for academic articles, as well as general-purpose human feedback mechanisms.

One of our biggest projects right now is building a semantic search engine for 200 million abstracts and 66 million full-text papers using language model embeddings.
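
As a rough illustration of embedding-based search, the sketch below ranks documents by cosine similarity to a query vector. The three-dimensional vectors are invented by hand; a real system would obtain embeddings from a language model and use an approximate nearest-neighbor index at this scale.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank(query_vec, corpus):
    """Return titles ordered from most to least similar to the query."""
    return sorted(corpus, key=lambda title: cosine(query_vec, corpus[title]), reverse=True)

# Toy embeddings (hypothetical values, not produced by any model).
CORPUS = {
    "Coffee intake and mortality": [0.9, 0.1, 0.0],
    "Sleep duration in adolescents": [0.1, 0.9, 0.1],
    "Deep learning for protein folding": [0.0, 0.1, 0.9],
}
QUERY = [0.8, 0.2, 0.0]  # stands in for "How does caffeine affect longevity?"

ranking = rank(QUERY, CORPUS)
```

The query shares no keyword with the top-ranked title. The match comes entirely from the vectors being close, which is the property that makes embedding search useful when keywords do not overlap.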

On the human-feedback side, we’ll apply and contribute to methods for alignment. For example:

Applying verifier models within Elicit. We want to try using a classifier to identify whether the summary is supported by the abstract, whether Elicit’s guess at the dosage applied in the study is correct, etc.
Updating responses with user feedback. We’re excited about mechanisms that let users highlight where models fail, and about providing immediately improved answers when users highlight failures.

When we run into problems automating a task, we always want to understand whether this is because of limited data or limited model capacity. We are confident that model capacity will improve over time, and are primarily concerned with providing the data and training objective that will make good use of the available capacity at any point in time.

Expand our infrastructure for process-based ML

In the ideal world, the only constraint for new workflows is the compute time for running language models. To compete with end-to-end training, running new workflows using decomposition needs to have near-zero friction. This requires that we can run complex task pipelines, add new tasks with little effort, and efficiently gather human demonstrations and feedback.

Run more complex task pipelines

We’ll build the infrastructure to execute very large graphs of tasks and deal with the challenges that come up in this setting, such as:

Building models that make good choices about which tasks to run next
Choosing the right level of abstraction for decompositions
Avoiding accumulation of small errors by running human-understandable error correction processes and sanity checks, delegating reasoning about error propagation to the system
Using red teaming and interpretability methods to avoid catastrophic failures

Add new tasks with little effort

Adding new primitive tasks is labor-intensive. We need to think about what data is needed, create gold standards, collect finetuning data from contractors, evaluate model results using contractors, and use our judgment to improve instructions for contractors.

In the ideal world, we would just say "categorize whether this study is a randomized controlled trial" and an elegant machine involving copies of GPT-k, contractors, etc, would start up, generate a plan for accomplishing this task, critique and improve the plan, and execute it without any intervention on our part.

To get to this world:

We record all the steps from wanting to add a new task to having it up and running
We analyze what the bottlenecks are. Generating the data? Finding the model to train? Setting up the right objective?
For each bottleneck, we think about how to speed up the relevant decisions

Efficiently gather human demonstrations and feedback

Given a new task that models can't do out of the box, we need efficient mechanisms for gathering human demonstrations, using both a scalable contractor workforce and Elicit users. This is less distinctive to Elicit since everyone who trains models on human demonstrations and feedback has to cope with it. We are aiming to outsource as much of it as we can, but it is an important ingredient nonetheless.

Cases where users can provide good feedback but contractors naively can't are particularly interesting because they let us test how we can get feedback and demonstrations for tasks where it's hard to get good human oversight. They are a test case for the future where we want to accomplish tasks for which neither contractors nor users can provide feedback directly.

From research assistant to reasoning assistant

Zooming out, our milestones for the next few years are:

2022: Elicit is the best literature review assistant and demonstrates complex automated reasoning through decomposition
2023: Elicit automates many research workflows: Exploration, planning, reading, and others
2024: Users automate custom research workflows with Elicit
2025+: Elicit transforms any task involving evidence, arguments, plans and decisions

We’re starting by studying a group of researchers, who are thoughtful about how they discover and evaluate information and who have high standards of rigor. We’ll design Elicit to replicate their processes, using language models to apply them at a greater scale than humanly possible.

Eventually, we’ll make these research best practices available even to non-experts, to empower them when interacting with experts or making life decisions. We’ll support a diverse set of research workflows, then other workflows beyond research.

We’ll develop Elicit compositionally so that the system remains aligned and legible even as the reasoning it supports grows increasingly complex.

Today, researchers already find Elicit valuable. Yet there is much left to do. We’ve described the work we see ahead of us to get to a world with better reasoning. Join us!

Supervise Process, not Outcomes

Andreas Stuhlmüller and Jungwon Byun — Wed, 06 Apr 2022 00:00:00 GMT

We can think about machine learning systems on a spectrum from process-based to outcome-based:

Process-based systems are built on human-understandable task decompositions, with direct supervision of reasoning steps.
Outcome-based systems are built on end-to-end optimization, with supervision of final results.

This post explains why Ought is devoted to process-based systems. The argument is:

In the short term, process-based ML systems have better differential capabilities: They help us apply ML to tasks where we don’t have access to outcomes. These tasks include long-range forecasting, policy decisions, and theoretical research.
In the long term, process-based ML systems help avoid catastrophic outcomes from systems gaming outcome measures and are thus more aligned.
Both process- and outcome-based evaluation are attractors to varying degrees: Once an architecture is entrenched, it’s hard to move away from it. This lock-in applies much more to outcome-based systems.
Whether the most powerful ML systems will primarily be process-based or outcome-based is up in the air.
So it’s crucial to push toward process-based training now.

There are almost no new ideas here. We’re reframing the well-known outer alignment difficulties for traditional deep learning architectures and contrasting them with compositional approaches. To the extent that there are new ideas, credit primarily goes to Paul Christiano and Jon Uesato.

We only describe our background worldview here. In a follow-up post, we’ll explain why we’re building Elicit, the AI research assistant.

The spectrum

Supervising outcomes

Supervision of outcomes is what most people think about when they think about machine learning. Local components are optimized based on an overall feedback signal:

SGD optimizes weights in a neural net to reduce its training loss
Neural architecture search optimizes architectures and hyperparameters to have low validation loss
Policy gradient optimizes policy neural nets to choose actions that lead to high expected rewards

In each case, the system is optimized based on how well it’s doing empirically.

MuZero is an example of a non-trivial outcome-based architecture. MuZero is a reinforcement learning algorithm that reaches expert-level performance at Go, Chess, and Shogi without human data, domain knowledge, or hard-coded rules. The architecture has three parts:

A representation network, mapping observations to states
A dynamics network, mapping state and action to future state, and
A prediction network, mapping state to value and distribution over next actions.

Superficially, this looks like an architecture with independently meaningful components, including a “world model” (dynamics network). However, because the networks are optimized end-to-end to jointly maximize expected rewards and to be internally consistent, they need not capture interpretable dynamics or state. It’s just a few functions that, if chained together, are useful for predicting reward-maximizing actions.
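
The chaining can be made concrete with a toy sketch. The three functions below are hand-written stand-ins, not trained networks and not MuZero's implementation; they only show how the parts compose during a rollout.

```python
def representation(observation):
    # Maps an observation to a hidden state (here: just a tuple copy).
    return tuple(observation)

def dynamics(state, action):
    # Maps a state and an action to the next hidden state.
    return state + (action,)

def prediction(state):
    # Maps a state to a value and a distribution over next actions.
    value = sum(state) / len(state)
    n_actions = 2
    policy = [1.0 / n_actions] * n_actions
    return value, policy

def rollout(observation, actions):
    """Chain the three functions: encode once, step through actions, then predict."""
    state = representation(observation)
    for action in actions:
        state = dynamics(state, action)
    return prediction(state)

value, policy = rollout([1, 0], [1, 1])
```

In this sketch the hidden state happens to be readable. In an end-to-end trained system nothing forces that: the state only has to be whatever makes the final prediction useful.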

Neural nets are always in the outcome-based regime to some extent: In each layer and at each node, they use the matrices that make the neural net as a whole work well.

Supervising process

If you’re not optimizing based on how well something works empirically (outcomes), then the main way you can judge it is by looking at whether it’s structurally the right thing to do (process).

For many tasks, we understand what pieces of work we need to do and how to combine them. We trust the result because of this reasoning, not because we’ve observed final results for very similar tasks:

Engineers and astronomers expect the James Webb Space Telescope to work because its deployment follows a well-understood plan, and it is built out of well-understood modules.
Programmers expect their algorithms to implement the intended behavior because they reason about what each function and line does and how they go together to bring about the behavior they want.
Archeologists expect their conclusions about the age of the first stone tools to be more or less correct because they can reason about the age of the sediment layer the tools are in. They can estimate the age of the layers by looking at the iron-bearing minerals they contain which reflect the state of the earth’s magnetic polarity at the time they were preserved.

At Ought, we’ve been thinking about scientific literature review as a task where we expect to arrive at correct answers only when it’s based on a good process. When I’m trying to figure out whether iron supplements will help me or hurt me, I might start by following a process like this:

Clarify the question I’m trying to answer—what kind of iron, what kinds of supplements, what benefits am I hoping for? How will I decide whether to take the supplement or not?
Search for a list of candidate papers using the question and related search terms
For each study I find, answer:
1. Does it address the question I’m interested in, or a closely related question? Was the population studied similar to me?
2. Is it a randomized controlled trial, or a meta-analysis of trials?
3. Is the risk of bias below the threshold I’d accept? Are there no glaring critiques of the study or methodological limitations?
Throw out studies for which the answer isn’t yes to all questions
If any studies remain, synthesize them into a summary answer that explains the observed evidence
If not, relax my question and go back to step 2 (the search for candidate papers)

Of course, this is far from a great process. For a slightly better example, see this systematization of Scott Alexander’s post on Indian Economic Reform.

To build a process-based system, the fundamental problem to solve is to reduce the long-horizon tasks we care about to independently meaningful short-horizon tasks (factored cognition). If we can do that, we can then generate human (or human-like) demonstrations and feedback for these sub-tasks.

This reduction to subtasks can be done by the system designer, or for better scalability on-the-fly by the system itself. Task decomposition is another subtask, after all.

In between process and outcomes

Many tasks can be approached in both ways, and in practice, most systems will likely end up somewhere in between. Examples:

Search engine:

Outcome-based: Embed documents and metadata in a vector space, same for queries. Use a neural net retriever. Optimize retriever parameters and embeddings for users giving high ratings to the retrieved documents.
Process-based: Define an idealized process for evaluating the quality of a search result for a given query, e.g. decomposing the evaluation of the question “Is this result trustworthy?” into PageRank-style considerations, questions like “Is the author an expert in this field?”, etc. Distill each subquestion module into a neural net so that we can execute it at runtime.
In between: Start with the process-based approach, but make a few choices by fitting user scores, e.g. fitting parameters in a tiny MLP that mixes feature weights.

Question-answering:

Outcome-based: Train a neural net to map questions to answers, perhaps using Retro-style end-to-end-optimized retrieval.
Process-based: Independently train neural nets that map questions to web search queries, query responses to relevant extracts, and long answers to summary answers, each trained on human demonstrations or feedback.
In between: Follow the process-based approach, but don’t imitate human queries; instead, just learn query strategies that lead to highly rated final answers end-to-end. (This is WebGPT.)

Business decision advisor:

Outcome-based: Train a MuZero-style neural net on making decisions about trades, product launches, and hiring decisions based on business returns, or other long-term metrics of interest, optimizing for actions that look good in hindsight.
Process-based: Imitate actions that look good to a human supervisor in foresight, giving the human supervisor AI tools to do a thorough ex-ante evaluation.
In between: Imitate human actions chosen ex-ante, but use a black-box predictor for long-term metrics to choose between several actions that all look similarly good in foresight.

Eric Drexler’s CAIS paints a picture of AI that is also somewhere between process and outcomes in that AI services have clearly defined roles on a larger scale, but are individually outcome-based.

It’s better to supervise process than outcomes

Why prefer supervision of process? If we don’t need to look at outcomes, then:

We can do well at long-horizon tasks where outcomes aren’t available (better differential capabilities)
We don’t run the risk of our outcome measures being gamed (better alignment)

Differential capabilities: Supervising process helps with long-horizon tasks

We’d like to use AI to advance our collective capability at long-horizon tasks like:

Multi-year and multi-decade forecasting, e.g., predicting long-term consequences of vaccines
Policy and governance, especially AI policy
Personal and institutional planning and decision-making
AI alignment research

Unfortunately, gathering outcome data is somewhere between expensive and impossible for these tasks. It’s much easier to gather data and exceed human capability at short-horizon tasks:

Keeping people engaged as they interact with videos and posts
Developing physical technologies, e.g., new toxic molecules
Persuading people in short conversations
Predicting 30-minute consequences, not 30-year consequences

In a world where AI capabilities scale rapidly, we need AI to support research and reasoning that is likely to make AI go better. This includes guiding AI development and policy. More broadly, AI should help us figure out what’s true and make plans as much as it helps us persuade and optimize goals with fast feedback loops and easy specifications.

If we can reliably reduce such long-horizon tasks to short-horizon tasks, we’ll be better positioned to deal with the incremental development and deployment of advanced AI.

Alignment: Supervising process is safety by construction

With outcome-based systems, we’ll eventually have AI that is incentivized to game the outcome evaluations. This could lead to catastrophes through AI takeover. (Perhaps obvious to most readers, but seems worth making explicit: A big reason we care about alignment is that we think that, from our current vantage point, the world could look pretty crazy in a few decades.)

What is the endgame for outcome-based systems? Because we can’t specify long-term objectives like “don’t cause side-effects we wouldn’t like if we understood them”, we’re using proxy objectives that don’t fully distinguish “things seem good” from “things are good”. As ML systems get smarter, eventually all of the optimization effort in the world is aimed at causing high evaluations on these proxies. If it’s easier to make evaluations high by compromising sensors, corrupting institutions, or taking any other bad actions, this will eventually happen.

Suppose instead that we understood the role of each component, and that each component was constructed based on arguments that it will fulfill that role well; or it was constructed and understood by something whose behavior we understood and constructed to fulfill its role. In that case, we may be able to avoid this failure mode.

This is closely related to interpretability and reducing risks from inner alignment failures:

If we can limit the amount of black-box compute, and the amount of uninterpretable intermediate state, we’re in a better position to know what each model component is doing. We view this type of progress as complementary with Chris Olah’s work on interpretability and ELK-style proposals for learning what models know. The better we are at decomposition, the less weight rests on these alternatives.
Inner alignment failures are most likely in cases where models don’t just know a few facts we don’t but can hide extensive knowledge from us, akin to developing new branches of science that we can’t follow. With limited compute and limited neural memory, the risk is lower. Advancing process-based systems is helpful on the margin, even if we can’t fully eliminate outcome-based optimization.

In the long run, differential capabilities and alignment converge

Today, differential capabilities and alignment look different. Differential capabilities are starting to matter now. Alignment is a much less prominent issue because we don’t yet have AI systems that are good at gaming our metrics.

In the crazy future, when automated systems are much more capable and make most decisions in the world, differential capabilities and alignment are two sides of the same coin:

We either can’t use AI for most tasks we care about if all we know is how to design outcome-based architectures (lack of capabilities), or
We have highly effective systems optimizing for flawed objectives, which can lead to catastrophic outcomes (misalignment)

People sometimes ask: Is Ought working on differential capabilities (making ML useful for supporting reasoning) or alignment (avoiding risks from advanced AI)? From the perspective of intervening by advancing process-based systems, these two causes are fundamentally tied together.

Two attractors: The race between process- and outcome-based systems

Outcome-based optimization is an attractor

In some sense, you could almost always do better through end-to-end training, at least according to any one metric. You start with a meaningful task decomposition, track a global metric, and then backpropagate to make the system better along that metric. This messes with the meaning of the components and soon, they can’t be interpreted in isolation anymore.

We expect that, at some point, there will be strong pressure to optimize the components of most digital systems we’re using for global metrics. The better we are at building process-based systems, the less pressure there will be.

Process-based optimization could be an attractor, too

The good crazy future is one with an ecosystem of AIs made out of components with roles that are in principle human-understandable, with each component optimized based on how well it accomplishes its local role.

Advanced process-based systems could self-regulate to remain process-based, which makes them a local attractor:

Whenever an action is chosen within the process-based system, it comes from an action suggester along with reasoning for why it’s good for the system to implement this action.
This suggester could propose to make local changes, like changing some weights, just because empirically they’ll improve the quality of overall results along some metric, even if it makes the system less modular and interpretable.
This proposal and the reasoning for it would then get evaluated by another part of the system that looks for errors and catches and fixes them before they matter.
This evaluator would evaluate the costs and benefits of implementing the proposal and reject it because it would not maintain the invariant that each component has a clear role that makes sense independent of the global objective.
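
The self-regulation loop described above can be illustrated with a toy sketch. Everything in it is hypothetical and introduced here only for illustration: the `Proposal` fields, the boolean `preserves_component_roles` flag, and the acceptance rule in `evaluate`. In a real system, judging whether a change preserves each component's local role would itself be a hard problem.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Proposal:
    """An action suggested within the system, together with its justification."""
    description: str
    reasoning: str
    expected_metric_gain: float      # hypothetical estimate of the global metric improvement
    preserves_component_roles: bool  # hypothetical flag: does every component keep a clear local role?


def evaluate(proposal: Proposal) -> bool:
    """Accept a proposal only if it helps AND keeps the modularity invariant.

    A large expected metric gain never overrides the invariant.
    """
    return proposal.preserves_component_roles and proposal.expected_metric_gain > 0


def accepted_proposals(proposals: List[Proposal]) -> List[Proposal]:
    """Filter the suggester's proposals through the evaluator."""
    return [p for p in proposals if evaluate(p)]


local_fix = Proposal("Improve the summarizer's instructions",
                     "Makes the summarizer better at its own role", 0.02, True)
global_tune = Proposal("Tune weights against the global metric",
                       "Empirically raises the overall score", 0.10, False)

for p in accepted_proposals([local_fix, global_tune]):
    print("accepted:", p.description)
```

In this sketch the second proposal is rejected even though it promises the larger gain, which is the behavior the story above relies on.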

This story makes the basin of attraction around process-based systems look much narrower than the basin around outcomes: it only applies to individual systems, and it assumes that there is a fairly bright line between components that have a clear individual role and those that don’t.

The state of the race

Today, process-based systems are ahead: Most systems in the world don’t use much machine learning, and to the extent that they do, it’s for small, independently meaningful, fairly interpretable steps like predictive search, ranking, or recommendation as part of much larger systems.

However, the history of machine learning is the bitter lesson of outcomes winning. Vision and NLP started with more structured systems, which were replaced with end-to-end systems. In these areas, the structured systems are much worse, and we don’t know how to make them competitive on standard benchmarks. DeepMind and OpenAI have better infrastructure for running RL on outcome-based metrics than for collecting process-based feedback. They tend towards a “research aesthetic” that favors outcome-based approaches even in cases where they work worse.

Overall, it’s up in the air which tasks will be solved in which way. Some parts of the AI community are leaning toward process, others toward outcomes. If we see impressive results from process-based feedback, institutional knowledge and research tastes may shift toward process-based systems. Future norms and laws, perhaps similar to existing algorithmic transparency laws, might strengthen this position.

We don’t need process-based systems to be a perfect attractor. If most systems are largely process-based around the time of transformative AI, with small amounts of outcome-based optimization, we’re likely in good shape.

Conclusion

If we run into trouble with early advanced AI systems, it will likely be clear that supervision of process would be better than supervision of outcomes. At that point, the question is whether we’re good enough at process-based systems that they’re a realistic option. If so, then for the most important and high-stakes use cases, people will likely switch. This requires that we develop the relevant know-how now.

Beyond AI, we view understanding how to build systems and institutions that make correct decisions even when outcomes aren’t available as part of a broader agenda of advancing reason and wisdom in the world. Making mistakes about the long-term consequences of our short-term decisions is one way we fall short of our potential. Making wise decisions in cases where we can’t easily learn from our failures is likely key to living up to it.

Appendix

The crazy future

What "crazy" means:

AI systems are doing most economically valuable tasks in the world. They’re developing, producing, and shipping new products. They’re writing code, running datacenters, and developing new technologies. They’re influencing policy to some extent.
An increasingly large part of the world economy is AI development, more than shows up explicitly because all fields depend on AI now. The AI industry is worth many trillions of dollars.
As more of the world economy depends on AI, the value of further improvements to AI increases. It is hard to scale up human researchers and programmers working on AI. Automation of AI research is one of the most important application areas of AI—rolling out AI in new domains, making existing applications better, improving hardware, software, and data centers.
Much of this activity happens without humans in the loop. It’s a complex economy of AI systems.

This transition to an AI-run economy could be centralized in one or a few firms, or involve many firms, each specializing in different roles. It could take two decades, or five, and the path there could be more or less continuous. Either way, we think it's likely that the world within our lifetime will look very different from today’s world in ways that will be obvious to everyone.

Comments

For comments, see the version of this post on LessWrong.

Acknowledgments

Thanks to Paul Christiano and Jon Uesato for relevant discussions, and Jon Uesato, Owain Evans, James Brady, Ben Rachbach, and Luke Stebbing for feedback on a draft.

Building Elicit, the AI research assistant

Andreas Stuhlmüller — Tue, 22 Mar 2022 00:00:00 GMT

Over the last year we've been heads down building Elicit, the AI research assistant. In this post we briefly review where Elicit is at now, our plans, and the case for Elicit. For future updates, follow the Elicit mailing list.

What Elicit looks like

Our plans for Elicit

Our goal is to automate and scale open-ended reasoning with language models—synthesizing evidence and arguments, designing research plans, and evaluating interventions.

We’re starting with automating literature reviews because:

There is a rich discipline around synthesizing literature.
Understanding the status quo is necessary to expand the frontier.
Researchers most want help with literature review.

Today, Elicit users find academic papers, ask questions about them, and summarize their findings.

After literature review, we’ll expand to other research tasks (evaluating project directions, decomposing research questions, augmented reading), then beyond research (supporting organizational planning, individual decision-making).

Recent work

From the Elicit mailing list archive, our updates for the last month:

The case for Elicit

Robust, well-reasoned research is the bottleneck for many impactful interventions and decisions. Language models can address this bottleneck by reading and evaluating more research, evidence, and reasoning steps than humanly possible.

Just as programming languages provide building blocks for exact computation, language models can provide the building blocks of cognitive work (e.g., search, extraction, classification, summarization). With Elicit we plan to study researchers, identify and build out these blocks, then surface them to users so that they can string them together and automate their cognitive workflows over time.

If we succeed, we will make researchers vastly more productive and accurate. We will also help non-experts apply good research and reasoning practices when discovering, consuming, and generating information.

Elicit's architecture is based on factored cognition, the composition of small, independently meaningful pieces of cognition. While we’re building this architecture in the context of a research assistant, we expect to learn how to make machine learning useful for open-ended questions more broadly. In the long run, this can avoid some alignment risks posed by end-to-end optimization.

First, end-to-end training doesn't work well for exceeding human capability at questions that don't have easily measurable outcomes, questions like "Does this plan have problematic long-term consequences?". If we want AI to be as helpful for such long-horizon tasks as it is for "Did this chat interaction persuade them to click 'buy'?", we need a paradigm that isn't based on end-to-end training.

Second, as AI becomes more powerful, AI systems trained end-to-end are incentivized to game their reward metrics. The compositional approach evaluates process instead of outcome, thus providing a more robust alternative.

Read more updates in the [Elicit mailing list archive](https://list.elicit.org/) and try [Elicit](https://elicit.org).

Automating reasoning about the future at Ought

Jungwon Byun and Andreas Stuhlmüller — Mon, 09 Nov 2020 00:00:00 GMT

Ought’s mission is to automate and scale open-ended reasoning. Since wrapping up factored evaluation experiments at the end of 2019, Ought has built Elicit to automate the open-ended reasoning involved in judgmental forecasting.

Today, Elicit helps forecasters build distributions, track beliefs over time, collaborate on forecasts, and get alerts when forecasts change. Over time, we hope Elicit will:

Support and absorb more of a forecaster’s thought process
Incrementally introduce automation into that process, and
Continuously incorporate the forecaster’s feedback to ensure that Elicit’s automated reasoning is aligned with how each person wants to think.

This blog post introduces Elicit and our focus on judgmental forecasting. It also reifies the vision we’re running towards and potential ways to get there.

Judgmental forecasting today

What is judgmental forecasting?

Judgmental forecasting refers to forecasts that rely heavily on human intuition or “qualitative” beliefs about the world. Forecasts on prediction platforms such as Good Judgment Open and Metaculus tend to be judgmental forecasts. Example questions include:

Judgmental forecasting distinguishes itself from statistical forecasting, which uses extrapolation methods like ARIMA. We need judgmental forecasting when we don’t have the right data required to train a model. This generally includes questions about low-frequency events (e.g. transformative technology, geopolitical events, or new business launches) and agent-based reasoning (e.g. business competitor behavior).

In an effort to communicate this fuzzy spectrum more concretely, we’ll share an imperfect visualization. We highlight in blue the types of reasoning we want to support first.

On the left, we have revenue forecasting at Google, where algorithms predict ad revenue in 30-second increments. We don’t plan on supporting this type of reasoning in the foreseeable future.

A bit to the right from that we have projects like Ajeya Cotra’s Draft report on transformative artificial intelligence timelines. Ajeya needed to gather data and model the trajectories of hardware prices, spending on computation, and algorithmic progress (among others), but she also used qualitative reasoning to decompose the question into compute requirements, compute availability, and so on. The experts she elicited predictions from did not build their own models, but made parameter estimates based on their prior research and expertise. We expect to be useful for parts of such projects.

Further to the right, we have Alex Beal’s decomposition of whether Roe v. Wade will get overturned conditional on President Trump nominating a new Supreme Court Justice. Alex uses probabilistic reasoning here, but his overall decomposition is world-model based. His probabilities are not extrapolated from data but from his beliefs about Trump and the United States Supreme Court. We plan to be the most useful tool for this type of reasoning.

[![Roe vs Wade decomposition](/images/blog/2020-11-09-forecasting/roe-wade.png "Roe vs Wade decomposition")](/images/blog/2020-11-09-forecasting/roe-wade.png)

Reality is not as linear as the visualization above suggests: the different examples of reasoning are not strictly more qualitative going right. Nor are types of reasoning as discrete as the graphic suggests. In practice, people often use both quantitative and qualitative reasoning. Regardless, we hope this clarifies the types of reasoning Ought focuses on.

Why should we automate judgmental forecasting?

Forecasting underpins almost all decision-making. Often, a decision is a pair of conditional forecasting questions in disguise: “Should we spend $10 million on ads this month?” breaks down into “How much revenue will we make if we spend $10 million on ads this month?” and “How much revenue will we make if we don’t buy ads?” Organizations use pairs of conditional forecasts like these to isolate the marginal impact of ads on revenue and decide whether that’s worthwhile.
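
As a rough illustration of this decomposition, the sketch below compares two conditional point forecasts and checks whether their difference exceeds the cost of acting. The function names and all dollar figures are hypothetical, and comparing marginal revenue directly to spend is a simplifying assumption; a real decision would work with full distributions rather than point estimates.

```python
def marginal_impact(outcome_if_act: float, outcome_if_not: float) -> float:
    """Difference between the two conditional forecasts."""
    return outcome_if_act - outcome_if_not


def is_worthwhile(outcome_if_act: float, outcome_if_not: float, cost: float) -> bool:
    """Simplified decision rule (an assumption): act if the marginal impact exceeds the cost."""
    return marginal_impact(outcome_if_act, outcome_if_not) > cost


# Hypothetical point forecasts, in millions of dollars.
revenue_with_ads = 45.0     # "How much revenue will we make if we spend $10 million on ads?"
revenue_without_ads = 30.0  # "How much revenue will we make if we don't buy ads?"
ad_spend = 10.0

print(marginal_impact(revenue_with_ads, revenue_without_ads))          # 15.0
print(is_worthwhile(revenue_with_ads, revenue_without_ads, ad_spend))  # True
```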

The importance of complex human reasoning couldn’t be more obvious today: the coronavirus pandemic has changed our society so dramatically that we can’t rely on past data to predict the near future. Government task forces found at times that all of the covid-19 prediction models were wrong, and had to resort to averaging them. Chief Financial Officers at public companies like Autodesk similarly found that “the current state renders all previous models useless.” Chief Marketing Officers at nimble startups like Nurx feel like throwing their computers out the window when trying to forecast demand.

In these unprecedented times, human judgment can step in to help counties like El Paso, Texas, predict covid infection peaks more accurately than purely quantitative models.

Beyond global pandemics, judgmental forecasting can help the intelligence community anticipate elections, disease outbreaks and geopolitical dynamics. In conjunction with more quantitative modeling, it helps non-profits like Rethink Priorities estimate how much their donors will give next year. It’s necessary for organizations like the Long Now Foundation or Open Philanthropy, who want to prepare for the long-term future.

With Elicit, Ought aims to scale up the reasoning that happens in judgmental forecasting. We want to make it incredibly easy to produce good forecasts, enabling a wider range of people, companies, and teams to forecast things they don’t even imagine to be forecasting questions today. Serious forecasting should not be limited to those who can afford to pay trained forecasters. Eventually, “Will this project launch on time?” will feel as easy as figuring out the weather next week.

Human work doesn’t get us to that scale. We need to build a system that trains machines to think in the way we would if we knew more, were wiser, and had more time. Yet, we don't currently have compelling proposals for how to train machine learning systems to help people answer hard qualitative questions. Language models are usually trained with imitation learning, which probably won’t scale to significantly surpass human abilities. Reinforcement learning requires fast feedback loops and will be hard to apply to long-term forecasts that don't have this sort of feedback. To exceed human performance, we'll likely need to combine imitation learning or reinforcement learning with yet unproven approaches such as factored cognition, factored evaluation, or debate. So, in addition to being valuable in its own right, automating judgmental forecasting is a proving ground for aligned delegation of thinking more generally.

Judgmental forecasting automated

Where are we going?

Elicit is a tool for judgmental forecasting. People use Elicit to build, save, and collaborate on predictions. They also use Elicit to get alerts when prediction markets change their minds about the future.

Today, users do most of the work and Elicit automates parts of the forecasting workflow. Over time, Elicit will not just automate workflow, but increasingly support the reasoning that goes into forecasts. It will do things like point out inconsistent beliefs, suggest additional considerations, and guess at what the user really meant by their question.

This aspirational demo illustrates our current long-term vision:

In our vision, Elicit learns by imitating the thoughts and reasoning steps users share in the tool. It also gets direct feedback from users on its suggestions. Elicit progressively guesses more complex parts of the thought process, until it ends up suggesting entire decompositions, models, or explanations. As Elicit’s work gets more sophisticated, users can still dig into subcomponents of Elicit’s reasoning to evaluate parts even when they can’t evaluate the entire process end-to-end.

Elicit starts with humans doing most of the work and ends with machines doing most of the work. In the end state, the user primarily provides oversight and feedback to an AI system reasoning about the future. Having evolved with the forecaster and their constant feedback, Elicit ends up as a bespoke thought partner to each individual.

How do we get there?

As we showed in our earlier graphic, we want to support two types of reasoning in Elicit:

Qualitative reasoning. The forecaster decomposes the question, structures a model, thinks about the causal relationships in the world, or potential outcomes of an event.
Quantitative reasoning. The forecaster estimates numbers or the probabilistic implications of their qualitative beliefs (they specify likelihoods, distributions, etc.).

The table below shows how Elicit currently supports these two types of reasoning, and how it plans on incrementally automating them going forward.

| Qualitative reasoning | Elicit today | Elicit tomorrow |
| --- | --- | --- |
| Let users store and share notes about forecasts | • | • |
| Suggest relevant factors or subquestions influencing a forecast | | • |
| Suggest related existing questions or benchmarks | | • |
| Suggest entire decompositions of a question | | • |
| Quantitative reasoning | | |
| Associate probabilities with qualitative beliefs and notes | • | • |
| Design complex distribution shapes | • | • |
| Validate beliefs with visualization | • | • |
| Show new beliefs implied by the user’s stated beliefs | • | • |
| Express beliefs in natural language | | • |
| Estimate prior distributions | | • |

Elicit today

Today, Elicit supports qualitative reasoning by letting users add both free-form notes and notes associated with intervals and percentiles. Users can break down a question into smaller components to establish more direct links between beliefs and overall predictions.

Users can track all versions of their predictions on a question. With this history, forecasters get more granular lessons each time a question resolves. Decomposing a prediction into bins, probabilities, notes, and versions isolates areas for future improvement. When they go back to reflect, users learn not just whether their prediction was right or wrong, but more directly whether they missed an important consideration or just overestimated another factor’s influence, for example.

With this same functionality, users can poll other people to get their feedback and notes on a question, as demonstrated by this AI timelines thread and this AI timelines model.

Once forecasters have organized their thoughts, Elicit makes it easy to express them quantitatively as a probability density function. Users can specify percentiles or bins and corresponding probabilities. Both are more accessible than coding a distribution in Python or identifying whether a distribution is lognormal, what the variance is, etc. Elicit is particularly useful for abnormally shaped distributions like this one on Ebola deaths before 2021, truncated at the number of deaths to date, and this multimodal distribution on SpaceX’s value in 2030.

Users don’t have to specify every part of the distribution, like they would if they were building a histogram in spreadsheets. They also don’t need to keep track of whether bins add up to 100%. Elicit will accept the messiness of overlapping bins and inconsistent beliefs.

With these features, Elicit facilitates a three-way conversation among the bins, plot, and Elicit-calculated implied beliefs. As shown in the tutorials above, users enter their probabilities and double-check with the plots and Elicit-provided implied beliefs. They then adjust their bins accordingly. Sometimes users have stronger intuitions about specific probabilities and ranges. Other times, they have stronger intuitions about the overall shape of the curve.
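
To make the idea of implied beliefs concrete, here is a minimal sketch of one way stated percentiles could be turned into a distribution and then queried for the probability of a new interval. This is not a description of how Elicit computes its distributions: the linear interpolation between percentiles, the function names, and the example numbers are all assumptions made for illustration.

```python
from bisect import bisect_right
from typing import Callable, List, Tuple


def make_cdf(percentiles: List[Tuple[float, float]]) -> Callable[[float], float]:
    """Build a piecewise-linear CDF from (value, cumulative probability) pairs.

    The pairs must start at probability 0.0 and end at probability 1.0.
    """
    points = sorted(percentiles)
    xs = [x for x, _ in points]
    ps = [p for _, p in points]
    if ps[0] != 0.0 or ps[-1] != 1.0:
        raise ValueError("percentiles must span probabilities 0.0 to 1.0")
    if any(later < earlier for earlier, later in zip(ps, ps[1:])):
        raise ValueError("cumulative probabilities must not decrease")

    def cdf(x: float) -> float:
        if x <= xs[0]:
            return 0.0
        if x >= xs[-1]:
            return 1.0
        i = bisect_right(xs, x) - 1  # index of the segment containing x
        x0, x1, p0, p1 = xs[i], xs[i + 1], ps[i], ps[i + 1]
        return p0 + (p1 - p0) * (x - x0) / (x1 - x0)

    return cdf


def implied_probability(cdf: Callable[[float], float], low: float, high: float) -> float:
    """Probability mass the stated beliefs imply for the interval [low, high]."""
    return cdf(high) - cdf(low)


# Hypothetical stated beliefs: 25th, 50th, and 75th percentiles plus hard bounds.
beliefs = [(0, 0.0), (10, 0.25), (20, 0.5), (40, 0.75), (100, 1.0)]
cdf = make_cdf(beliefs)
print(implied_probability(cdf, 15, 30))  # 0.25
```

A forecaster who finds the implied 25% for the 15 to 30 range surprising would go back and adjust the stated percentiles, which is the kind of back-and-forth described above.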

Elicit tomorrow

Elicit today helps forecasters make their thinking explicit. Most of the value comes from giving people a place to organize, store, and share the thoughts they’ve generated on their own. Over time, Elicit will generate more of the thoughts, letting the forecaster play evaluator.

For example, Elicit can integrate language models to operationalize the fuzzy questions forecasters care about into the concrete questions they can measure and predict. It can also use language models to find base rates or datasets to expedite the research process, the most time-consuming part of the forecasting workflow.

We can already extract the resolution criteria and data sources from the lengthy text descriptions of Metaculus questions. We’re not far away from being able to extract relevant information from longer papers and publications.

With semantic search, Elicit will help people find relevant forecasting questions that already exist across all forecasting platforms. Better search reduces duplicate work and helps forecasters incorporate background research or existing predictions into any new question they are working on.

Eventually, we hope language models can suggest the complete list of factors - the entire decomposition - for users to review and accept / reject.

On the quantitative reasoning side, we plan to use language models to convert natural language statements into precise distributions. The ideal probability input format varies a lot for each user-question pair. Some people want to express their beliefs using bins. Percentiles are easiest for date questions. Sometimes drawing or visually adjusting curves works best. In other cases, users prefer to specify parameters such as function family, mean, and variance.

To accommodate these varied preferences, we can train a language model to convert any text-based input into a distribution and make a suggestion that the user can approve or reject. We eventually want to learn what a vague statement like “Most likely above 50” means for each user and in each context. We then want to automatically generate for them the right prior that the user can evaluate.

Conclusion

Ought’s mission is to automate and scale open-ended reasoning. We want to make good reasoning abundant. To attain that scale, we need automation and machine learning - human work is too expensive.

Today, machine learning works best when we can gather a large amount of task-relevant data. The most impressive examples involve imitation learning on static large-scale datasets (GPT-3) and reinforcement learning in situations with fast feedback (AlphaGo). We don't yet know how to exceed human capability at judgmental forecasting and in other situations that require qualitative reasoning, have limited data, and face slow feedback loops. With Elicit, we aim to make machine learning as useful for qualitative forecasts made with limited data as it is for data-rich situations today.

In the beginning, people do most of the work and thinking in Elicit; Elicit provides simple workflow automation. At this early stage, we’re studying what good reasoning looks like and how we can automate or support it. Elicit then starts to guess at increasingly complex parts of the forecaster’s thought process, suggesting subquestions, factors, scenarios, related questions, datasets, etc. Users provide ongoing feedback; Elicit evolves with and around each user.

By automating the reasoning and adding value especially in contexts with limited data, we make high-quality thinking and forecasts available even for questions that might only happen once to one person. If we succeed, answering questions like “When will my daughter’s passport arrive?” and “When will this software project finish?” will be as easy as looking up the weather for next week.

If you’re excited about building tools for thinking about the future, there’s plenty of work to do.

Ought Raises $3.8 Million

Jungwon Byun — Fri, 14 Feb 2020 00:00:00 GMT

We’re excited to announce new donations totaling $3.8MM from the following donors:

$2,593,333 from Open Philanthropy ($260,000 of this is made possible by a funding partnership with Longview Philanthropy and Ben Delo)
$900,000 from Paul Christiano
$150,000 from Jaan Tallinn
$100,000 from the Survival and Flourishing Fund
$40,559 total from Nisan Stiennon, Jalex Stark, Girish Sastry, Peter McCluskey, Yuta Tamberg, Community Foundation for San Benito County Calhoun/Christiano Family Fund, Raphael Franke, and others (additional notes here and here)

We are grateful to be working with these philanthropic partners. While each of our donors has their own values and beliefs, we find that as a group they take a long-term and “hits-based” approach to giving. They consider where society will be decades from now, the opportunities and risks we will face along the way, and what they can do now to make the long-term future better. This in turn allows us to focus on what we think is right from a long-term perspective. Our donors recognize that Ought is an unusual research-product hybrid organization. They’re not afraid to engage with the knotty details of our research to collaborate on ways to maximize impact in the long term while measuring signs of progress in the short term.

Our partners’ support allows us to build towards a future world where individuals and organizations arrive at better answers to life's messy but important questions. There are many of these, like

How do I decide whether to buy a house, save for retirement, or pay off my student loans?
Should I pick a job that allows me to directly work on the problems I care about now, or one that will teach me skills to be more effective at solving that problem later?
How should we set our hiring plan for the next 3-5 years given sales forecasts, competition, and other macroeconomic trends?
How should we prioritize among this set of products or features we want to launch?
I really like bread. Is it really that bad to be eating so much bread?

At first glance, each of these questions seems unique. But zooming out, answering them depends on a common core of reasoning. When people answer these questions today, they often

Think about their values and preferences
Gather evidence
Compare and weigh the evidence
Think through plans by comparing different paths or outcomes, making forecasts, or considering the likelihood of different scenarios

As AI and machine learning advance, we want to delegate parts of these processes to machines, especially the parts that machines do better, such as searching over a hundred thousand permutations of paths, gathering evidence from every single page on the Internet, and making forecasts based on gigabytes of data.

To successfully delegate some of this thinking to machines, we need to design systems that help us evaluate work that is too complex for us to evaluate directly. These systems also need to scale. They need to flexibly absorb diverse inputs and productively convert computation into better thinking and decision-making. AI has the potential to make answering these questions 100x or 1000x better.

But there's no guarantee that we’re headed towards this world. There are many pressures to build AI systems that optimize for quick, plentiful reward signals like retweets or video views - signals that look appealing instead of actually being good. It’s more challenging to think carefully about good reasoning and how it can help people figure out the answers they'd give if they had more time to think.

So we and our donors are taking the long-term view and building towards this world. Today, we're working on this problem in small-scale settings, mostly with human experts, but we're developing mechanisms designed to hold up even for very large-scale machine reasoning. If this is the kind of world you want to build too, get in touch.

Evaluating Arguments One Step at a Time

Ought — Sat, 11 Jan 2020 00:00:00 GMT

We’re studying factored cognition: under what conditions can a group of people accomplish complex cognitive tasks if each person only has minimal context?

In a recent experiment, we focused on dividing up the task of evaluating arguments. We created short, structured arguments for claims about movie reviews. We then tried to distinguish valid from invalid arguments by showing each participant only one step of the argument, not the review or the other steps.

In this experiment, we found that:

Factored evaluation of arguments can distinguish some valid from invalid arguments by identifying implausible steps in arguments for false claims.
However, experiment participants disagreed a lot about whether steps were valid or invalid. This method is therefore brittle in its current form, even for arguments which only have 1–5 steps.
More diverse argument and evidence types (besides direct quotes from the text), larger trees, and different participant guidelines should improve results.

In this technical progress update, we describe these findings in depth.

Methods

Representing arguments as claim trees

In each trial of our experiment, we first sample a random Roger Ebert movie review. An expert is instructed to read the entire review and then generate a root claim about the review that is either at least 90% likely to be true or at least 90% likely to be false. For example:

The film takes a progressive stance on gender relations.

The expert then builds a claim tree of evidence that contains:

Subclaims that support the root claim.
Quotes from the text supporting each subclaim.

A different expert also reads the text and contributes rebuttals, quotes from the text that are intended to undermine each step of the tree. These rebuttal quotes might show that:

The root claim is not supported in the text.
The subclaims are not supported in the text.
The quotes supporting the subclaim are taken out of context.

Each claim tree has a depth (see Figure 1):

A depth-0 claim tree is just the root claim, nothing else.
A depth-1 claim tree has the root claim, quotes supporting the root claim, and rebuttal quotes refuting the root claim.
A depth-2 claim tree has the root claim, subclaims, quotes supporting each of the subclaims, and rebuttal quotes.
In principle, we can generate arbitrarily large claim trees using nested subclaims and rebutting claims. In this experiment, we use at most depth 2.

[![Figure 1](/images/blog/2020-01-11-arguments/figure-1-trees.png "Figure 1")](/images/blog/2020-01-11-arguments/figure-1-trees.png)

Figure 1: Structure of the claim trees used in this experiment.

[![Figure 2](/images/blog/2020-01-11-arguments/figure-2-example.png "Figure 2")](/images/blog/2020-01-11-arguments/figure-2-example.png)

Figure 2: Claim tree example. More [here](https://github.com/oughtinc/data/blob/master/arguments-2020-01/data-readable.md).
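
The claim tree structure and its depth levels can be sketched as a small data structure. The class and field names below are hypothetical and are not taken from the experiment's code or data format, and the example strings are placeholders rather than real review quotes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subclaim:
    """A subclaim with the quotes supporting it and the rebuttal quotes against it."""
    text: str
    quotes: List[str] = field(default_factory=list)
    rebuttals: List[str] = field(default_factory=list)


@dataclass
class ClaimTree:
    """A root claim with optional direct evidence (depth 1) or subclaims (depth 2)."""
    root_claim: str
    quotes: List[str] = field(default_factory=list)      # quotes supporting the root claim directly
    rebuttals: List[str] = field(default_factory=list)   # rebuttal quotes against the root claim
    subclaims: List[Subclaim] = field(default_factory=list)

    def depth(self) -> int:
        """Return 0 for a bare root claim, 1 for root plus quotes, 2 for root plus subclaims."""
        if self.subclaims:
            return 2
        if self.quotes or self.rebuttals:
            return 1
        return 0


tree = ClaimTree(
    root_claim="The film takes a progressive stance on gender relations.",
    subclaims=[
        Subclaim(
            text="(placeholder subclaim)",
            quotes=["(placeholder supporting quote)"],
            rebuttals=["(placeholder rebuttal quote)"],
        )
    ],
)
print(tree.depth())  # 2
```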

Factored evaluation of arguments

To test whether we can correctly evaluate if a claim about a text is true or false by aggregating the work of context-limited judges, we show each step of the claim tree to an experiment participant without showing them the other steps. We then ask how likely the step is to be valid by asking how likely the claim (or subclaim) is to be true given the subclaims (or quotes).

This distributed, stepwise evaluation is the main difference compared to Debate as described by Irving et al. (2018). In Debate, the judges usually look at the entire debate before they declare a winner.

Evaluating root claims given subclaims

Some participants saw the root claim, the subclaims, and the rebuttals. They had to answer “Assuming all of the subclaims are true, how likely is the root claim to be true?” with a credence from 0% to 100% likely to be true.

[![Figure 3a](/images/blog/2020-01-11-arguments/figure-3a-subclaim-claim.png "Figure 3a")](/images/blog/2020-01-11-arguments/figure-3a-subclaim-claim.png)

[![Figure 3b](/images/blog/2020-01-11-arguments/figure-3b-subclaim-claim.png "Figure 3b")](/images/blog/2020-01-11-arguments/figure-3b-subclaim-claim.png)

Figure 3: A judge evaluates the credibility of a root claim given subclaims. Position of argument step (top) and simplified screenshot (bottom).

Evaluating subclaims given quotes

Other participants saw one subclaim, the supporting quotes, and the rebuttals. They had to answer “Given these quotes, how likely is the subclaim to be true?” with a credence from 0% to 100% likely to be true.

[![Figure 4a](/images/blog/2020-01-11-arguments/figure-4a-quote-subclaim.png "Figure 4a")](/images/blog/2020-01-11-arguments/figure-4a-quote-subclaim.png)

[![Figure 4b](/images/blog/2020-01-11-arguments/figure-4b-quote-subclaim.png "Figure 4b")](/images/blog/2020-01-11-arguments/figure-4b-quote-subclaim.png)

Figure 4: A judge evaluates the credibility of a subclaim given quotes. Position of argument step (top) and simplified screenshot (bottom).

Claim tree evaluation as binary classification

We want to measure how well factored evaluation (FE) can distinguish true and false claims. The ground truth is provided by the experts who read the whole text. We use the following definitions:

1. Our ground truth is that a claim is true if an expert assigns it a >=90% probability of being true, and false if they assign it a >=90% probability of being false.
2. FE classifies a claim as true if judges evaluate all steps in its claim tree as likely valid, and false otherwise.

Note that (2) specifies a "weakest-link" semantics. If judges think a single step in the claim tree is likely invalid, then FE classifies the claim as false.
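The two definitions above can be sketched in a few lines. The function names are hypothetical. Treating credences strictly between 10% and 90% as having no ground-truth label is an assumption of this sketch, since the post does not say how such claims were handled.

```python
from typing import List, Optional


def ground_truth_label(expert_credence: float) -> Optional[bool]:
    """Map an expert's probability that a claim is true to a ground-truth label.

    Returns True at >= 0.9, False at <= 0.1 (at least 90% likely false),
    and None in between (assumption: no label for unclear claims).
    """
    if expert_credence >= 0.9:
        return True
    if expert_credence <= 0.1:
        return False
    return None


def fe_classifies_true(step_is_valid: List[bool]) -> bool:
    """Weakest-link rule: the claim is classified true only if every step is likely valid."""
    return all(step_is_valid)


print(ground_truth_label(0.95), ground_truth_label(0.05), ground_truth_label(0.5))
print(fe_classifies_true([True, True, True]), fe_classifies_true([True, False, True]))
```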

[![Figure 5](/images/blog/2020-01-11-arguments/figure-5-classification.png "Figure 5")](/images/blog/2020-01-11-arguments/figure-5-classification.png)

Figure 5: Examples of claim trees classified as true and false by FE. The green check mark indicates that judges evaluated the step as likely valid, and the orange cross as likely invalid.

Having defined how FE of arguments works as a binary classifier, we can use standard confusion matrix metrics to understand its performance. We highlight the two kinds of errors that are possible:

False positives: A false positive occurs when the expert thinks a root claim is false but FE classifies it as true.
False negatives: A false negative occurs when the expert judges a root claim to be true but FE classifies it as false.

The confusion matrix allows us to calculate:

Recall: What fraction of true claims are classified as true?
Precision: What fraction of claims classified as true are in fact true?
Accuracy: What fraction of all classifications (both true and false) were correct?

We care most about high precision, and only maximize recall subject to that; reducing false positives matters most to us.
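For reference, the three metrics follow directly from the four confusion matrix counts. The helper below is a minimal sketch with a hypothetical name. The usage example plugs in the depth-2 counts reported in Table 2 (10 TP, 14 FN, 1 FP, 24 TN).

```python
def confusion_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute precision, recall, and accuracy from confusion matrix counts."""
    predicted_true = tp + fp
    actually_true = tp + fn
    total = tp + fn + fp + tn
    return {
        # Fraction of claims classified as true that are in fact true.
        "precision": tp / predicted_true if predicted_true else float("nan"),
        # Fraction of true claims that are classified as true.
        "recall": tp / actually_true if actually_true else float("nan"),
        # Fraction of all classifications that are correct.
        "accuracy": (tp + tn) / total if total else float("nan"),
    }


metrics = confusion_metrics(tp=10, fn=14, fp=1, tn=24)
print({name: round(value, 2) for name, value in metrics.items()})
# {'precision': 0.91, 'recall': 0.42, 'accuracy': 0.69}
```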

Ensembling step judgments

For each step, participants estimated the likelihood that the (sub)claim was true given the evidence. At least 4 different participants evaluated each step.

To decide whether a claim tree is valid, we need to specify two parameters:

Ensembling percentile: take the n-th percentile judgment. For example, the 25th percentile of judgments [10%, 20%, 30%, 40%] is 17.5% with linear interpolation.
Threshold: if the ensembled judgment is above this threshold, call the step valid. A claim tree is valid if all steps are valid.
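The two parameters can be sketched as follows. This is an illustrative implementation, not the analysis code used for the experiment: the function names are hypothetical, and the percentile uses linear interpolation between closest ranks, which reproduces the 17.5% example above.

```python
from typing import List


def percentile(values: List[float], q: float) -> float:
    """q-th percentile (0 to 100) with linear interpolation between closest ranks."""
    if not values:
        raise ValueError("need at least one judgment")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)                        # rank is non-negative, so this is a floor
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def step_is_valid(judgments: List[float], pct: float, threshold: float) -> bool:
    """A step is valid if its ensembled judgment is above the threshold."""
    return percentile(judgments, pct) > threshold


def tree_is_valid(steps: List[List[float]], pct: float, threshold: float) -> bool:
    """A claim tree is valid only if all of its steps are valid."""
    return all(step_is_valid(judgments, pct, threshold) for judgments in steps)


print(percentile([10, 20, 30, 40], 25))  # 17.5
# Hypothetical two-step tree: the second step fails at the 25th percentile.
print(tree_is_valid([[60, 70, 80, 90], [10, 20, 30, 40]], pct=25, threshold=50))  # False
```

Choosing a low percentile makes the ensemble conservative: a step passes only if most judges, not just the average judge, find it likely valid.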

Data

We generated a dataset of 51 claim trees about 51 Roger Ebert movie reviews. Each claim tree had a depth-0, depth-1, and a depth-2 version. For 49 trees we have 4+ judgments for each step in the tree. We restrict the analysis below to these 49 trees.

We collected 2,722 judgments, split as follows:

	Depth 0	Depth 1	Depth 2
Trees	49	49	49
- Trees with true root claims	24	24	24
- Trees with false root claims	25	25	25
Steps	49	49	207
Mean steps per tree	1	1	4.2
Judgments	826	344	1552
Mean judgments per step	16.86	7.02	7.50

Table 1: Descriptive stats of the claim tree dataset.

Each judgment comes with a natural language explanation by the judge as shown above in Figures 3 and 4.

You can access the full dataset (in JSON) with all trees, evaluations, and participant explanations here. Trees in human-readable format are [here](https://github.com/oughtinc/data/blob/master/arguments-2020-01/data-readable.md).

Analysis

We present exploratory analysis to synthesize what we learned from this experiment. These results have methodological limitations that we discuss in the Appendix, but we hope that the analysis provides a framework for informing and evaluating future experiments.

Summary

Factored evaluation of arguments can distinguish between some valid and invalid arguments.
1. The depth-0 baseline was at chance in distinguishing true and false root claims.
2. Both depth-1 and depth-2 evaluations exceed this baseline across a range of parameters.
3. For false root claims, factored evaluation of depth-2 claim trees is more likely to identify at least one step in the tree that is unlikely to be valid.
However, high variance in judgments across participants leads to brittle performance. Performance is sensitive to the ensembling parameters.
By analyzing false positives and false negatives, we’ve identified specific problems that we can address.
1. To reduce false positives, we can let the rebuttals include explanations in addition to quotes and instruct claim tree creators to reduce the complexity of individual steps.
2. To reduce false negatives, we can improve quality control for root claims to ensure that they are indeed clearly true or false and increase tree depth to support arguments in cases where evidence is less direct.

Factored evaluation can distinguish some valid from invalid arguments

The depth-0 baseline is at chance

We confirmed that participants did not have strong prior beliefs about our root claims that could have influenced how they evaluated the tree. At depth 0, the median credence for both false and true claims was 50%. In a binary forced-choice task, participants guessed the truth of the claim correctly for 20 of the 51 trees.

Depth-1 and depth-2 evaluations exceed the depth-0 baseline

Compared to depth 0, depths 1 and 2 result in more accurate evaluations across a range of parameter settings, as shown by the fact that there are parameter settings with above-chance accuracy (light blue pixels below).

[![Figure 6](/images/blog/2020-01-11-arguments/figure-6-accuracy-by-depth.png "Figure 6")](/images/blog/2020-01-11-arguments/figure-6-accuracy-by-depth.png)

Figure 6: Accuracy as a function of the ensembling parameters. Depth 0 is at chance. For depth 1 and 2, there are parameter settings with moderate accuracy.

Previously, we said that we’re most interested in high precision (few false positives). If we optimize the threshold and percentile parameters independently for each depth to maximize recall subject to >80% precision, we get the following confusion matrices:

	Depth 0	Depth 1	Depth 2
Percentile	72	30	1
Threshold	59	69	23
Confusion matrix	`2 TP 23 FN` `0 FP 26 TN`	`9 TP 16 FN` `2 FP 24 TN`	`10 TP 14 FN` `1 FP 24 TN`
Precision	1.0	0.82	0.91
Recall	0.08	0.36	0.42
Accuracy	0.55	0.65	0.69

Table 2: Classification performance for each tree depth with posthoc fitting of ensembling parameters.
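The post-hoc fitting behind Table 2 amounts to a grid search over the two ensembling parameters. The sketch below shows one way to do this under stated assumptions: the function name, the integer grid (percentiles 1 to 99, thresholds 0 to 100), the tie-breaking rule, and the toy data are all hypothetical. It uses `statistics.quantiles` with the `"inclusive"` method for linearly interpolated percentiles, which needs at least two judgments per step.

```python
from statistics import quantiles
from typing import List, Optional, Tuple

# (judgments for each step of the tree, expert label for the root claim)
Tree = Tuple[List[List[float]], bool]


def fit_parameters(trees: List[Tree], min_precision: float = 0.8) -> Optional[dict]:
    """Find the (percentile, threshold) pair that maximizes recall subject to
    precision > min_precision. Ties are broken in favor of higher precision."""
    # Precompute the 1st..99th percentile of every step's judgments.
    cuts = [[quantiles(step, n=100, method="inclusive") for step in steps]
            for steps, _ in trees]
    labels = [label for _, label in trees]
    best = None
    for pct in range(1, 100):
        for threshold in range(0, 101):
            # Weakest-link rule: every step must exceed the threshold.
            predicted = [all(step[pct - 1] > threshold for step in tree) for tree in cuts]
            tp = sum(p and l for p, l in zip(predicted, labels))
            fp = sum(p and not l for p, l in zip(predicted, labels))
            fn = sum(l and not p for p, l in zip(predicted, labels))
            if tp + fp == 0 or tp + fn == 0:
                continue  # precision or recall undefined for this setting
            precision = tp / (tp + fp)
            recall = tp / (tp + fn)
            if precision <= min_precision:
                continue
            if best is None or (recall, precision) > (best["recall"], best["precision"]):
                best = {"percentile": pct, "threshold": threshold,
                        "precision": precision, "recall": recall}
    return best


# Toy data, invented for illustration only.
toy_trees = [
    ([[80, 85, 90, 95], [70, 75, 80, 90]], True),
    ([[60, 65, 70, 90], [55, 60, 80, 85]], True),
    ([[20, 60, 70, 80], [65, 70, 75, 80]], False),
    ([[10, 15, 30, 85], [50, 60, 70, 80]], False),
]
best = fit_parameters(toy_trees)
print(best)
```

Because the parameters are selected on the same data they are evaluated on, numbers obtained this way are optimistic and are best read as an upper bound for the given dataset.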

In our analysis below, we’ll focus on depth 2. Our results don’t show whether depths 1 and 2 differ significantly. Evaluating where depth-1 evaluations differ from evaluations at higher depths is among the most interesting pieces of follow-up work.

Factored evaluation identifies invalid steps in some claim trees for false claims

The figure below shows the ensembled judgments for each step of each depth-2 tree, separating trees with false root claims (left) from trees with true root claims (right).

Each row represents a claim tree. Each dot in a row represents one evaluation step in that tree, positioned at the 25th percentile of all judgments for that step. The leftmost dot is therefore the “weakest step” of the tree. We see that the judgments for false root claims skew further left than those for true claims. There is only one tree with an ensembled judgment below 25% for true root claims, whereas there are 9 such trees for false root claims.

[![Figure 7](/images/blog/2020-01-11-arguments/figure-7-weakest.png "Figure 7")](/images/blog/2020-01-11-arguments/figure-7-weakest.png)

Figure 7: Ensembled step judgments for each tree. There are more steps that judges consider very weak in trees with false root claims, but the difference isn't large.

High variance in judgments leads to brittle performance

Step judgments have high variance

We expected that ensembling would be crucial for smoothing out occasional errors, but learned that evaluations across participants vary widely. Below we show the participant-submitted credences for all 207 steps in the 49 depth-2 claim trees:

[![Figure 8](/images/blog/2020-01-11-arguments/figure-8-steps.png "Figure 8")](/images/blog/2020-01-11-arguments/figure-8-steps.png)

Figure 8: All individual judgments (light gray) for all steps of all depth-2 trees, both quotes to subclaim and subclaim to root claim. The green dots show the credence the expert assigned to the corresponding (sub-)claim.

Each row represents a step from our 49 depth-2 claim trees. The two charts are split out by true and false claims: on the left are the steps evaluating claims (both subclaims and root claims) that experts estimated to be less than 50% likely to be true. On the right are steps evaluating claims that experts estimated to be at least 50% likely to be true. The green dots indicate the credence that the expert assigned the claim (all the green dots in the left chart are below 50%, all the green dots on the right are above 50%).

This plot suggests a few things:

Judges' credences vary widely for each step; different judges disagree a lot on whether the same claim is likely to be true or false given the evidence.
Ensembling matters, as we discuss below. We’d have a higher false positive rate if we used the mean of the judgments (vertical black bar) as the overall evaluation for a step.
As we’d hope, participants give higher credences for claims that are true than for claims that are false.
Participants generally seem to think that claims are true. The means of most judgments exceed 50%, even for claims that experts generated as false.
Experts (claim generators) think that their false claims are more obviously false than the judges do (green dots on the left chart skew further left than the judgments).
Some true claims seem obviously true to everyone. Some true claims have very high mean judgments with low variance. We do not see the same for false claims.
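To see why the choice of ensembling statistic matters when judgments vary this widely, consider a single step with invented credences. The numbers below are hypothetical and are not drawn from the dataset. The 25th percentile is computed with `statistics.quantiles` using the `"inclusive"` method, which interpolates linearly.

```python
from statistics import mean, quantiles

# Hypothetical credences (in %) from five judges for one step of a false claim.
judgments = [15, 30, 85, 90, 95]
threshold = 50

mean_judgment = mean(judgments)
# First quartile cut point = 25th percentile.
p25_judgment = quantiles(judgments, n=4, method="inclusive")[0]

print(mean_judgment, mean_judgment > threshold)  # the mean passes the threshold
print(p25_judgment, p25_judgment > threshold)    # the 25th percentile does not
```

With the mean, three confident judges outvote two skeptical ones and the step is called valid. With a low percentile, the skeptical judges are enough to flag the step.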

Performance is sensitive to ensembling parameters

A limitation of our work is that the accuracy metrics and confusion matrix are highly sensitive to the ensembling parameters (judgment percentile, threshold for ensembled credence). For example, here are three settings and the corresponding metrics:

	Parameter setting 1	Parameter setting 2	Parameter setting 3
Percentile	1st	25th	50th
Threshold	29%	50%	75%
Precision	0.90	0.60	0.60
Recall	0.38	0.50	0.25
Accuracy	0.67	0.59	0.55

Table 3: Different ensembling parameters lead to different metrics.

If we visualize the space of all parameter settings, we see that high values of precision and accuracy (light blue pixels) are sparse:

[![Figure 9](/images/blog/2020-01-11-arguments/figure-9-sensitivity.png "Figure 9")](/images/blog/2020-01-11-arguments/figure-9-sensitivity.png)

Figure 9: Ensembling parameters that lead to high precision and accuracy are sparse.

Qualitative analysis of false positives and false negatives

For the sensitivity reasons presented above, summary statistics present an imperfect picture of the experiment. In this section, we dive deeper into the trees to qualitatively understand why incorrect evaluations occur and how we can reduce the presence of false positives and negatives in future experiment iterations. For the qualitative analysis below, we call a tree valid if the 25th percentile judgment for every step exceeds a threshold of 50%.

We find that:

False positives are primarily caused by
1. missing evidence that was difficult to highlight in a rebuttal quote
2. individual steps with high complexity, leading to judge mistakes
False negatives are primarily caused by
1. mistakes the experts made in choosing root claims
2. indirect evidence that is difficult to distill into a small claim tree

We propose ways to mitigate each of these causes, but we haven't implemented the mitigation strategies yet, so we can't be confident that they would work.

False positives

9 of the 49 claim trees we evaluated were false positives under the threshold set above. Factored evaluation returned that the root claim was true when in fact it was not. The table below summarizes the most common reasons for this incorrect evaluation, followed by discussion of the top two reasons.

Reason for false positive	# of trees	Trees	Ways to mitigate
Judges overlooked the absence of key evidence. Judges failed to notice that an aspect of the (sub)claim is not supported by evidence	8	26, 27, 28, 29, 30, 31, 32, 33	Allow rebuttal via claims in addition to quotes. Better judge instructions and training
Individual steps with high complexity	8	27, 28, 29, 30, 31, 32, 33, 34	More practice and feedback, reduce complexity and difficulty of individual steps (requiring larger trees)
Evidence for claim is ambiguous or uses figurative language, so it’s hard to rebut and judge	3	29, 30, 34	Allow rebuttal via claims in addition to quotes. Train participants on dealing with figurative language
Root claim not necessarily false	3	27, 29, 30	Quality control for root claim generation (multiple people review the root claim before building out a tree.)
Rebuttals poorly chosen	2	29, 34	Quality control for rebuttals

Table 4: Reasons for false positives. A tree is listed for a reason if the reason seemed to make a substantial contribution to the tree being a false positive. Trees may be listed for more than one reason.

Claims with ambiguous or figurative quotes are difficult to rebut and judge

Many claims had the following structure: The review text t contains a short quote q that provides strong evidence for the false claim S absent further context:

*P(S | q) > 0.6*

However, given access to t, the evidence from q is cancelled or explained away:

*P(S | q, t) < 0.1*

Yet it's not easy to extract a 200-character quote from t that would do the same cancelling. In some cases, there is a slightly longer quote (say 300 characters) that would be sufficient. In other cases, the relevant context from t is distributed throughout the review, which might total 5000 characters. The ambiguity in q sometimes resulted from figurative language or irony, where the literal interpretation was more plausible without context.

For example, consider the claim (simplified from source):

"the film features a group of high-school activists"

The supporting quotes are:

"activists are rich kids angry at their parents"
"becomes a movie about slumming brats"

The words "kids" and "brats" could either refer to high-school students or to 20-something adults who are immature (which is what the reviewer intends). Various statements in the review suggest the group are not high-school age (e.g. one is a college graduate) but it's hard to condense these statements into a short quote.

Individual steps are complex. Spotting the absence of key evidence is difficult

Factored cognition aims to break down cognitive work into small pieces, making each step of work easier to check and automate. Unfortunately, many steps in this experiment still required judges to do complex things like evaluating subtleties of the relationship between the evidence and the rebuttal, or weighing fine points of the phrasing or implications of a claim.

One source of complexity that confused judges on multiple trees was subclaims that seemed to justify a root claim but that actually contain an unjustified logical leap.

For example, consider the claim (simplified from source):

“The reviewer expresses that they are glad that the movie’s political statement was distributed globally despite the movie being banned in China”

Subclaims:

“The reviewer emphasizes the movie’s political statements about the interpretation of Chinese history”
“The reviewer emphasizes the movie’s criticism of Chinese Communism”
“The reviewer describes the movie’s politically charged reception”
“The reviewer expresses excitement that the movie is spreading despite being banned in China”

The subclaims appear to support the claim, but actually they include no evidence that the reviewer was specifically glad to see that the movie’s political statement was distributed globally, as opposed to other aspects of the film. And in fact, the aspect of the movie that the reviewer was excited to see disseminated was its artistic take on history, not its political statements. Judges overlooked the missing evidence, and it was impossible for the rebuttal to directly point out the flaw in the evidence because the rebutter could not express the flaw in direct quotes from the text.

False negatives

12 of the 49 claim trees we evaluated were false negatives. Factored evaluation returned that the root claim was false when in fact it was true. The table summarizes the most common reasons:

Reason for false negative	# of trees	Trees	Ways to mitigate
Evidence for root claim is ambiguous or indirect, so it’s hard to make a convincing small tree	6	13, 14, 19, 20, 21, 23	Increase tree depth
Truth of root claim is ambiguous	5	15, 20, 21, 22, 23	Quality control for root claim generation through more ensembling or review
Claim tree didn’t provide clear enough evidence for claim (claim tree creator mistake)	3	17, 18, 23	Quality control for tree construction through more ensembling or review
Ambiguous or poorly defined claim (claim tree creator mistake)	3	19, 20, 21	Quality control for tree construction through more ensembling or review
Overly specific intermediate claims (claim tree creator mistake)	3	13, 16, 19	Quality control for tree construction through more ensembling or review

Table 5: Reasons for false negatives. A tree is listed for a reason if the reason seemed to make a substantial contribution to the tree being a false negative. Trees may be listed for more than one reason.

Claim tree creators make mistakes

As described above, we evaluated root claims by checking the probability judges assigned to the weakest step in the claim tree. If any one of the steps evaluating whether quotes support a subclaim, or whether subclaims support the root claim, is invalid, then the root claim evaluates to false. For 8 of the 12 false negatives, mistakes made by the experts constructing claims, subclaims and quotes contributed substantially to the result. These could be mitigated by quality control measures such as allowing experts to get more feedback during the generation process.

Making convincing trees is difficult if evidence for the root claim is indirect

Some false negatives seem to result from fundamental limitations of factored evaluation with small trees. Suppose that a movie is artistically innovative or avant-garde. Instead of stating this explicitly, the review might spend two paragraphs describing the scenes that make it avant-garde. It might be hard to convey the overall effect of those two paragraphs in a few short quotes. Similarly, the reviewer might suggest to readers that a movie is good without stating it. Here's a paragraph from a review (tree data):

This is a film about the utter indifference and outright hostility that people encounter every day, and how essentially decent people like Ruth suffer and suffer through it, almost always silently, until they finally snap. The break-in is the culmination of a series of unfortunate encounters: she has to deal with an old racist at the nursing home where she works. She gets stuck in traffic and spies a jerk in a pickup truck at the head of the lane whose tailpipe spews inky smoke as he revs his engine. In a scene that will break the hearts of many regulars who read reviews, Ruth enjoys a drink at a neighborhood bar while reading a new book, only to have a plot twist casually spoiled by another customer that she initially mistakes for a nice guy.

No single sentence in this paragraph conveys much about the quality of the movie, but the paragraph as a whole is positive. We anticipate that increasing tree size will allow for discussions about such nuances and improve overall performance.

Conclusion

We ran an exploratory experiment in which a distributed group of participants evaluated tree-structured arguments that make claims about movie reviews. We started with shallow arguments that have 1–5 steps and measured success using common classification metrics (precision, recall, accuracy). We found that:

Factored evaluation of arguments can distinguish some valid from invalid arguments by identifying implausible steps in arguments for false claims.
Experiment participants disagreed a lot about whether claims were true or false. This method is therefore brittle in its current form, even for arguments which only have a few steps.
More diverse argument and evidence types (besides direct quotes from the text), larger trees, and different participant guidelines should improve results.

Over time, we’d like to show that accuracy improves as we increase the depth of claim trees, and that we can apply methods like this to much longer texts. A depth-5 tree should reliably discern the truth of a larger set of claims than a depth-2 tree, and we should be able to evaluate claims about entire collections of books, not just single-page reviews. Eventually, we want naive judges to spot-check complex arguments from domain experts even when the judges are entirely unfamiliar with the domain.

We’re excited that this experiment established foundations such as operationalizing success for experiments in factored evaluation and creating benchmarks for us and others to improve upon in future work.

Appendix

Acknowledgments

We’d like to thank many different people who contributed to the experiments and their presentation in this blog post.

The research was done by William Saunders, Ben Rachbach, Owain Evans, Jungwon Byun, and Andreas Stuhlmüller. Zachary Miller and Andrew Schreiber built the infrastructure.
Our experiment participants provided important data and feedback on the experiments. William K, Erol C A, Karin N, Henry A, Julian D, Vojtech B, Henrique D B, Liam D, and Eric H in particular contributed many hours to the experiments.
Feedback from Beth Barnes, Vishal Maini, and Milan Griffes helped make the blog post clearer.

This work was supported by many donors, including the Future of Life Institute (RFP2-178).

Citation

Please cite this blog post as:

Saunders et al. (2020). Evaluating Arguments One Step at a Time.

BibTeX citation:

@misc{ought2020arguments,
  author = {Saunders, William and Rachbach, Ben and Evans, Owain and Miller, Zachary and Byun, Jungwon and Stuhlmüller, Andreas},
  title = {Evaluating Arguments One Step at a Time},
  year = {2020},
  howpublished = {\url{https://ought.org/updates/2020-01-11-arguments}},
  note = {Accessed 11-January-2020}
}

Methodological flaws and room for improvement

We’re excited about these initial results and about having a more concrete framework for running factored evaluation experiments, but we also recognize that our work is far from perfect. We want to improve upon the following next time and hope readers will cautiously interpret our results in light of these limitations.

Sample root claims independent of the claim tree generation process

We don't believe that our results apply to a broad set of claims because:

The same expert generated both the claim and the corresponding claim tree. The expert was told to generate claims that are best supported by depth-2 claim trees.
This generation was done by Ought employees who understood the goals of the experiment and may have been biased in a particular way.
The inferential gap between the text and our claims was small (by necessity due to small tree size). Our results may not provide much information about claims that require more complex inferences about a text.

The fact that performance on these claims was ambiguous suggests that we didn’t stumble upon a narrow set of convenient claims, but we want to control for this more carefully in the future by, e.g., generating claims independent of claim trees.

Check that claim tree generation doesn't have systematic biases

Future experimenters may want to check that the process used to generate claim trees doesn't distort the results. For example, untrained experts could be worse at supporting false root claims than true root claims, or bad at rebuttals for particular types of claims.

Control context for depth-1 and depth-2 judges more carefully

The amount of text that a judge can read to evaluate their step should be the same at all steps and across all depths so that we can isolate the impact of adding more steps at increasing depths. However, some depth-2 judges had more context than depth-1 judges. Judges who evaluated whether or not a root claim was true in light of subclaims saw up to 400 characters of subclaims + 200 characters of rebuttal quotes. All depth-1 judges only saw 200 characters of quotes + 200 characters of rebuttal quotes. Some of the extra characters at the subclaim-to-root-claim level were template characters that provided no new information, which means that the actual difference was smaller.

Even with this additional advantage for depth 2, we don’t see much differentiation between depth 2 and depth 1. For experiments that do establish a difference between depth 1 and 2, controlling context size will be important.

Pre-register the experiment

A future iteration of this experiment should have more features of the experiment defined upfront. In this iteration:

We chose ensembling percentiles and thresholds after seeing the data. We did set a threshold beforehand informed by past work, but the setup differed enough that comparing to our ex-ante thresholds wasn’t helpful.
We didn’t control the total number of judgments per step. We limited our analysis to trees with a minimum of 4 judgments for all steps but some of those steps had more than 4 judgments, while others had exactly 4 judgments. We had to balance the distribution of judgments collected per step with considerations like information contamination or providing a reliable stream of work for participants and chose to err on the side of collecting more data when possible.
Instructions to judges changed slightly throughout data collection as we received feedback from participants. To us, these changes did not seem likely to change results meaningfully; for example, they provided more specific instructions for dealing with information contamination.

Clarify the task to reduce variance across judgments

The high variance in judgments we discussed in the analysis section suggests that our task is insufficiently clear to participants. It may also be worth starting with an even simpler task (such as judging arguments about arithmetic).

Minimize information contamination

Given the pool of participants we had access to, many participants evaluated multiple steps from the same tree. In the worst case, this could lead to “information contamination”, where a participant’s judgment for a step is different from the judgment they would have made if they had no context.

We took steps to mitigate this. We avoided scheduling people to the same tree when possible, we asked participants if they were contaminated and excluded their judgments if so, and each participant only saw the depth-1 or depth-2 tree, not both. A larger pool of participants will minimize the likelihood of contamination further.

Test rebuttals as claims, not just quotes

Instead of being a quote, each rebuttal could be a claim, with supporting quotes and a rebuttal of its own. This would make rebuttals easier to interpret.

Clearly show that depth 2 outperforms depth 1

We want performance to improve with greater depth: what we do at depth 2 shouldn’t be achievable just as easily at depth 1. This is more of an improvement opportunity than a methodological limitation of this experiment. It’s also possible that depths 1 and 2 are too close and that we need to compare a larger depth to depth 1 or 2 to see a difference.

Progress Update October 2019

Jungwon Byun and Andreas Stuhlmüller — Mon, 28 Oct 2019 00:00:00 GMT

This is an update on our progress towards our goals over the last ten months. If you can only read 650 characters of this update, like the judges in our experiments, here’s what you need to know:

We switched from experiments that break down tasks (factored generation) to experiments that break down evaluating expert work (factored evaluation)
60+ participants have been working 150+ hours per week on our experiments
We're building Mosaic2, an app that streamlines running varied question-answer experiments (factored evaluation, debate, etc.)
We're exploring if language models can automate decompositions, getting 30% accuracy on the Complex Web Questions dataset
William Saunders joined as ML engineer, Jungwon Byun as COO
We’re hiring an engineering team lead and a business operations person. We’ll pay $5000 for a successful referral!

What we do and why

We believe that the most important questions facing humanity are complex and open-ended. These questions range from “What types of policies will effectively curb climate change?” to “How should we deal with the potentially transformative impacts of AI?” and “What career should I pursue?” Despite their importance, such open-ended questions are answered poorly today.

As AI plays a bigger role in society, the world will likely get more complex. It will become even more important to give good answers to questions like these. Unfortunately, AI is not on track to help substantially with answering these open-ended questions. So far, we only know how to use AI to help us with tasks that have clear metrics or fast empirical feedback loops.

Our mission is to make AI just as useful for open-ended questions. Figuring out how to direct the most powerful technologies of our time to the most important questions society wrestles with is a highly leveraged way to have a large, positive impact. Rather than directly tackling climate change, or poverty, or animal suffering, we’re improving the process by which decisions on all of these issues get made.

To apply AI to questions like these, we design, test, and implement mechanisms for delegating open-ended cognitive work to experts who are only trying to optimize clear feedback signals. Our work today involves running experiments with human participants, building web apps to gather data from and structure the experiments, and connecting what we learn from human experiments to ML training. Over time, we'll incrementally automate the work of our human participants and build a platform that deploys ML to answer open-ended questions.

Following up on our December 2018 update

We ended our last update with the following goals for the first half of 2019:

1. Run more multi-user experiments. Get to a point where we can gather substantial evidence on the feasibility of factored cognition.
2. Continue our foundations research program, integrating reflection, laziness, distillation, speculative execution, and scheduling via question-answering into a single prototype.
3. Over time, consolidate Mosaic, Affable, and potentially Relay into a single app.
4. Fill our open roles: COO, web developer, and experimenter.

We’ve done 1 and 3, parts of 2, and most of 4. We’re still hiring for an engineering team lead!

Experiments with human participants

Breaking down tasks to breaking down evaluation

As of our last update, we were running factored generation experiments. In these experiments, participants break down a complex task into easier tasks, delegate the easier tasks, and use the solutions to these tasks to complete the larger task.

For example, a participant in a factored generation experiment might get the question “What are all of the nouns in the sentence below?” and they would have to return a list of nouns.

Since March of this year, we've switched to running factored evaluation experiments, another instance of factored cognition. Instead of breaking down the original task to complete it, we break down the evaluation of solutions to the task.

The factored evaluation version of the question above looks like: “Is ‘dog’ or ‘cat’ a better answer to the question ‘What are all of the nouns in the sentence below?’” and the participant chooses one of the answers.

There are a few reasons why we concluded that factored evaluation is a better research direction for now:

If there are experts in the world that can already generate good answers to questions, we should just use those capabilities as is, without reproducing them using decompositions. In particular, we'd like to use ML systems that are trained end-to-end as long as we can evaluate the quality of their outputs.
In many cases, evaluating solutions is easier than completing the task. It's easier to check that a solution to a Sudoku is indeed a solution than to come up with it in the first place.
After running a few experiments, we realized that breaking down tasks of any interesting complexity using factored generation requires extremely large trees of work. We could get around this by summarizing entire subtrees using experts ("oracles") and focusing on the most difficult subtrees. If we then choose which subtrees to expand using adversaries who win if they can find a case where decomposition doesn't work, we're already close to the factored evaluation setting.
For training ML systems, factored generation on its own is an incomplete solution if we’re working with agents that can't exactly imitate the demonstrated behavior. Factored evaluation defines a reward function that specifies what better/worse behavior looks like. Factored generation only demonstrates intended behavior. To define better/worse behavior for generation, we still need a distance metric that can compare any given agent behavior to this demonstration.

Deployed systems will likely use both factored generation and evaluation, or different mechanisms altogether that combine the best of both worlds, so we expect to test a combination of both in the future.

Our current experimental setup

Our factored evaluation experiments study how a weak human judge can supervise experts who know more and have more time to think. We’re starting with reading comprehension questions where we create an artificial gap between a judge (who doesn’t get to read a text) and experts (who do).

Skipping over a few details, our experiments have the following structure:

There is a person, the judge.
The judge faces an overall (root) question: “What does the author of this Pitchfork music album review think of the work being reviewed?”
This judge is handicapped: they can read at most 650 characters, so they can never read the whole review. Thus, the judge does not have the context required to answer this root question.
However, the judge has access to two experts who can read the whole text and who provide two possible answers.
Unfortunately, only one of these experts is honest. The other is malicious and is trying to trick the judge into accepting a wrong but plausible-sounding answer.
Without ever seeing the whole text, and only getting information through the experts, the judge must ask follow-up questions to the experts to decipher which answer to the root question is honest and select that one.
No one can lie about quotes or quotes’ positions in the text: the quotes from the text are the ground truth anchoring this game.
Up to 6 total questions can be asked by the judge before a decision must be made.

Whenever the judge asks the experts a question, this generates a new experiment: Now a different judge must decide which of two expert answers to that question is honest and which is malicious, using the same recursive process. For this to terminate, eventually a judge must choose an answer without asking any subquestions.
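
The recursive structure described above can be sketched in code. This is only an illustrative outline, not the Mosaic implementation: the `ask` and `choose` policy functions, the `Verdict` record, and the `max_depth` cutoff are all hypothetical names introduced here, and the sketch leaves out the 650-character limit and the quote mechanism.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

MAX_SUBQUESTIONS = 6  # per-judge question limit from the setup above


@dataclass
class Verdict:
    question: str
    chosen: str  # the answer this judge selected


# Hypothetical interfaces (assumptions, not part of the original setup):
# an expert maps a question to an answer; `ask` returns the next
# subquestion or None to stop; `choose` picks one of the two answers.
Expert = Callable[[str], str]
Answers = Tuple[str, str]
AskPolicy = Callable[[str, Answers, List[Verdict]], Optional[str]]
ChoosePolicy = Callable[[str, Answers, List[Verdict]], str]


def run_judge(
    question: str,
    honest: Expert,
    malicious: Expert,
    ask: AskPolicy,
    choose: ChoosePolicy,
    depth: int = 0,
    max_depth: int = 3,  # assumed cutoff that forces the recursion to end
) -> Verdict:
    # Sort the answers so the judge cannot tell which expert wrote which.
    answers = tuple(sorted((honest(question), malicious(question))))
    verdicts: List[Verdict] = []
    if depth < max_depth:
        for _ in range(MAX_SUBQUESTIONS):
            sub = ask(question, answers, verdicts)
            if sub is None:
                break
            # Each subquestion becomes a new experiment with its own judge.
            verdicts.append(
                run_judge(sub, honest, malicious, ask, choose, depth + 1, max_depth)
            )
    return Verdict(question, choose(question, answers, verdicts))
```

At the deepest level the judge is not allowed to ask anything, which mirrors the requirement that some judge must eventually choose an answer without subquestions.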

What we learned about factored evaluation

Ultimately, we want to show that this approach to evaluating experts’ answers allows the judge to identify the correct answer even when they have limited access to information. To do this, we have to show that the judge selects the honest answer consistently, even when the malicious expert is very strong or has a lot of advantages.

We haven’t yet shown that factored evaluation works reliably in this reading comprehension context. Over the past few months of testing factored evaluation, we’ve learned that:

Good strategies don’t emerge organically

Our factored evaluation experiments were originally structured like a game. Our initial hypothesis was that if we could establish the rules of a game and provide participants with feedback, then strategies for what questions to ask in what context would emerge organically as more people tried to win. While we’ve seen some interesting norms and tactics emerge, we’ve concluded that this is not the most effective way to develop strategies because:

Many of the decomposition strategies that participants tried were overfit to the specific rules and the Pitchfork reading comprehension context (understandably so). But Ought cares about strategies that show promise of generalizing to multiple contexts, especially to evaluating answers to open-ended questions.
It's difficult to come up with good strategies, for both participants and Ought employees. We’ve updated towards thinking that a few people need to spend concentrated time working on this, not intermittent time here and there. Additionally, generating promising strategies benefits from a background in probability and logic.

Some of the strategies we've seen so far (designed for internal consumption):

Claim Validation v0.3 / v0.2 (Erol Can Akbaba)

Idea: Split each of the expert answers into distinct claims, have the experts provide evidence for each claim and arguments about the relative goodness of evidence.

Malicious Credences (Liam Donovan)

Idea: Ask the experts to state probabilities on which answer is better, and which aspects of each answer are likely to be misleading; ask them to provide evidence that will most help the judge update their beliefs.

What procedure should I use? (Paul Christiano, William Saunders)

Idea: Always ask the experts "What procedure should I use to decide which of the two answers is better?" and then implement that procedure.

Factored Justification (William Saunders) / Factored Explanation (Mati Roy)

Idea: Ask the experts to produce a complete hierarchical argument in a standardized format for each of the answers upfront, then check the two arguments.

While none of these strategies seem sufficient on their own, we think that they all contain seeds for promising future strategies.

Work needs to be ensembled to reduce error rates

No matter how skilled or well-trained a participant is, they will still make mistakes. They might sometimes misinterpret the question or the text and accidentally return or select an incorrect answer to the question they’re working on. If we want to eventually deploy our question-answering systems in high-stakes situations, we can’t afford these mistakes. We’ve realized the need for building systems and processes that aggregate answers across participants to ensure that the malicious expert never wins simply because someone made an avoidable mistake. As a first step, we're looking into ensembling: we give the same subtask to multiple people and then use a voting process to determine which of their answers to go with.
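
As a rough illustration of the voting idea, here is a minimal sketch. The function names, the strict-majority rule, and the assumption that participants err independently are ours for the purpose of the example; they are not a description of the process we will end up using.

```python
from collections import Counter
from math import comb
from typing import Hashable, Optional, Sequence


def ensemble_answer(answers: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the answer chosen by a strict majority, or None if there is none."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    winner, votes = Counter(answers).most_common(1)[0]
    # Without a strict majority, signal that the subtask needs escalation.
    return winner if votes * 2 > len(answers) else None


def majority_error_rate(p: float, n: int) -> float:
    """Chance that more than half of n voters are wrong on a two-answer
    choice, assuming each errs independently with probability p."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1)
    )
```

Under the independence assumption, three participants who each err 10% of the time produce a wrong majority about 2.8% of the time. Correlated mistakes (for example, a misleading question that fools everyone) would not be reduced this way.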

Trees get large in factored evaluation, too

By default, trees don't just get excessively large in factored generation, but in factored evaluation as well. Here is an extreme example of a tree that went on for over 2 months without completing, i.e., without an answer being selected at the root level. In this tree, 42 subquestions were asked, with 195 assignments worked on by 22 different participants.

Trees need to finish for us to validate whether a strategy reliably leads to the right answer being selected. To resolve this issue, we rolled out a version of the game we call Most Interesting Branch. In Most Interesting Branch, the honest and malicious expert agree to explore just one path down the question-answer tree, and to yield to the other’s answer for the rest of the tree.

Priorities

Over the next few months, we’ll focus on developing strategies in house and testing factored evaluation more modularly. Instead of trying to get participants to come up with promising strategies by playing and improving at a game, Ought employees will devise strategies that we think should consistently select the right answers and generalize beyond the reading comprehension context. We’ll then test strategy execution in a more incremental fashion, starting with single-layer trees and producing robust guarantees at each step along the way.

We value getting feedback on different approaches to experimentation. We’ve assembled an experiment review board of 10 academics, including professors at Stanford, UCSD, Berkeley, Harvard, ANU, and Wharton. We trust their judgement on experimentation and think that such a board will help us run experiments more rigorously as well as broaden the reach of our research. If you have thoughts on how we can run better experiments, reach out to us at experiments@ought.org.

Machine learning projects

Today, machine learning systems are not advanced enough to do open-ended reasoning, so we're primarily running experiments with human participants. Longer-term, we’ll automate the work of participants in the experiments described above, such that the decompositions, expert answers, and answer evaluations are all produced by machine learning systems.

To ensure that our research with human participants doesn’t deviate too far from what is needed in the future to work with ML, and to better estimate when all of this work can be automated, we ran the following projects:

Automating decompositions in narrow domains

Complex Web Questions

First, we took the Complex Web Questions dataset, which contains questions like this:

The actress that had the role of Martha Alston, plays what role in Finding Nemo?
Which school that Sir Ernest Rutherford attended has the latest founding date?
What movies does Leo Howard play in and that is 113.0 minutes long?
Where is the end of the river that originates in Shannon Pot?

We built an end-to-end system using GPT-2 that breaks the questions into subquestions, queries Google to answer each of the subquestions, and aggregates the answers back together to answer the original question. Currently, our system answers about 30% of the questions in CWQ correctly.
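
The overall shape of such a pipeline can be sketched as follows. This is a simplified outline rather than our actual system: `decompose`, `lookup`, and `aggregate` are hypothetical stand-ins for the language model, the search step, and the answer aggregation, and the `#1`, `#2` placeholder convention for referring to earlier answers is an assumption made for the example.

```python
from typing import Callable, List


def answer_by_decomposition(
    question: str,
    decompose: Callable[[str], List[str]],
    lookup: Callable[[str], str],
    aggregate: Callable[[str, List[str], List[str]], str],
) -> str:
    """Break a question into subquestions, answer each, and combine the results."""
    subquestions = decompose(question)
    subanswers: List[str] = []
    for sub in subquestions:
        # Substitute earlier answers for "#1", "#2", ... (assumed convention).
        # Go from the highest index down so "#1" never clobbers "#10".
        for i in range(len(subanswers), 0, -1):
            sub = sub.replace(f"#{i}", subanswers[i - 1])
        subanswers.append(lookup(sub))
    return aggregate(question, subquestions, subanswers)
```

Passing the three stages in as functions keeps the control flow testable without a model or a search backend.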

Numerical estimation

We also started compiling our own dataset of numerical estimation questions, questions like:

How many cells are in an adult Paedophryne amauensis frog?
If plants stopped "making" O2, how much time of breathable air do humans have?
How much do all the world’s beards weigh?

We learned that this dataset needs to be highly structured for GPT-2 to learn how to break down the initial question into subquestions based on human demonstrations. Currently, our data format looks like this:

| | | |
| --- | --- | --- |
| Question | | How many cells are in an adult Paedophryne amauensis frog? |
| Formalization | number | cells in an adult Paedophryne amauensis frog |
| A1 | volume | adult Paedophryne amauensis frog |
| A2 | volume | cell in adult Paedophryne amauensis frog |
| Aggregation | | A1 / A2 |

In this dataset, our current ML predictions match human decomposition steps 15% of the time on our validation set.
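
To make the structure of one record concrete, here is a minimal sketch of how the frog example could be represented and aggregated. The class and field names are hypothetical, and the aggregation parser only handles a single binary operation such as "A1 / A2"; our actual data format may differ.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List

_OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


@dataclass
class Step:
    label: str     # e.g. "A1"
    quantity: str  # e.g. "volume"
    subject: str   # e.g. "adult Paedophryne amauensis frog"


@dataclass
class Decomposition:
    question: str
    formalization: Step
    steps: List[Step]
    aggregation: str  # e.g. "A1 / A2"

    def aggregate(self, values: Dict[str, float]) -> float:
        """Combine numeric estimates for the steps, e.g. {"A1": ..., "A2": ...}."""
        left, op, right = self.aggregation.split()
        return _OPS[op](values[left], values[right])


frog = Decomposition(
    question="How many cells are in an adult Paedophryne amauensis frog?",
    formalization=Step(
        "Formalization", "number", "cells in an adult Paedophryne amauensis frog"
    ),
    steps=[
        Step("A1", "volume", "adult Paedophryne amauensis frog"),
        Step("A2", "volume", "cell in adult Paedophryne amauensis frog"),
    ],
    aggregation="A1 / A2",
)
```

Given estimates for both volumes in the same unit, `frog.aggregate({"A1": ..., "A2": ...})` returns their ratio as the estimated cell count.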

For both numerical estimation and Complex Web Questions, we view these results as initial (weak) evidence that fine-tuning general-purpose language models on decomposition tasks might be promising. To better understand how true this is, in future work we'd like to study cases where getting decompositions right requires world knowledge that the model has learned in its unsupervised pretraining phase.

Estimating how model performance scales in dataset size

Over time we'd like to estimate quantitatively how much data we need to automate the work our participants do. As a first step in that direction, we explored the effects studied in the Hestness et al. (2017) paper “Deep learning scaling is predictable, empirically”. Hestness et al. showed that, across a number of domains including image classification, language modeling, and speech recognition, there is a power-law relationship between dataset size and validation loss. That is, to halve validation loss you need some constant k times more data, where k depends on the task.

We replicated their results using transformer models on small decomposition tasks (Complex Web Questions, numerical estimation, math word problems). Calculating k for numerical estimation tasks of different kinds based on small-scale initial data collection helped us converge on the structured data format above. If you're training language models and are deciding what kind of data to collect, you might want to run a similar exercise to estimate ahead of time how much data you'd need to achieve a particular validation loss.
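
A minimal sketch of such an exercise, assuming a pure power law of the form loss ≈ a · N^(−b) with no irreducible-loss floor (a simplification; the function names are ours). Fitting a line in log-log space gives the exponent b, and the constant k for halving the loss is then 2^(1/b).

```python
import math
from typing import Sequence, Tuple


def fit_power_law(sizes: Sequence[float], losses: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of loss ≈ a * size**(-b) in log-log space; returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def data_multiplier_to_halve_loss(b: float) -> float:
    """The constant k: how many times more data is needed to halve the loss."""
    return 2.0 ** (1.0 / b)


def size_for_target_loss(a: float, b: float, target_loss: float) -> float:
    """Dataset size at which the fitted curve reaches target_loss."""
    return (a / target_loss) ** (1.0 / b)
```

For example, an exponent of b = 0.5 means that halving the loss takes four times as much data. Extrapolating far beyond the measured dataset sizes should be treated with caution, since the fit is only checked on the range you measured.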

Software engineering

Ought’s engineering team owns Mosaic, the web app we use to gather data from our experiment participants. Mosaic was initially built around the factored generation experiments we mentioned earlier, which makes it suboptimal for our current experiments. We’re excited about running many different types of question-and-answering experiments in the future, so we’ve started working on Mosaic2.

Mosaic2 is a more flexible web app that simplifies setting up varied experiment mechanisms. In Mosaic2, teams can specify the types of interactions they want to have with experiment participants. Without building separate apps, they can easily run factored evaluation, factored generation, or debate experiments.

Mosaic2 is still under development, but we’re excited to launch it soon! If the idea of building an app that structures and aggregates the thinking of a large crowd of participants excites you, check out this opportunity to lead the team building it.

Organization updates

Team

William Saunders joined us in May as a Research Intern and decided to extend his time with us through the next year. William is leading the machine learning projects described above.
Jungwon Byun joined as COO in June. She runs finance, legal, recruiting, and experiment operations.
Our biggest hiring priorities are an Engineering Team Lead and someone to join our Operations team.

Collaborators and contractors

The following contractors and collaborators also contributed to our work:

Zachary Miller and Andrew Schreiber are supporting Mosaic 1 and 2. They're building features like optimized scheduling algorithms for matching participants to cognitive work, and have tackled architectural challenges like how to concisely specify experiments as state transition functions.
Mati Roy is helping manage our contractor network and experiment logistics.
Milan Griffes and Ben Goldhaber have helped support various projects on the business operations side, including recruiting, payroll, and visa processing.

Special thanks to:

Jeff Wu for helping us finetune GPT-2 for our purposes
Beth Barnes and Long Ouyang for providing regular feedback on our experiments
Our amazing participants, who are so flexible about testing different rules and approaches, sneak Easter eggs of humor into their malicious answers, hold us accountable in our work, and support each other in tough times

New donors

Since December 2018, we’ve received generous donations from the following people and institutions:

Paul Christiano, who explains his decisions to donate in this post on LessWrong
Ben Delo, advised by Longview Philanthropy
The Centre for Effective Altruism’s Long-term Future Fund
Nisan Stiennon, who “chose to support Ought because research into factored cognition is a promising way to attack the AI alignment problem”
The Future of Life Institute (the second disbursement of a previous award)

Content

Andreas gave a talk at EA Global in June, explaining the challenges of delegating open-ended question-answering to experts, whether humans or machines. He argues that questions without easily checkable answers are particularly difficult and yet particularly important, and demos Ought’s experiments designed to make progress on this challenge.
Also linked above, Paul Christiano wrote a post on LessWrong explaining why he thinks Ought is “one of the most promising projects working on AI alignment.”
William and Owain published a document describing three concrete machine learning projects that researchers interested in Iterated Distillation and Amplification (IDA) might want to tackle:
- Applying IDA to math problems, and solving them by breaking them down into easier math problems
- Applying IDA to Neural Program Interpretation
- Using adaptive computation techniques to decide when to rely on a fast distilled model vs. run a more expensive decomposition

How you can help

If you’d like to help with our work, you can:

Refer candidates for our Engineering Team Lead role. We’re currently offering a $5,000 referral bonus to the person who introduces us to the right candidate.
Introduce us to candidates interested in joining our Operations team. Here’s a letter explaining the role and the type of person we hope to work with.
Donate - we’re a 501(c)(3) non-profit and none of the activities above could happen without funders like you!

For more updates like this one, sign up for our newsletter.

Thanks!

Talk transcript: Delegating open-ended cognitive work

Andreas Stuhlmüller — Thu, 01 Aug 2019 00:00:00 GMT

We've published an edited transcript for a talk I gave at EA Global 2019. This talk gives an update on the core problem we're trying to solve at Ought and shows what our current experiments look like.

The summary:

In the long run, we want machine learning (ML) to help us resolve open-ended questions like “Should I get this medical procedure?” and “What are the risks in deploying this AI system?” Currently, we only know how to train ML if we have clear metrics for success, or if we can easily provide feedback on the desired outputs. This modified talk (originally given at EA Global 2019) explains some of the mechanism design questions we need to answer to delegate open-ended questions to ML systems. It also discusses how experiments with human participants are making progress on these questions.

Read the transcript

Progress Update Winter 2018

Andreas Stuhlmüller — Mon, 31 Dec 2018 00:00:00 GMT

This is an update on our progress towards our mission in the second half of 2018.

Our mission is to figure out how to use machine learning to support tasks that require thinking and reflection. We focus on approaches that are “scalable” in the sense that better ML or more compute makes them increasingly helpful for supporting and automating deliberation, without requiring additional data generated by humans.

Over the last six months, our research focused on factored cognition as a potential route towards scalable automation of human-endorsed reasoning. This approach is based on decomposing the cognitive work involved in answering difficult questions into small, mostly context-free pieces that can be supervised by individuals who don't know the big picture.

Our work was split between better understanding the foundations of computing paradigms for factored cognition, building apps that facilitate experiments with human participants, and running such experiments.

Towards these goals, we:

Extended Mosaic, our app for recursive question-answering, with scheduling functionality. This allowed us to run multi-user experiments.
Created a new lightweight app, Relay, for running factored cognition experiments where participants work on a single shared workspace in serial.
Ran about 100 participant-hours of experiments, split evenly between Mosaic and Relay.
Created Affable, a Haskell-based recursive question-answer system that supports cache-based automation and is our new basis for experimenting with more advanced features.
Hired a research engineer, Derek Elkins.
Received grants from FLI and the Long-Term Future EA Fund.

We provide context and details for these achievements below.

Foundations

To better understand the foundations of factored cognition, we've started a new Haskell-based command-line prototype for recursive question-answering, and we've continued doing conceptual work.

Affable: A Haskell-based lab for recursive QA systems

Affable is a new open-source prototype for recursive question-answering, implemented by Derek Elkins in Haskell. It is the successor to Patchwork, our previous Python-based prototype.

Affable was designed to serve as a platform for multiple prototypes to explore the design space of recursive question-answering systems. The current version of Affable presents a command-line interface similar to Patchwork's. Affable is effectively an interpreter for a simple functional language, executing code provided on an as-needed basis via human interaction. As a result, automation resulting from cached human actions can be exported as reasonably readable Haskell code.

Affable maintains all of its state in an SQL database. Making the data model explicit and externally accessible has several advantages:

It supports upcoming features like reflection and distillation.
It helps diagnostics, instrumentation, and scalability.
It simplifies the transition to a web app later on.

To verify that arbitrary computation can be automated using a finite number of human interactions, Derek implemented an abstract machine for the call-by-value lambda calculus, the CEK machine, through question-answering. Asking the question "What does [expression] evaluate to?" for two small example lambda terms, namely (λx.x)(λy.y) and (λx.λy.x)(λw.w)(λz.z), was enough to completely automate the evaluation of all future lambda terms, including Church numeral exponentiation and infinite loops. Derek verified the correctness of the learned behavior by looking at the automation state formatted as code.
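
For readers unfamiliar with the CEK machine, here is a minimal direct implementation for the call-by-value lambda calculus. It is a conventional sketch of the abstract machine itself, not the question-answering version Derek built in Affable; the class names and the `max_steps` guard against non-terminating terms are our own choices for the example.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Var:
    name: str


@dataclass
class Lam:
    param: str
    body: "Term"


@dataclass
class App:
    fn: "Term"
    arg: "Term"


Term = Union[Var, Lam, App]


@dataclass
class Closure:
    lam: Lam
    env: dict  # maps variable names to closures


@dataclass
class ArgK:
    """Continuation: the function is being evaluated; the argument is pending."""
    arg: Term
    env: dict
    next: "Kont"


@dataclass
class FnK:
    """Continuation: the function value is known; the argument is being evaluated."""
    fn: Closure
    next: "Kont"


Kont = Optional[Union[ArgK, FnK]]  # None is the halt continuation


def evaluate(term: Term, max_steps: int = 10_000) -> Closure:
    control, env, kont = term, {}, None
    for _ in range(max_steps):
        if isinstance(control, App):
            # Evaluate the function position first and remember the argument.
            kont = ArgK(control.arg, env, kont)
            control = control.fn
            continue
        # Variables and lambdas produce a value immediately.
        value = env[control.name] if isinstance(control, Var) else Closure(control, env)
        if kont is None:
            return value
        if isinstance(kont, ArgK):
            control, env, kont = kont.arg, kont.env, FnK(value, kont.next)
        else:
            # Apply the closure: bind its parameter and evaluate its body.
            fn = kont.fn
            control, env, kont = fn.lam.body, {**fn.env, fn.lam.param: value}, kont.next
    raise RuntimeError("step limit reached (the term may not terminate)")
```

With this sketch, `evaluate(App(Lam("x", Var("x")), Lam("y", Var("y"))))` returns the closure for λy.y, matching the first example term above.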

Conceptual work: Scheduling, pointers, automation

As part of the implementation of Affable, Derek wrote internal documents on the semantics of pointers and on potential extensions for cache-based automation.

Following up on our taxonomy, I've been writing about approaches to scheduling cognitive work. Any app for factored cognition needs to decide which workspace should be worked on next, by human participants or by automation. The most natural approach, in the long run, is to delegate the decision about how to allocate cognitive work to the question-answering system (e.g. by asking "Which workspace should we show next to user #312?"). This seems worthwhile to think about on its own terms, but also as a representative instance of meta-reasoning within the QA system. I expect this sort of meta-reasoning to be critical to the promise and challenge of factored cognition.

We haven't cleaned up any of these notes yet, and doing so is not high priority for us, but we would be happy to share them with interested parties.

Apps for experiments

To run factored cognition experiments with human participants, we've continued building our web app for recursive question-answering, Mosaic. We've also started a new exploratory short-term project, Relay, which has its own lightweight web app that lets us explore what decomposition strategies people come up with when we don't enforce recursive question-answer structure.

The Mosaic and Relay apps don't support automation or sophisticated programming language features, but they let us gather empirical evidence now while we work out more full-featured systems as part of our foundations research program. We expect that foundations and empirical research will merge over time, and that later experiments will increasingly supplement human work with automation.

Mosaic: A feature-rich app for tree-structured experiments

Over the past six months, Zak Miller has overhauled the style of Mosaic, made numerous usability improvements (such as automatically exporting any text enclosed in square brackets), and introduced the following features:

Automated scheduling: A scheduler ensures that multiple users aren’t working on the same workspace at the same time, that work gets distributed evenly across top-level questions, and that subquestions get resolved before their parents. Previously, we thought we'd be able to run experiments with human managers instead of automated scheduling. This turned out to be more difficult for the human scheduler than we expected, but once we implemented automated scheduling, we were in a good position to run multi-user experiments.

Time budgets: Workspaces are now assigned an amount of time that serves as a budget. This budget is spent when a user works on the workspace, and can be passed along when a user creates a subquestion.

Bandwidth constraints: Time budgets are not the only way to push users to break up cognitive work into small units. Zak has also implemented bandwidth constraints, where each workspace limits the incoming and outgoing number of characters.

Oracles: Sometimes it’s clear that a certain question can be answered using decomposition. In this case, we don’t learn much by having participants go through this process. To address this, we’ve implemented oracles. Oracles are users who have unlimited budgets and full background information. Oracles directly answer the sub-questions that would provide the least evidence about whether decomposition could work, and leave the more interesting questions for non-oracle participants.
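
The three scheduling rules described above (no two users on the same workspace, even distribution across top-level questions, subquestions before parents) can be sketched like this. It is an illustrative outline, not Mosaic's scheduler: the `Workspace` fields, the `work_done` counter, and the tie-breaking by id are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Workspace:
    id: str
    root_id: str  # the top-level question this workspace belongs to
    child_ids: List[str] = field(default_factory=list)
    resolved: bool = False


def next_workspace(
    workspaces: Dict[str, Workspace],
    in_progress: Set[str],      # ids currently assigned to some user
    work_done: Dict[str, int],  # assignments completed so far per top-level question
) -> Optional[Workspace]:
    """Pick the next workspace to hand out, or None if nothing is available."""
    eligible = [
        ws
        for ws in workspaces.values()
        if not ws.resolved
        and ws.id not in in_progress  # never two users on one workspace
        and all(workspaces[c].resolved for c in ws.child_ids)  # children first
    ]
    if not eligible:
        return None
    # Favor the top-level question that has received the least work so far.
    return min(eligible, key=lambda ws: (work_done.get(ws.root_id, 0), ws.id))
```

A workspace with no subquestions yet is immediately eligible; once a participant creates subquestions for it, it stays blocked until all of them are resolved.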

Relay: A lightweight app for sequential experiments

Ben Goldhaber created the Relay web app. While Mosaic is based on a particular hypothesis about how to structure distributed cognition, the aim behind Relay is to create a more minimal, flexible app that allows us to explore alternatives.

In each Relay experiment, participants iteratively make changes to a workspace. There is a strict time limit per person (e.g. 10 minutes), but we don't enforce decomposition otherwise.

The web app is composed of two main views. The first is a dashboard, where participants can click to start working on a new question, view previous questions they’ve participated in, and chat with other participants.

The second is the workspace page, which enforces the time limit and hosts tools that a participant can use to answer the question and pass information to future participants. The tools are customizable per question so that we can experiment with different types of supporting infrastructure (e.g. Google Docs, Workflowy, spreadsheets, IDEs).

Experiments

We've been running multi-user experiments for recursive question decomposition (Mosaic) and sequential workspace edits (Relay). In both cases, our long-term goal is to understand to what extent hard cognitive tasks can be solved piecemeal, and what infrastructure and strategies enable this.

Our initial goal is to learn how to run these sorts of experiments in a way that provides evidence on the feasibility of decomposition, and to get to a stage where "simple" problems can reliably be solved in a factored way. In other words, our experiments so far are "exploratory experiments", i.e. experiments that are not aimed at providing substantial evidence for or against the promise of factored cognition, but that help us refine our methods (apps, strategies, choice of questions), and generally help us build intuition for the domain we're studying. This will probably remain the case for at least another few months, and possibly significantly longer.

Recursive question decomposition experiments

In the last two months, we ran seven one-hour online experiments using Mosaic, about one per week. Each experiment had 7-10 participants from OpenAI, MIRI, and similar organizations, for a total of about 50 hours of participant-time. This time was spent on a mix of decomposing questions and discussing feedback and strategies.

For all but one of our experiments, participants collaboratively answered SAT reading comprehension questions (example). For all but our most recent experiment, participants made contributions under time constraints, generally 90 seconds per workspace.

Let's look at our second experiment in more detail to give a flavor, and then review summary information for all experiments:

Participants saw one workspace at a time, such as this workspace.
For each SAT question, we can visualize the decomposition generated by participants as a tree.
Participants answered five top-level questions, corresponding to five trees (1, 2, 3, 4, 5).
During the experiment, participants met in a chat and discussed strategy and issues. This feedback document summarizes the discussion.
In addition to feedback, we have also been collecting question decomposition strategies. (This document was built up over the course of multiple experiments.)

We have similar data for the other six experiments.

What have we learned?

Iteration in teams: Early on, from internal experiments and discussion, we learned that we want a mostly persistent group of participants that can acquire expertise in how to decompose questions, and that can iterate on strategies.
No holds barred: In early internal experiments, we required users to limit what they do (e.g. don't use all available time, don't expand too many pointers). To simplify participating, we switched to the "no holds barred" setting where participants can do whatever they want (within the constraints imposed by Mosaic) to solve the overall question.
Infrastructure improvements: We learned a lot about how to build better infrastructure for decomposing questions. Almost all of the improvements discussed above in the section on Mosaic app improvements are the result of suggestions by participants, and the seven feedback documents contain many suggestions that we haven't yet implemented.
Limit bandwidth, not time: We learned that, while time budgets (e.g. 90 seconds per workspace) do enforce factored cognition, the frantic action they induce might make the problem harder than necessary. We are now more optimistic about bandwidth-limited experiments (e.g. each workspace can read/write at most 400 characters of input/output).
Need for oracles: It became more salient that factored cognition requires a lot of time to complete interesting questions because of the increased overhead compared to single-person problem-solving. Reading comprehension questions that take a single person 5-10 minutes take at least 1-2 hours with factored cognition. Since our experiments were limited to an hour, we only solved a few of the problems we attempted, even though it seems that we could solve many or all of them if we gave participants more total time. We are addressing this issue by adding oracles that replace the subtrees that seem least informative by directly providing answers.
Need for norms: It became more salient that establishing norms followed by all participants matters a lot. For example, should you always try to use up all the time budget your subtree has, or should you return it if it seems that you can't make good use of it? The strategies document collects norms proposed by participants. We haven't yet gotten to a point where they are consistently endorsed and followed.
Feasibility of factored cognition: I'm hesitant to draw object-level conclusions from the experiments so far, but if I had to say something, I'd say that factored cognition seems neither surprisingly easy nor surprisingly hard. I feel confident that our participants could learn to reliably solve the SAT reading comprehension questions with a bit more iteration and more total time per question, but it has taken iteration on this specific problem to get there, and it's likely that these experiments haven't gotten at the hard core of factored cognition yet.
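
To make the bandwidth-limited setting mentioned above more concrete, here is a minimal sketch. It is an illustration only, not how Mosaic works: the `Workspace` class, the `BandwidthExceeded` error, and the choice to charge reads and writes against one shared 400-character budget are all assumptions.

```python
from dataclasses import dataclass, field


class BandwidthExceeded(Exception):
    """Raised when a workspace would exceed its character budget."""


@dataclass
class Workspace:
    # Hypothetical model: one shared budget for everything read and written.
    question: str
    budget: int = 400
    used: int = 0
    notes: list = field(default_factory=list)

    def _charge(self, text: str) -> None:
        if self.used + len(text) > self.budget:
            raise BandwidthExceeded(
                f"need {len(text)} characters, {self.budget - self.used} left"
            )
        self.used += len(text)

    def read(self, text: str) -> str:
        """Reveal some input to the participant, charging its length."""
        self._charge(text)
        return text

    def write(self, text: str) -> None:
        """Record some output (a sub-question or an answer), charging its length."""
        self._charge(text)
        self.notes.append(text)


ws = Workspace(question="What is the main claim of paragraph 2?")
ws.read("x" * 300)   # 300 characters of input
ws.write("y" * 100)  # 100 characters of output; the budget is now exhausted
try:
    ws.write("z")
    over_budget = False
except BandwidthExceeded:
    over_budget = True
```

Unlike a time budget, a limit of this kind puts no pressure on how quickly a participant acts. It only restricts how much context can pass through a single workspace.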

Over time, we'll push experiments towards being more representative of the sorts of decompositions that we'd use when the aim is to provide training data for ML. This means that participants will use fine-grained decompositions that look more like algorithms for representing and reasoning with concepts ([1], [2]).

We also want to try questions from a wider range of domains and slowly scale up the number of participants for these experiments.

If you're interested in participating, please fill out this form.

Sequential programming experiments

Over the past months, Ben Goldhaber has run Relay experiments. Participants solve a tricky problem in sequence, with each participant contributing 10 minutes of work and then leaving, passing on their notes to the next participant. The goal is to see what strategies participants come up with if we don't enforce the sort of recursive decomposition used in Mosaic.

While the Relay approach is fully general, Ben has so far only applied it to math/algorithms puzzles from Project Euler. We gave participants access to a Google doc and an in-browser IDE. You can preview the interface.

In total, 98 users signed up. There were 281 different attempts on 40 different problems, totaling about 47 hours of time spent solving problems. 25 problems had at least five attempts. 4 problems were successfully solved.
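
As a quick consistency check, the reported figures can be turned into per-attempt and per-problem averages. This is plain arithmetic on the numbers above and uses no additional data.

```python
attempts = 281
problems = 40
hours = 47
solved = 4

minutes_per_attempt = hours * 60 / attempts  # about 10.0 minutes
attempts_per_problem = attempts / problems   # about 7.0 attempts
solve_rate = solved / problems               # 0.1, i.e. 10% of problems
```

An average of about 10 minutes per attempt is consistent with each attempt being a single 10-minute turn.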

As with Mosaic, we've mostly been learning how to run informative experiments.

What have we learned?

Iteration in teams: Most of our Relay experiments ran with a larger participant pool than Mosaic, and with fewer repeat participants. This caused a number of challenges:
1. Doing well as a group requires that individuals don't try to make as much object-level progress as possible, but rather make small contributions and improve the situation the next participant is in. It tends to take participants several attempts at Relay problems before they get a feel for this.
2. There is high variance in skill at Project Euler problems. Not all participants would be able to solve these problems even without factorization. This makes it more difficult to say whether failures are due to decomposition.
3. When other participants are strangers, it's difficult to be confident that bugs haven't been introduced earlier, and indeed bugs were introduced at times.
We are now running team-based experiments to address these issues.
Infrastructure improvements: Besides adding functionality for solving problems within small teams, Ben added a way for people to see which problems they participated in and to track progress after the fact, and a chat participants can use to discuss strategies.
Strategies: Here are three strategies that seemed helpful:
1. Spend some of your round planning for the next participant's round, organizing and simplifying the information the next players are exposed to. This includes restating the problem in simpler terms.
2. Create tests and checks for the correctness of subproblems.
3. Use a list of tasks, marking tasks that are good next steps and tasks that need to be made more concrete.
While these seem like reasonable ideas, we don't think we have learned anything substantial yet about what alternative decomposition strategies are most promising.

You can participate at relay.ought.org.

Hiring and collaborations

Hiring

COO

We have continued our search for a COO. So far, we’ve considered about 60 candidates for the role, and have done onsites with 4 candidates. We’ve sourced candidates from our personal networks, 80,000 Hours, and inbound applications.

We have envisioned the COO as a crucial part of Ought’s leadership, while at the same time having major day-to-day responsibilities to keep the organization running well. For these reasons, we’ve held a very high bar for COO candidates across a wide range of competencies, including high-level research understanding and ability to substantially contribute to recruiting in addition to traditional operations. We are considering whether to hire for a narrower operations role instead, such as Director of Operations or Operations Associate.

Other roles

In November, we hired Derek Elkins as a research engineer after considering about 70 candidates for the role. Derek is a strong Haskell programmer who has worked on research in functional and logic programming. He was referred to us by Edward Kmett at MIRI.

We've decided that we would like to hire someone whose main job is to run experiments, since doing experiments well is central to our success as an organization. We haven't created a job description for this role yet.

We've updated the job descriptions for the Senior Full-Stack Web Engineer and Researcher roles.

Going forward, we’re prioritizing active recruiting for the web engineer role, and we’ll be on the lookout for strong applicants for the researcher and experimenter roles.

If you’re interested in any of our roles, apply here!

Trial tasks

We created a private GitHub repository with trial tasks for operations and software engineering that we're sharing with other EA orgs. If this sounds interesting to you, get in touch.

Contractors

We started working with web developer Zak Miller, who has iterated on our Mosaic app to support our Mosaic experiments. We also began working with Ben Goldhaber, who has led our Relay experiments.

Ben West led our Research Engineer and COO hiring processes until he started working at the Center for Effective Altruism in November.

Collaborations

We're not currently actively working on ML projects at Ought, but we're excited to collaborate with interested academics on projects related to factored cognition. This document outlines the sorts of projects we'd be happy to collaborate on.

University of Toronto student Will Saunders visited our office in September, and is visiting again in January. We've started collaborating on a project related to using ML for amplification. If you're interested in this kind of project, get in touch - we're open to hosting academic visitors more frequently in the future.

Organization and funding

Most of our operations capacity has been aimed at recruiting, but we've also made some progress on general organizational improvements, including getting health insurance and other employee benefits.

We received a $225k FLI grant for factored cognition (over two years), and a $10k EA Long-Term Future Fund grant.

Outreach

Paul Christiano talked about Ought on this 80,000 Hours podcast episode.

The Factored Cognition presentation was posted as part of the Iterated Amplification sequence on the Alignment Forum. I added a comment with changes I'd make based on what I learned since May 2018.

We've generally fallen short at communicating clearly what Ought does, and in particular feel that our web page is not a great representation of our plans. Our mission has been and still is to find scalable ways to use ML for supporting and automating deliberation, but we want to be clearer about the degree to which we're a research lab vs. aimed at a product, and about how we relate to AI alignment and to other organizations in the space. We hope to remedy this over the coming months.

Plans

Going forward, our priorities are:

Run more multi-user experiments. Get to a point where we can gather substantial evidence on the feasibility of factored cognition.
Continue our foundations research program, integrating reflection, laziness, distillation, speculative execution, and scheduling via question-answering into a single prototype.
Over time, consolidate Mosaic, Affable, and potentially Relay into a single app.
Fill our open roles: COO, web developer, and experimenter.

Progress Update July 2018

Andreas Stuhlmüller — Mon, 16 Jul 2018 00:00:00 GMT

This is an update on our progress towards our goals over the last six months. Briefly:

We implemented two prototypes for our Factored Cognition project.
We presented our work at CHAI, FHI, and a FHI/DeepMind seminar.
We published a tech report on Predicting Slow Judgments.
We hired Ben Rachbach as Interim Head of Operations.
We got 501(c)(3) status from the IRS.
We received a grant from the Open Philanthropy Project.

Research and implementation

Factored Cognition

In our Factored Cognition project, we explore whether we can answer difficult questions by decomposing the cognitive work involved into small pieces that can be tackled by individual agents who don't know the big picture.

Conceptual work

In March, we published a review of technical background for Factored Cognition. The writeup presents a taxonomy of approaches to decomposing cognitive work. At the end of the writeup, we picked one of the approaches as a tentative implementation target, and we sketched a plan to experiment with the implementation in order to answer our key open questions about Factored Cognition.

Implementation

Since March, we have implemented and open-sourced two prototypes, Mosaic and Patchwork.

Mosaic is a web app built by Ozzie Gooen and Andrew Schreiber that supports creating and editing recursive question-answer trees with pointers. It doesn't support automation, doesn't currently schedule work between multiple users, and still has a number of usability issues. You can take a look at our running instance or check out the source code.
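
For readers unfamiliar with the idea, a recursive question-answer tree with pointers can be sketched as follows. This is not Mosaic's data model: the `Node` class, the `$name` pointer syntax, and the `render` function are hypothetical, and the passage text is invented for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    """One workspace in a question-answer tree (hypothetical representation)."""
    question: str
    answer: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

    def ask(self, sub_question: str) -> "Node":
        """Create a sub-question below this node and return it."""
        child = Node(sub_question)
        self.children.append(child)
        return child


def render(text: str, pointers: Dict[str, str], expanded: Set[str]) -> str:
    """Show pointer contents only for pointers the user has chosen to expand."""
    def show(match):
        name = match.group(1)
        if name in expanded and name in pointers:
            return "[" + pointers[name] + "]"
        return match.group(0)  # leave unexpanded pointers as "$name"
    return re.sub(r"\$(\w+)", show, text)


pointers = {"1": "a long passage about tides", "2": "a second passage"}
root = Node("According to $1, what causes spring tides?")
sub = root.ask("What does $1 say about the moon's position?")
sub.answer = "Sun and moon are aligned."
root.answer = "Alignment of sun and moon."

collapsed = render(root.question, pointers, expanded=set())
opened = render(root.question, pointers, expanded={"1"})
```

Pointers let a question refer to a large piece of content without showing it, so each workspace stays small until a user decides that an expansion is worth it.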

Patchwork was built by Ben Weinstein-Raun and is a command-line app aimed at developing robust foundations for the next version of Mosaic, with principled support for multiple users and extensive automation. The repository includes a short screencast. Like Mosaic, it is an MIT-licensed open-source project.

By implementing Patchwork, we developed a better understanding of the design challenges that face systems for Factored Cognition. The main open challenges we see right now are to:

Implement reflection that can capture all actions, including pointer expansions.
Better understand budgets and how they should interact with cache-based automation.
Understand how laziness and exception handling should interact. Investigate speculative execution based on predicted responses.
Figure out how to simulate edits to questions in a user-friendly fashion if questions and pointer values are immutable in the underlying system, or decide not to make them immutable.
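
To illustrate what cache-based automation means in its simplest form, the sketch below reuses a stored answer whenever the same question comes up again. It is a toy model under stated assumptions and is not Patchwork's design: `CachedAnswerer`, the normalization rule, and the `ask_human` callback are hypothetical.

```python
from typing import Callable, Dict


class CachedAnswerer:
    """Ask a human only for questions that have not been answered before."""

    def __init__(self, ask_human: Callable[[str], str]):
        self.ask_human = ask_human
        self.cache: Dict[str, str] = {}
        self.human_calls = 0

    def answer(self, question: str) -> str:
        # Assumed normalization: ignore case and repeated whitespace.
        key = " ".join(question.lower().split())
        if key not in self.cache:
            self.human_calls += 1
            self.cache[key] = self.ask_human(question)
        return self.cache[key]


answerer = CachedAnswerer(lambda q: "4" if "2 + 2" in q else "unknown")
first = answerer.answer("What is 2 + 2?")
second = answerer.answer("what is  2 + 2?")  # served from the cache
```

Even this toy version shows why budgets are subtle: a cache hit costs no human time, so it is unclear whether and how it should be charged against the budget of the workspace that asked.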

Our work on Patchwork also helped us reflect on Mosaic. We now think that perhaps we should have held off on developing a GUI app like Mosaic until we have worked out the relevant concepts using a command-line app like Patchwork. As it stands, we'll need to rewrite large parts of Mosaic once we're happy with our understanding of the foundations. However:

Mosaic has been helpful for early informal testing, and we have used it extensively to visualize trees for presentations.
It would have been difficult to find strong collaborators for extensive work on something like Patchwork.

Still, we've updated towards focusing more on conceptual development before implementation (perhaps through writing "throwaway" code).

Experiments

We haven't done serious experimentation or testing of Factored Cognition yet, only informal single-user experiments using Mosaic. That said:

We haven't yet encountered questions where it seems impossible to make progress through decomposition, so we’re slightly more optimistic that Factored Cognition can work for a very wide range of tasks.
We now have a more visceral sense that, for many problems, decomposition will require a very large amount of work, so we often won't be able to instantiate complete question-answer trees explicitly and will instead need to "distill" the question-answer behavior of entire sub-trees using ML. For example, we think that this will be the case for the most natural approaches to belief updating, language understanding, and math proofs.
We think there's a good chance that decomposition is a lot more difficult with multiple users, since each user lacks context on the parts of the tree that they haven’t seen.

We still plan to delay extensive experimentation until we have built a web app that supports cache-based automation. However, we think that we should have run basic multi-user experiments by now. We could have a human manager distribute tasks between users to run a multi-user experiment with the current version of Mosaic. Such an experiment should provide cheap evidence on how much harder decomposition with true isolation is, compared to a single person doing the entire recursive breakdown. We are planning to run such experiments with about five participants in the next two months.

Outreach

We presented the Factored Cognition project at CHAI, FHI, and a FHI/DeepMind seminar. For people interested in AI alignment, our annotated slides are probably the best introduction to our work right now.

Predicting Slow Judgments

In our Predicting Slow Judgments project, we explore whether we can use machine learning algorithms to make well-calibrated predictions of human judgments in AI-complete domains.

Tech report

Together with our collaborators at FHI (Owain Evans, Chris Cundy, Tom McGrath, Ryan Carey, Zac Kenton), we published a tech report (pdf) about the project. This report is based on our online experiment ThinkAgain, where we presented participants with statements about politics and Fermi estimation and asked them to judge the probability that each statement is true. We had them make multiple judgments per statement, giving them more time to think about their judgment with each iteration. Thanks to everyone who participated!

We used the data from participants to explore the ML problem of predicting expensive (slow) judgments in cases where most of our training data consists of cheap (fast) proxies. We used standard collaborative filtering algorithms, neural collaborative filtering, and Bayesian hierarchical regression.
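
As a rough illustration of the prediction problem, the sketch below fits a bias-only baseline of the kind often used as a starting point for collaborative filtering: a global mean plus a per-user and a per-statement offset. It is not one of the models from the tech report, it ignores the fast/slow distinction entirely, and the function name, user names, and data are invented for the example.

```python
from collections import defaultdict
from statistics import mean


def fit_bias_baseline(ratings):
    """ratings: list of (user, item, value). Returns a predict(user, item) function."""
    global_mean = mean(v for _, _, v in ratings)
    by_user = defaultdict(list)
    by_item = defaultdict(list)
    for user, item, value in ratings:
        by_user[user].append(value - global_mean)
        by_item[item].append(value - global_mean)
    user_offset = {u: mean(vs) for u, vs in by_user.items()}
    item_offset = {i: mean(vs) for i, vs in by_item.items()}

    def predict(user, item):
        # Unknown users or items fall back to an offset of zero.
        guess = global_mean + user_offset.get(user, 0.0) + item_offset.get(item, 0.0)
        return min(1.0, max(0.0, guess))  # judgments are probabilities

    return predict


# Invented probability judgments for three statements.
ratings = [
    ("ann", "s1", 0.8),
    ("ann", "s2", 0.6),
    ("bob", "s1", 0.6),
    ("bob", "s3", 0.2),
]
predict = fit_bias_baseline(ratings)
ann_on_s3 = predict("ann", "s3")  # about 0.35
bob_on_s2 = predict("bob", "s2")  # about 0.45
```

A baseline like this can only help when users differ from each other in a systematic way, which is the property that makes collaborative filtering useful in the first place.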

Dataset

Our data and models are available in this git repository. The dataset we collected has issues that make it difficult to explore the use of ML for making well-calibrated predictions of slow judgments based mostly on fast judgments. These issues include:

While slow judgments were significantly more accurate than fast judgments, the difference was smaller than intended.
Variability among subjects is difficult to distinguish from noise, so ML algorithms could not exploit similarities among users as in collaborative filtering.
While users clearly found some questions more difficult than others, this variation is very hard for current ML algorithms to exploit.

Ryan and Tom attempted to manually predict users’ slow judgments based on the judgments available as part of the training set in order to learn how close to human-level our algorithms are. They got results comparable to those of simple predictive models and qualitatively found the task to be very challenging. This provides some evidence that the dataset doesn't contain interesting patterns that humans could easily notice but ML algorithms can't.

Retrospective

In the end, we invested a large amount of labor for relatively little scientific gain (for example, about 400 hours of Andreas's time). Ryan surveyed the other team members post-mortem and the top issues (paraphrased) were:

Objectives weren't sufficiently clear, especially at onboarding.
Progress and value of the project weren't evaluated sufficiently often; there were no clear decision points to reconsider whether to keep going with the project or abandon it.

In our next academic collaboration, we'll discuss the project's goals and relation to the bigger picture during onboarding in more depth. We are also planning to do explicit monthly re-evaluations to determine whether the project should continue on its current course, change course, or be cancelled altogether.

We still feel that the ML problem behind this project—making well-calibrated predictions in AI-complete domains with infrequent direct supervision—is important, but plan to hold off on further work until we can do it in the context of data generated as part of our Factored Cognition project. A key long-term goal for Predicting Slow Judgments has always been to contribute to automating cognitive actions in that setting. If we can directly use that data and setup, we feel more confident that the things we learn will contribute to our organizational goals. If we use other settings—such as Fermi and Politifact probability judgments—as proxies, there is always a risk that the insights and solutions won't generalize to the setting we most care about.

Hiring

Besides research, expanding our team has been our main focus. Over the past several months, we’ve been hiring for full-stack web developers, research engineers, researchers, and a chief operating officer.

We have not yet hired full-time employees for any of these roles. However, we have worked with contractors and a part-time employee to cover some of the responsibilities of these roles, and we have made progress towards hiring full-time employees.

As discussed above, we worked with contractors Ozzie Gooen, Andrew Schreiber, and Ben Weinstein-Raun to develop our research apps. We also hired Ben Rachbach as Interim Head of Operations. He previously worked as a software engineer at Wonder Workshop and as a research analyst at GiveWell. Having him in this interim role will allow us to take our time and be selective in filling our COO role.

An advantage of working with contractors has been that we’ve been able to hire them for work in their area of comparative advantage, then let them move on to other high-impact work once we have no more tasks suitable for them. For example, Ozzie architected and created the first version of our app Mosaic, and he’s now working on other projects until we need more help architecting web apps.

A disadvantage of working with contractors has been that we haven’t been able to build a stable, predictable team dedicated to Ought’s success in the long term.

In the past two months, we have focused mainly on hiring a COO and a senior research engineer who could architect the apps that we use for our research. We are still accepting applications for these roles, and we encourage interested people to apply!

For our research engineering role, we originally placed some emphasis on prior ML or web development experience. However, we have realized that a general computer science and software engineering background is the most important qualification for success in the role. We are looking for candidates with experience creating good abstractions, who have functional programming experience, who may have built interpreters or compilers, and who generally have substantial CS background (as might be acquired by taking courses on design and theory of algorithms and programming languages).

In total, we have considered 117 potential hires so far (including leads we decided not to reach out to and candidates who are still moving through our hiring pipeline). We have created short trial tasks for both operations and engineering. Ben West has been helping us with the COO recruiting process.

We believe that the main reason that we have not hired full-time employees for the roles above yet is that we had not been able to spend much time on hiring until recently. We’ve set a high bar for hiring people full-time, so it has taken a good deal of effort on our part to find and evaluate candidates who might meet our bar. With Ben Rachbach and Ben West devoting substantial time to hiring, we’ve recently been able to reach a number of interested and potentially qualified applicants and move them through the hiring process.

Organization and funding

The IRS approved our application for 501(c)(3) status, which means that donations are now tax-deductible. We're not actively seeking donations at the moment and are primarily talent-constrained.

The Open Philanthropy Project made a $525k grant to us. You can read their grant report.

We moved out of a coworking space and into our own office in North Beach, San Francisco. We now have enough space to host visiting researchers working on topics relevant to our mission.

Together with Owain Evans, we submitted a grant application to FLI. If accepted, we'll use the funds to support interns at FHI who want to collaborate on our projects.

Since hiring Ben Rachbach, we have been making our way through a backlog of operations tasks such as buying insurance, setting up our office environment, and documenting and formalizing our processes.

Plans

Going forward, our priorities are:

Run basic multi-user experiments with about five participants using Mosaic, simulating automated task scheduling using a human manager.
Fill the remaining conceptual holes in Patchwork and develop the understanding required to build a robust multi-user app for question decomposition that is a good fit for extensive automation.
Build a web app for Factored Cognition that builds on the concepts developed in Patchwork. Then make progress on our project plan.
Fill our open engineering, research, and COO positions. If you're interested in working with us, get in touch!

Ought

Ought has spun off Elicit

AI Safety Needs Great Product Builders

Audience

But I thought I would need a PhD!

Example projects, and why they’re important

For software engineers

For product managers

For infrastructure engineers

Why would you do this?

Impactful work

The people are fantastic

Talking of paychecks…

Interesting, challenging work every day

What to do now?

A Library and Tutorial for Factored Cognition with Language Models

Interactive Composition Explorer (ICE)

Factored Cognition Primer

How to use Elicit responsibly

Most important takeaways

Elicit helps with, but does not automate, literature reviews

Elicit is only as good as the research it uses

Double-check Elicit's work

The team building Elicit

How Elicit works

Elicit's limitations

Limitations specific to Elicit

Limitations that apply to research or search tools in general

Other thoughts on limitations

Suggestions for how to relate to Elicit

Get a broader perspective than you could have otherwise

Find papers that you may not have found elsewhere

Figure out where to drill in

Overall reflections

More reading

The Plan for Elicit

How we think about success

Progress in 2021

Start with research

Support broad literature reviews

Establish a user base

Build infrastructure for process-based ML

Running complex task pipelines

Finetuning individual tasks

Roadmap for 2022+

Evaluate papers in depth through decomposition

Elicit factors a question

Elicit factors a research process

Support many research workflows

Refine the primitive tasks

Expand our infrastructure for process-based ML

Run more complex task pipelines

Add new tasks with little effort

Efficiently gather human demonstrations and feedback

From research assistant to reasoning assistant

Supervise Process, not Outcomes

The spectrum

Supervising outcomes

Supervising process

In between process and outcomes

It’s better to supervise process than outcomes

Differential capabilities: Supervising process helps with long-horizon tasks

Alignment: Supervising process is safety by construction

In the long run, differential capabilities and alignment converge

Two attractors: The race between process- and outcome-based systems

Outcome-based optimization is an attractor

Process-based optimization could be an attractor, too

The state of the race

Conclusion

Appendix

The crazy future

Comments

Acknowledgments

Building Elicit, the AI research assistant

What Elicit looks like

Our plans for Elicit

Recent work

The case for Elicit

Automating reasoning about the future at Ought

Judgmental forecasting today