AI Misalignment Consequences
Abstract
AI systems often rely on two key components: a specified goal or reward function
and an optimization algorithm to compute the optimal behavior for that goal. This
approach is intended to provide value for a principal: the user on whose behalf the
agent acts. The objectives given to these agents often refer to a partial specification
of the principal’s goals. We consider the cost of this incompleteness by analyzing
a model of a principal and an agent in a resource-constrained world where the L
attributes of the state correspond to different sources of utility for the principal.
We assume that the reward function given to the agent only has support on J < L
attributes. The contributions of our paper are as follows: 1) we propose a novel
model of an incomplete principal—agent problem from artificial intelligence; 2)
we provide necessary and sufficient conditions under which indefinitely optimizing
for any incomplete proxy objective leads to arbitrarily low overall utility; and 3)
we show how modifying the setup to allow reward functions that reference the full
state or allowing the principal to update the proxy objective over time can lead to
higher utility solutions. The results in this paper argue that we should view the
design of reward functions as an interactive and dynamic process and identify a
theoretical scenario where some degree of interactivity is desirable.
1 Introduction
In the story of King Midas, an ancient Greek king makes a wish that everything he touches turn to gold.
He subsequently starves to death as his newfound powers transform his food into (inedible) gold. His
wish was an incomplete representation of his actual desires and he suffered as a result. This story,
which teaches us to be careful about what we ask for, lays out a fundamental challenge for designers
of modern autonomous systems.
Almost any autonomous agent relies on two key components: a specified goal or reward function
for the system and an optimization algorithm to compute the optimal behavior for that goal. This
procedure is intended to produce value for a principal: the user, system designer, or company on
whose behalf the agent acts. Research in AI typically seeks to identify more effective optimization
techniques under the, often unstated, assumption that better optimization will produce more value for
the principal. If the specified objective is a complete representation of the principal’s goals, then this
assumption is surely justified.
Instead, the designers of AI systems often find themselves in the same position as Midas. The mis-
alignment between what we can specify and what we want has already caused significant harms (32).
Perhaps the clearest demonstration is in content recommendation systems that rank videos, articles,
or posts for users. These rankings often optimize engagement metrics computed from user behavior.
The misalignment between these proxies and complex values like time-well-spent, truthfulness,
and cohesion contributes to the prevalence of clickbait, misinformation, addiction, and polarization
online (31). Researchers in AI safety have argued that improvements in our ability to optimize
Figure 1: Our model of the principal—agent problem in AI. Starting from an initial state s(0) , the
robot eventually outputs a mapping from time t ∈ Z+ to states. Left: The human gives the robot a
single proxy utility function to optimize for all time. We prove that this paradigm reliably leads to the
human actor losing utility, compared to the initial state. Right: An interactive solution, where the
human changes the proxy utility function at regular time intervals depending on the current allocation
of resources. We show that, at least in theory, this approach does produce value for the human, even
under adversarial assumptions.
behavior for specified objectives, without corresponding advancements in our ability to avoid or
correct specification errors, will amplify the harms of AI systems (8; 19). We have ample evidence
from both theory and application, that it is impractical, if not impossible, to provide a complete
specification of preferences to an autonomous agent.
The gap between specified proxy rewards and the true objective creates a principal—agent problem
between the designers of an AI system and the system itself: the objective of the principal (the
designer) is different from, and thus potentially in conflict with, the objective of the autonomous agent.
In human principal—agent problems, seemingly inconsequential changes to an agent’s incentives
often lead to surprising, counter-intuitive, and counter-productive behavior (21). Consequently,
we must ask when this misalignment is costly: when is it counter-productive to optimize for an
incomplete proxy?
In this paper, we answer this question in a novel theoretical model of the principal—agent value
alignment problem in artificial intelligence. Our model (Fig. 1, left) considers a resource-constrained
world where the L attributes of the state correspond to different sources of utility for the (human)
principal. We model incomplete specification by limiting the (artificial) agent’s reward function
to have support on J < L attributes of the world. Our main result identifies conditions such that
any misalignment is costly: starting from any initial state, optimizing any fixed incomplete proxy
eventually leads the principal to be arbitrarily worse off. We show that relaxing the assumptions of
this theorem allows the principal to gain utility from the autonomous agent. Our results provide
theoretical justification for impact avoidance (23) and interactive reward learning (19) as solutions to
alignment problems.
The contributions of our paper are as follows: 1) we propose a novel model of an incomplete principal—
agent problem from artificial intelligence; 2) we provide necessary and sufficient conditions within
this framework under which any incompleteness in objective specification is arbitrarily costly; and 3)
we show how relaxing these assumptions can lead to models which have good average case and worst
case solutions. Our results suggest that managing the gap between complex qualitative goals and
their representation in autonomous systems is a central problem in the field of artificial intelligence.
Value Alignment The importance of having AI objectives line up with human objectives has been
well-documented, with numerous experts postulating that misspecified goals can lead to undesirable
results (29; 27; 8; 28). In particular, recognized problems in AI safety include “negative side effects,"
where designers leave out seemingly irrelevant (but ultimately important) features, and “reward
hacking," where optimization exploits loopholes in reward functions specifications (4). Manheim and
Garrabrant (2018) (25) discuss overoptimization in the context of Goodhart’s Law and provide four
distinct mechanisms by which systems overoptimize for an objective.
Incomplete Contracts There is a clear link between incomplete contracting in economics, where
contracts between parties cannot specify outcomes for all states, and AI value alignment (17). It
has long been recognized that contracts (i.e., incentive schemes) are routinely incomplete due to,
e.g., unintended errors (35), challenges in enforcement (22), or costly cognition and drafting (30).
Analogous issues arise in applications of AI through designer error, hard to measure sources of value,
and high engineering costs. As such, Hadfield-Menell and Hadfield note that legal and economic
research should provide insight into solving misalignment (17).
Impact Minimization One proposed method of preventing negative side-effects is to limit the
impact an AI agent can have on its environment. Armstrong and Levinstein (2017) propose the
inclusion of a large impact penalty in the AI agent’s utility function (6). They suggest measuring
impact as the divergence between the distribution over states of the world and the distribution that
would have arisen had the AI agent not existed; distributions that are sufficiently different would thus be avoided.
Alternatively, other approaches use an impact regularizer learned from demonstration instead of one
explicitly given (3). Krakovna et al. (2018) (23) expand on this work by comparing the performance
of various implementations of impact minimization in an AI Safety Gridworld suite (24).
Human-AI Interaction A large set of proposals for preventing negative side-effects involve regular
interactions between human and AI agents that change and improve an AI agent’s objective. Eckersley
(2018) (14) argues against the use of rigid objective functions, stating the necessity of a degree of
uncertainty at all times in the objective function. Preference elicitation (i.e., preference learning) is
an old problem in economics (10) and researchers have proposed a variety of interactive approaches.
This includes systems that learn from direct comparisons between options (9; 11; 13), demonstrations
of optimal behavior (26; 1; 36), corrective feedback (7; 15; 2), and proxy metrics (18).
In this section, we formalize the alignment problem in the context of objective function design for
AI agents. A human builds a powerful robot¹ to alter the world in a desirable way. If they could
simply express the entirety of their preferences to the robot, there would be no value misalignment.
Unfortunately, there are many attributes of the world about which the human cares, and, due to
engineering and cognitive constraints (17), it is intractable to enumerate this complete set to the robot.
Our model captures this by limiting the robot’s (proxy) objective to depend on a subset of these attributes.
A state of the world is a vector s ∈ R^L of L real-valued attributes; attribute i is bounded below by a
value b_i. The attributes represent the aspects of the world that the human cares about, and the human’s
utility function U : R^L → R is continuous and strictly increasing in each attribute, so we associate
higher attribute values with more desirable states of the world. However, an increasing constraint
function C : R^L → R represents a physical constraint that limits the set of realizable states to the
feasible set S = {s ∈ R^L : C(s) ≤ 0}. This forces an agent to make tradeoffs between these attributes.
Each tuple (S, U, s^(0)), where s^(0) ∈ S is the initial state, defines an instance of the problem.
The human seeks to maximize utility, but they cannot change the state of the world on their own.
Instead, they must work through the robot.
¹ We use “human” and “robot” to refer to any designer and any AI agent, respectively, in our model.
2.1.2 Robot Model
Here, we create a model of the robot, which is used by the human to alter the state of the world. The
human endows the robot with a set of proxy attributes and a proxy utility function as the objective
function to optimize. The robot provides incremental improvement to the world on the given metric,
constrained only by the requirement that subsequent states of the world remain feasible. In particular,
the robot has no inherent connection to the human’s utility function. Visually, the left diagram in
Figure 1 shows this setup.
Proxy Attributes The human chooses a set J ⊂ {1, ..., L} of proxy attributes to give to the robot.
        Let J_max < L be the maximum possible number of proxy attributes, so J = |J| ≤ J_max.
        For a given state s ∈ S, define s_J = (s_j)_{j ∈ J}. Furthermore, we define the set of
        unmentioned attributes K = {1, ..., L} \ J and s_K = (s_k)_{k ∈ K}.
Proxy Utility Function The robot is also given a proxy utility function Ũ : R^J → R, an objective
        function for the robot to optimize that takes only the values of the proxy attributes as input.
Incremental Optimization The robot incrementally optimizes the world for its proxy utility func-
        tion. We model this as a rate function, a continuous mapping from R+ to R^L, notated as
        t ↦ f(t), where we refer to t as time. The rate function is essentially the derivative of s
        with respect to time. The state at each time t is s^(t) = s^(0) + ∫_0^t f(u) du. We call the
        function t ↦ s^(t) the optimization sequence. We require that the entire sequence be feasible,
        i.e., that s^(t) ∈ S for all t ∈ [0, ∞).
Complete Optimization Furthermore, we assume that lim sup_{t→∞} Ũ(s_J^(t)) = sup_{s∈S} Ũ(s_J). If possible,
        we also require that lim_{t→∞} Ũ(s_J^(t)) = sup_{s∈S} Ũ(s_J). Essentially, this states that the robot will
        eventually reach the optimal state for proxy utility (if the limit is finite) or will increase
        proxy utility to arbitrarily large values.
The human’s task is to design their proxy metric so that the robot will cause increases in human utility.
We treat the robot as a black box: beyond increasing proxy utility, the human has no idea how the
robot will behave. We can make claims for all optimization sequences, such as guaranteed increases
or decreases in utility, or for the worst-case optimization sequence, yielding lower bounds on utility.
Before deriving our results, we show how to model algorithmic content recommendation in this
formalism. In this example, the designer cares about 4 attributes (i.e., L = 4 ): A) the amount
of ad revenue generated (i.e., watch-time or clicks); B) engagement quality (i.e., meaningful in-
teractions (34)); C) content diversity; and D) overall community well-being. The overall utility
is the sum of these attributes: U(s) = s_A + s_B + s_C + s_D. We use a resource constraint on the
sum of squared attributes to model the user’s finite attention and space constraints in the interface:
C(s) = s_A² + s_B² + s_C² + s_D² − 100. We imagine that the system is initialized with a chronological or
random ranking so that the starting condition exhibits high community well-being and diversity.
It is straightforward to measure ad revenue, non-trivial and costly to measure engagement quality and
diversity, and extremely challenging to measure community well-being. In our example, the designers
opt to include ad revenue and engagement quality in their proxy metric. Fig. 2 (left) plots overall
utility and proxy utility for this example as a function of the number of iterations used to optimize the
proxy. Although utility is generated initially, it plateaus and falls off quickly. Fig. 2 (right) shows that
this happens for any combination of attributes used in the proxy. In the next section, we will show
that this is no accident: eventually the gains from improving proxy utility will be outweighed by the
cost of diverting resources from unreferenced attributes.
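To make this concrete, the sketch below simulates one possible robot for this example. It is a minimal illustration, not the procedure behind Fig. 2: the projected-gradient robot, the initial state (1, 1, 7, 7), and the step size are all assumptions of ours.

```python
import numpy as np

# Minimal sketch of the recommendation example (L = 4, J = 2). The robot is a
# projected-gradient ascender on the proxy; the initial state and step size are
# illustrative assumptions rather than values taken from the text.

def true_utility(s):
    return s.sum()                          # U(s) = s_A + s_B + s_C + s_D

def proxy_utility(s, proxy):
    return s[proxy].sum()                   # the proxy ignores unmentioned attributes

def project_feasible(s, radius=10.0):
    """Project onto the feasible set {C(s) <= 0}, the ball of radius 10."""
    norm = np.linalg.norm(s)
    return s if norm <= radius else s * (radius / norm)

s = np.array([1.0, 1.0, 7.0, 7.0])          # high diversity and community well-being
proxy = np.array([0, 1])                    # proxy: ad revenue, engagement quality
grad = np.zeros(4)
grad[proxy] = 1.0                           # gradient of the (linear) proxy utility

for step in range(301):
    if step % 60 == 0:
        print(f"step {step:3d}  proxy {proxy_utility(s, proxy):6.2f}  "
              f"true {true_utility(s):6.2f}")
    s = project_feasible(s + 0.1 * grad)    # greedily improve the proxy only
# True utility rises briefly, then falls below its initial value of 16 as the
# unmentioned attributes are squeezed toward their lower bounds.
```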
3 Overoptimization
In this section, we identify the situations in which such results occur: when does a misaligned
proxy utility function actually cause utility loss? Specifically, we determine how human preferences
Figure 2: An illustrative example of our model with L = 4 and J = 2. Left: Proxy utility and true
utility eventually diverge as the agent overallocates resources from unreferenced attributes to the
proxy variables. Right: The true utility generated by optimizing all pairs of proxy attributes. The
utility generation is eventually negative in all cases because this example meets the conditions of
Theorem 2.
over states of the world interact with the constraint on feasible states in a way that guarantees that
optimization becomes costly. First, we show that if proxy optimization converges, it drives the
unmentioned attributes to their minimum values.
Theorem 1 For any continuous strictly increasing proxy utility function based on J < L attributes,
if s^(t) converges to some point s*, then s*_k = b_k for all k ∈ K.
This is not surprising, and may not even be suboptimal. Proxy optimization may not converge to a
finite state, and even when it does, convergence alone is not enough to make the limiting state bad.
We say that the problem is u-costly if, for any optimization sequence s^(t) with
lim sup_{t→∞} Ũ(s_J^(t)) = sup_{s∈S} Ũ(s_J), we have lim inf_{t→∞} U(s^(t)) ≤ u; furthermore, if
lim_{t→∞} Ũ(s_J^(t)) = sup_{s∈S} Ũ(s_J), then lim sup_{t→∞} U(s^(t)) ≤ u.
Essentially, this means that optimization is guaranteed to eventually yield utility no greater than u
for a given proxy utility function.
In Theorem 2, we show the most general conditions for the guarantee of overoptimization.
Theorem 2 Suppose we have utility function U and state space S. Then {s ∈ R^L : C(s) ≤
0 and U(s) ≥ u} is compact for all u ∈ R if and only if for any u ∈ R, any continuous strictly increasing
proxy utility function based on J < L attributes, and any k ∈ K, there exists a value B ∈ R such that
if b_k < B then optimization is u-costly.
Proofs for all theorems and propositions can be found in the supplementary material.
Therefore, under our model, there are cases where we can guarantee that, as optimization progresses,
overoptimization eventually occurs. There are two key criteria in Theorem 2. The first is that the
intersection between the feasible space {s ∈ R^L : C(s) ≤ 0} and the upper contour sets {s ∈ R^L :
U(s) ≥ u} of the utility function is compact. Neither set is compact on its own: {s ∈ R^L : C(s) ≤ 0}
extends to −∞ and the upper contour set for any u ∈ R extends to ∞. Loosely, compactness here
means that if you perturb the world too much in any direction, the result is either infeasible or
undesirable. The second is that the lower bound of at least one unmentioned attribute must be
sufficiently low: that attribute can be driven low enough that the resulting loss in utility becomes
arbitrarily bad before the attribute reaches its minimum value. Trying to increase attributes without
decreasing other attributes eventually hits the feasibility constraint. Thus, increasing any attribute
indefinitely requires decreasing some other attribute indefinitely, and past a certain point the tradeoff
is no longer worthwhile.
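As a sanity check, we can verify this compactness criterion directly for the recommendation example; the short derivation below is ours, not taken from the supplementary material.

```latex
% Checking the compactness condition of Theorem 2 for the running example.
\[
\begin{gathered}
\{s \in \mathbb{R}^4 : C(s) \le 0\} = \{s : s_A^2 + s_B^2 + s_C^2 + s_D^2 \le 100\}
  \quad \text{(a closed ball: closed and bounded)}, \\
\{s \in \mathbb{R}^4 : U(s) \ge u\} = \{s : s_A + s_B + s_C + s_D \ge u\}
  \quad \text{(a closed half-space)}.
\end{gathered}
\]
% The intersection of a closed, bounded set with a closed set is compact, so the
% condition holds for every $u$, and Theorem 2 applies to this example.
```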
It should be noted that, given J_max < L, this happens regardless of the optimization algorithm of the
robot. That is, even in the best-case scenario, the robot eventually causes decreases in utility. Hence,
regardless of the attributes selected for the proxy utility function, the robot’s sequence of states will
be unboundedly undesirable in the limit.
A reasonable question to ask here is what sort of utility and constraint functions lead to overopti-
mization. Intuitively, overoptimization occurs when tradeoffs between different attributes that may
initially be worthwhile eventually become counterproductive. This suggests either decreasing marginal
utility or increasing opportunity cost in each attribute.
We combine these two ideas in a term we refer to as sensitivity. We define the sensitivity of attribute
i to be (∂U/∂s_i)(∂C/∂s_i)^(−1). This is, to first order, how much utility changes with a normalized
change in the attribute’s value. Intuitively, we can think of sensitivity as “how much the human cares
about attribute i in the current state”. Notice that since U and C are increasing functions,
(∂U/∂s_i)(∂C/∂s_i)^(−1) is positive. The concepts of decreasing marginal utility and increasing
opportunity cost are both captured in this term if (∂U/∂s_i)(∂C/∂s_i)^(−1) decreases as s_i increases.
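For instance, in the recommendation example the sensitivities take a particularly simple form (our own computation):

```latex
% Sensitivity of attribute $i$ in the recommendation example, where
% $U(s) = s_A + s_B + s_C + s_D$ and $C(s) = s_A^2 + s_B^2 + s_C^2 + s_D^2 - 100$:
\[
\frac{\partial U}{\partial s_i}
\left( \frac{\partial C}{\partial s_i} \right)^{-1}
  = 1 \cdot \frac{1}{2 s_i}
  = \frac{1}{2 s_i},
\]
% which, for $s_i > 0$, is positive, decreasing in $s_i$, and tends to 0: the more
% of an attribute the principal already has, the less they care about adding to it.
```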
Proposition 1 A sufficient condition for {s ∈ R^L : C(s) ≤ 0 and U(s) ≥ u} to be compact for all
u ∈ R is the following: 1) (∂U/∂s_i)(∂C/∂s_i)^(−1) is non-increasing and tends to 0 for all i; 2) U and
C are both additively separable; and 3) ∂C/∂s_i ≥ η for some η > 0, for all i.
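For illustration, here is one family of utility and constraint functions, chosen by us rather than drawn from the paper, that satisfies all three conditions:

```latex
% One family of functions satisfying conditions 1)-3) of Proposition 1:
\[
U(s) = \sum_{i=1}^{L} \left( 1 - e^{-s_i} \right),
\qquad
C(s) = \sum_{i=1}^{L} s_i - B .
\]
% 1) the sensitivity is $e^{-s_i}$, non-increasing in $s_i$ and tending to 0;
% 2) both $U$ and $C$ are additively separable;
% 3) $\partial C / \partial s_i = 1 \ge \eta$ with $\eta = 1 > 0$.
% Proposition 1 then gives compactness, so Theorem 2 guarantees overoptimization
% once some unmentioned attribute's lower bound $b_k$ is sufficiently low.
```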
4 Mitigations
As shown above, simply giving a robot an individual objective function based on an incomplete
attribute set and leaving it alone yields undesirable results. This suggests two possible solutions.
In the first method, we consider optimization where the robot is able to maintain the state of any
attribute in the complete attribute set, even if the proxy utility function is still based on a proxy set. In
the second method, we modify our model to allow for regular interaction with the human.
In this section, we work under the assumption that {s ∈ R^L : C(s) ≤ 0 and U(s) ≥ u} is compact
for all u ∈ R, the same assumption that guarantees overoptimization in Section 3. Additionally, we
assume that U and C are both twice continuously differentiable, with U concave and C convex.
Notice that in our example, overoptimization occurs because the unmentioned attributes are affected,
eventually to a point where the change in utility from their decrease outweighs the increase in
utility from the proxy attributes. One idea to address this is to restrict the robot’s impact on these
unmentioned attributes. In the simplest case, we can consider how optimization proceeds if the robot
can somehow avoid affecting the unmentioned attributes.
We adjust our model so that the optimization sequence keeps unmentioned attributes constant: for every
t ∈ R+ and k ∈ K, s_k^(t) = s_k^(0). The robot then optimizes for the proxy utility, subject to this
restriction and the feasibility constraint.
This restriction can help ensure that overoptimization does not occur, specifically eliminating the case
where unmentioned attributes get reduced to arbitrarily low values. With this restriction, we can show
that optimizing for the proxy utility function does indeed lead to utility gain.
Proposition 2 For a starting state s^(0), define the proxy utility function Ũ(s_J) = U(s_J, s_K^(0)) for
any non-empty set of proxy attributes. For a state s, if s_K = s_K^(0), then U(s) = Ũ(s_J).

As a result of Proposition 2, overoptimization no longer occurs when using the proxy Ũ(s_J) = U(s_J, s_K^(0)). If
the robot only travels to states that do not impact the unmentioned attributes, then the proxy utility is
equal to the utility at those states. Hence, gains in proxy utility equate to gains in utility.
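The sketch below illustrates Proposition 2 in the recommendation example; the interior starting state, step size, and simple hill-climbing robot are assumptions made for this illustration.

```python
import numpy as np

# Impact-avoidance sketch: the robot may only change the proxy attributes
# (ad revenue and engagement quality); the unmentioned attributes stay frozen
# at their initial values. The interior starting state and the hill-climbing
# robot are assumptions, not the paper's construction.

def true_utility(s):
    return s.sum()                                    # U(s) = sum of attributes

def constraint(s):
    return np.dot(s, s) - 100.0                       # C(s) = ||s||^2 - 100

s0 = np.array([1.0, 1.0, 5.0, 5.0])                   # feasible interior start
proxy, unmentioned = np.array([0, 1]), np.array([2, 3])

s = s0.copy()
for _ in range(2000):
    candidate = s.copy()
    candidate[proxy] += 0.05                          # try to improve the proxy
    if constraint(candidate) <= 0.0:                  # accept only feasible moves
        s = candidate                                 # s_K never changes

assert np.allclose(s[unmentioned], s0[unmentioned])   # no impact on unmentioned
print("true utility:", true_utility(s0), "->", true_utility(s))
# With s_C and s_D held at 5, the robot raises s_A and s_B toward 5 each, and the
# proxy gain equals the true-utility gain (Proposition 2): roughly 12 -> 20.
```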
This approach requires the robot to keep the values of unmentioned attributes constant. Fundamentally,
it is a difficult problem to require that a robot avoid or minimize impact on a presumably large and
unknown set of attributes. Initial research in impact minimization (3; 6; 23) attempts to do this by
restricting changes to the overall state of the world, but this will likely remain a challenging idea to
implement robustly.
4.2 Human Interactive Solutions
We now allow regular interaction between the human and the robot. Time is divided into intervals of
length δ; at the start of the T-th interval (time Tδ, for T = 0, 1, 2, ...), the following exchange takes place:
1. The human sees s^(t) for t ∈ [0, Tδ] and chooses either a proxy utility function Ũ^(T) or OFF.
2. The robot receives either Ũ^(T) or OFF from the human. The robot outputs a rate function
   f^(T)(t). If the signal the robot receives is OFF, then f^(T) is identically zero. Otherwise, f^(T)(t) fulfills
   the property that for t ∈ (Tδ, ∞), Ũ^(T)(s^(Tδ) + ∫_{Tδ}^t f^(T)(u) du) is increasing (if possible)
   and tends to sup_{s∈S} Ũ^(T)(s) through feasible states. Furthermore, if Ũ^(T) = Ũ^(T−1), then
   f^(T) = f^(T−1).
3. For t ∈ [Tδ, (T+1)δ], s^(t) = s^(Tδ) + ∫_{Tδ}^t f^(T)(u) du.
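A schematic of this game loop is sketched below; the function names and the simplification that the human sees only the current state are ours, not part of the protocol above.

```python
from typing import Callable, Optional
import numpy as np

# Skeleton of the interactive game. `human_policy` plays step 1 (it returns a
# proxy utility function or None for OFF); `robot_step` plays steps 2-3 over one
# interval of length delta. For simplicity the human sees only the current state.
ProxyUtility = Callable[[np.ndarray], float]

def run_protocol(s0: np.ndarray,
                 human_policy: Callable[[np.ndarray], Optional[ProxyUtility]],
                 robot_step: Callable[[np.ndarray, ProxyUtility, float], np.ndarray],
                 delta: float,
                 num_intervals: int) -> np.ndarray:
    s = s0.copy()
    for T in range(num_intervals):
        proxy = human_policy(s)              # choose a proxy utility or OFF
        if proxy is None:                    # OFF: the rate function is zero
            return s
        s = robot_step(s, proxy, delta)      # advance s over [T*delta, (T+1)*delta]
    return s
```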
We see this game encompasses the original model. If we have each Ũ (T ) equal the same function Ũ ,
then the optimization sequence is equivalent to the situation where the human just sets one unchanging
proxy utility function.
With human intervention, we no longer have the guarantee of overoptimization, because the human
can simply shut the robot off before anything happens. The question now becomes: how much utility
can the robot deliver before it needs to be turned off?
In the worst-case scenario, the robot moves in a way that subtracts an arbitrarily large amount from
one of the unmentioned attributes, while gaining infinitesimally in one of the proxy attributes. This
improves proxy utility, since a proxy attribute is increased. However, for sufficiently small increases in
the proxy attribute or sufficiently large decreases in the unmentioned attribute, actual utility decreases.
To prevent this from happening, the robot needs to be shut down immediately, which yields no utility.
Proposition 3 The maxmin solution yields 0 utility, obtained by immediately sending the OFF signal.
We next consider a robot that is efficient: roughly, it never wastes resources, in the sense that it never
moves to a state when another feasible state is available that is at least as high in every attribute,
with strict inequality in at least one attribute. Whenever the efficient robot receives a new proxy
utility function, its movements are restricted to the efficiently feasible states from its current state.
While tradeoffs between proxy and unmentioned attributes can still occur, “resources” freed up by
decreasing unmentioned attributes are entirely allocated to proxy attributes.
Under this assumption, we can guarantee that increases in proxy utility will increase true utility in the
efficiently reachable neighborhood of a given state by choosing which attributes to include in the
proxy carefully. Intuitively, the proxy set should be attributes we “care most about" in the current
world state. That way, efficient tradeoffs between these proxy and unmentioned attributes contribute
to positive utility gain.
Theorem 3 Start with state s^(0). Define the proxy utility function Ũ(s_J) = U(s_J, s_K^(0)), where the
set of proxy attributes consists of the J attributes with the strictly largest sensitivities at state s^(0).
There exists a neighborhood around s^(0) such that if s is efficiently reachable and Ũ(s_J) > Ũ(s_J^(0)),
then U(s) > U(s^(0)).
From Theorem 3, for every state where the J_max + 1 largest sensitivities are not all equal, we
can guarantee improvement under efficient optimization within a neighborhood around the state.
Intuitively, this is the region around the starting point in which the same J attributes remain the
ones the human “cares the most about”.
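Applying Theorem 3 to the recommendation example, with the sensitivities 1/(2 s_i) computed earlier and the illustrative initial state assumed in the sketches above, gives the intuitive prescription:

```latex
% Theorem 3's proxy choice in the recommendation example, using the sensitivity
% $1/(2 s_i)$ computed earlier and an assumed initial state $s^{(0)} = (1, 1, 7, 7)$
% with high diversity and community well-being:
\[
\frac{1}{2 s_A^{(0)}} = \frac{1}{2 s_B^{(0)}} = 0.5
  \;>\;
\frac{1}{2 s_C^{(0)}} = \frac{1}{2 s_D^{(0)}} \approx 0.07 ,
\]
% so with $J = 2$ the local proxy is
% $\tilde{U}(s_A, s_B) = U(s_A, s_B, s_C^{(0)}, s_D^{(0)})$. Once optimization raises
% $s_A, s_B$ above $s_C, s_D$, the ranking flips and the proxy must be updated.
```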
Based on this, the human can construct a proxy utility function to use locally, where we can guarantee
improvement in utility. Once the sequence leaves this neighborhood, the human alters the proxy
utility function or halts the robot accordingly. Done repeatedly, the human can string together these
steps for guaranteed overall improvement. By Theorem 3, as long as δ is sufficiently small relative
to the rate of optimization, this procedure can be run with guaranteed improvement until the top J + 1
attributes have equal sensitivities.
Proposition 4 At each timestep T, let J^(T) be the J most sensitive attributes, and let the proxy utility
be Ũ^(T)(s_J) = U(s_J, s_K^(Tδ)). If ||f|| < ε and the εδ-ball around a given state s is contained in the
neighborhood from Theorem 3, then interactive optimization yields guaranteed improvement.
Based on this, we can guarantee that an efficient robot can provide benefit, as long as the top
J_max + 1 attributes are not all equal in sensitivity and the robot’s rate of optimization is bounded.
Essentially, this rate restriction is a requirement that the robot not change the world too quickly
relative to the time that humans take to react to it.
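A minimal simulation of this interactive scheme in the recommendation example is sketched below. It reuses the simple projected-ascent robot from the earlier sketch (an assumption; it is not the efficient robot of Theorem 3) and lets the human re-select the proxy every interval as the J most sensitive attributes.

```python
import numpy as np

# Interactive sketch: every interval the human re-selects the proxy as the two
# attributes with the largest sensitivity 1/(2 s_i), then the robot takes a small
# projected-ascent step on that proxy. The robot, initial state, step size, and
# interval length are all illustrative assumptions.

def true_utility(s):
    return s.sum()

def project_feasible(s, radius=10.0):
    norm = np.linalg.norm(s)
    return s if norm <= radius else s * (radius / norm)

def most_sensitive(s, J=2):
    sensitivity = 1.0 / (2.0 * s)            # sensitivity of each attribute
    return np.argsort(sensitivity)[-J:]      # indices of the J largest

s = np.array([1.0, 1.0, 7.0, 7.0])
for interval in range(400):                  # each iteration plays one delta-interval
    proxy = most_sensitive(s)                # human updates the proxy utility
    grad = np.zeros(4)
    grad[proxy] = 1.0
    s = project_feasible(s + 0.05 * grad)    # robot improves the current proxy
    if interval % 100 == 0:
        print(interval, np.round(s, 2), round(true_utility(s), 2))
# True utility climbs from 16 toward the optimum of 20 at (5, 5, 5, 5); once the
# sensitivities are (nearly) equal, no proxy choice yields further improvement.
```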
Key to this solution is the preservation of interactivity. We need to ensure, for example, that the robot
does not hinder the human’s ability to adjust the proxy utility function (16).
With either of the two methods mentioned above, we show guaranteed utility gain compared to the
initial state. However, our results say nothing about the amount of utility generated. In an ideal world,
the system would reach an optimal state s*, where U(s*) = max_{s∈S} U(s). Neither approach presented
so far reaches an optimal state; however, by combining interactivity and impact avoidance, we can
guarantee that the solution converges to an optimal outcome.
In this case, since unmentioned attributes remain unchanged in each step of optimization, we want to
ensure that we promote tradeoffs between attributes with different levels of sensitivity.
Proposition 5 Let J^(T) consist of the most and least sensitive attributes at timestep T. Let
Ũ^(T)(s_J) = U(s_J, s_K^(Tδ)). Then this solution converges to a (set of) human-optimal state(s).
While this is optimal, we require the assumptions of both the impact-minimization robot and the
efficient, interactive robot, each of which individually presents complex challenges in implementation.
5 Conclusion and Further Work
In this paper, we present a novel model of value (mis)alignment between a human principal and
an AI agent. Fundamental to this model is that the human provides an incomplete proxy of their
own utility function for the AI agent to optimize. Within this framework, we derive necessary and
sufficient theoretical conditions for value misalignment to be arbitrarily costly. Our results dovetail
with an emerging literature that connects harms from AI systems to shallow measurement of complex
values (32; 20; 5). Taken together, we view this as strong evidence that the ability of AI systems to
be useful and beneficial is highly dependent on our ability to manage the fundamental gap between
our qualitative goals and their representation in digital systems.
Additionally, we show that abstract representations of techniques currently being explored by other
researchers can yield solutions with guaranteed utility improvement. Specifically, impact minimiza-
tion and human interactivity, if implemented correctly, allow AI agents to provide positive utility
in theory. In future work, we hope to generalize the model to account for fundamental preference
uncertainty (e.g., as in (12)), limits on human rationality, and the aggregation of multiple human
preferences. We are optimistic that research into the properties and limitations of the communication
channel within this framework will yield fruitful insights in value alignment, specifically, and AI,
generally.
Broader Impact
As AI systems become more capable in today’s society, the consequences of misspecified reward
functions increase as well. Instances where the goal of the AI system and the preferences of individuals
diverge are starting to emerge in the real world. For example, content recommendation algorithms
optimizing for clicks cause clickbait and misinformation to proliferate (33). Our work rigorously
defines this general problem and suggests two separate approaches for dealing with incomplete or
misspecified reward functions. In particular, we argue that, in the absence of a full description of
attributes, the incentives for real-world systems need to be plastic and dynamically adjust based on
changes in behavior of the agent and the state of the world.
References
[1] Abbeel, P., and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning (2004), p. 1.
[2] Amershi, S., Cakmak, M., Knox, W. B., and Kulesza, T. Power to the people: The role of humans in interactive machine learning. AI Magazine 35, 4 (2014), 105–120.
[3] Amodei, D., and Clark, J. Faulty reward functions in the wild, 2016. URL https://blog.openai.com/faulty-reward-functions (2016).
[4] Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565 (2016).
[5] Andrus, M., and Gilbert, T. K. Towards a just theory of measurement: A principled social measurement assurance program for machine learning. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (2019), pp. 445–451.
[6] Armstrong, S., and Levinstein, B. Low impact artificial intelligence. arXiv preprint arXiv:1705.10720 (2017).
[7] Bajcsy, A., Losey, D. P., O’Malley, M. K., and Dragan, A. D. Learning robot objectives from physical human interaction. Proceedings of Machine Learning Research 78 (2017), 217–226.
[8] Bostrom, N. Superintelligence: Paths, Dangers, Strategies. Oxford, 2014.
[9] Boutilier, C. A POMDP formulation of preference elicitation problems. In AAAI/IAAI (2002), Edmonton, AB, pp. 239–246.
[10] Bradley, R. A., and Terry, M. E. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39, 3/4 (1952), 324–345.
[11] Braziunas, D., and Boutilier, C. Preference elicitation and generalized additive utility. In AAAI (2006), vol. 21.
[12] Chan, L., Hadfield-Menell, D., Srinivasa, S., and Dragan, A. D. The assistive multi-armed bandit. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI) (2019), IEEE, pp. 354–363.
[13] Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences, 2017.
[14] Eckersley, P. Impossibility and uncertainty theorems in AI value alignment (or why your AGI should not have a utility function). arXiv preprint arXiv:1901.00064 (2018).
[15] Griffith, S., Subramanian, K., Scholz, J., Isbell, C. L., and Thomaz, A. L. Policy shaping: Integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems (2013), pp. 2625–2633.
[16] Hadfield-Menell, D., Dragan, A., Abbeel, P., and Russell, S. The off-switch game. In Workshops at the Thirty-First AAAI Conference on Artificial Intelligence (2017).
[17] Hadfield-Menell, D., and Hadfield, G. K. Incomplete contracting and AI alignment. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (2019), pp. 417–422.
[18] Hadfield-Menell, D., Milli, S., Abbeel, P., Russell, S. J., and Dragan, A. Inverse reward design. In Advances in Neural Information Processing Systems (2017), pp. 6765–6774.
[19] Hadfield-Menell, D., Russell, S. J., Abbeel, P., and Dragan, A. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems (2016), pp. 3909–3917.
[20] Jacobs, A. Z., and Wallach, H. Measurement and fairness. arXiv preprint arXiv:1912.05511 (2019).
[21] Kerr, S. On the folly of rewarding A, while hoping for B. Academy of Management Journal 18, 4 (1975), 769–783.
[22] Klein, B., Crawford, R. G., and Alchian, A. A. Vertical integration, appropriable rents, and the competitive contracting process. The Journal of Law and Economics 21, 2 (1978), 297–326.
[23] Krakovna, V., Orseau, L., Kumar, R., Martic, M., and Legg, S. Penalizing side effects using stepwise relative reachability. arXiv preprint arXiv:1806.01186 (2018).
[24] Leike, J., Martic, M., Krakovna, V., Ortega, P. A., Everitt, T., Lefrancq, A., Orseau, L., and Legg, S. AI safety gridworlds. arXiv preprint arXiv:1711.09883 (2017).
[25] Manheim, D., and Garrabrant, S. Categorizing variants of Goodhart’s law. arXiv preprint arXiv:1803.04585 (2018).
[26] Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In ICML (2000), vol. 1, pp. 663–670.
[27] Omohundro, S. M. The basic AI drives. In AGI (2008), vol. 171, pp. 483–492.
[28] Russell, S. J. Human Compatible: Artificial Intelligence and the Problem of Control. Viking, 2019.
[29] Russell, S. J., and Norvig, P. Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2010.
[30] Shavell, S. Damage measures for breach of contract. The Bell Journal of Economics (1980), 466–490.
[31] Stray, J., Adler, S., and Hadfield-Menell, D. What are you optimizing for? Aligning recommender systems with human values. arXiv preprint arXiv:2002.08512 (2020).
[32] Thomas, R., and Uminsky, D. The problem with metrics is a fundamental problem for AI. arXiv preprint arXiv:2002.08512 (2020).
[33] Tufekci, Z. YouTube’s recommendation algorithm has a dark side. Scientific American (2019).
[34] Wagner, K. Inside Twitter’s ambitious plan to change the way we tweet. Vox (2019).
[35] Williamson, O. E. Markets and Hierarchies. New York 2630 (1975).
[36] Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI (2008), vol. 8, Chicago, IL, USA, pp. 1433–1438.