Strategic Experimentation With Poisson Bandits
Godfrey Keller
Sven Rady
Department of Economics, University of Munich
1. Introduction
When firms cooperate in a research joint venture, each faces a dynamic problem in
which it can perform repeated costly experiments (that is, spend time, effort, and money
on the purported innovation) but also learn from the experimental observations of the
Copyright © 2010 Godfrey Keller and Sven Rady. Licensed under the Creative Commons Attribution-NonCommercial License 3.0. Available at http://econtheory.org. DOI: 10.3982/TE595.
276 Keller and Rady Theoretical Economics 5 (2010)
obtain the encouragement effect for this particular equilibrium only. In contrast, we are
able to establish this effect for all Markov perfect equilibria of the Poisson model.
We then show that there is no MPE in which all players use cutoff strategies, i.e., use
the risky arm exclusively when the probability they assign to the risky arm being good
is above some cutoff, and use the safe arm when it is below. In fact, the player who is
supposed to use the least optimistic cutoff in a purported MPE in cutoff strategies always
has an incentive to deviate to the safe action at the second least optimistic cutoff, where
one of the other players is supposed to switch action.
A symmetric MPE thus necessarily requires the players to choose an interior alloca-
tion of their resource at some beliefs. We show that the Poisson model admits a unique
symmetric MPE. As in its counterpart in the Bolton–Harris model, all players use the
risky arm exclusively when they are sufficiently optimistic, the safe arm when they are
sufficiently pessimistic, and both arms simultaneously at intermediate beliefs. Further,
the acquisition of information is slowed down so severely near the lower bound of the
intermediate range that the players’ beliefs cannot reach this bound in finite time.
This strongly suggests that asymmetric equilibria where a last experimenter keeps
the rate of information acquisition bounded away from zero at pessimistic beliefs ought
to be more efficient than the symmetric equilibrium. Bolton and Harris (2000), who study
the undiscounted limit of the Brownian model, and Keller et al. (2005) confirm this by
constructing a variety of asymmetric MPE that dominate the symmetric MPE in terms
of aggregate payoffs. However, they do so in environments without the encouragement
effect. In both cases, it is the second of the two conditions stated above that fails. In the
undiscounted Brownian model, the catching-up criterion implies that best responses
do not depend on players’ continuation values, so there is no channel through which
the opponents’ future actions can influence a player’s current optimal choice. In the
exponential model, the only way for the pioneer to make the other players return to the
risky arm is to have a success himself, but since such a success is fully revealing, he will
then not benefit at all from his opponents’ future actions.
The construction of asymmetric equilibria in Keller et al. (2005) relies on a backward-
induction approach anchored at the single-agent cutoff belief. The presence of the en-
couragement effect obviously rules out such an approach in the Poisson model. The
construction of asymmetric equilibria is further complicated by the “nonlocal” nature
of the problem: each player’s payoff depends not only on the action profiles and con-
tinuation values prevailing in a neighborhood of the current belief, but also on the con-
tinuation value at the belief that would be reached after a success on a risky arm. Our
second main contribution is to provide a constructive solution to this problem.
The approach that we use here rests on two ideas. The first is to give the players a
common continuation value after any success on a risky arm; the second is to let them
alternate finitely many times between the roles of experimenter and free-rider before all
experimentation stops. Assigning common continuation values after successes allows
us to construct the players’ average payoff function before assigning individual strate-
gies, which simplifies the problem considerably. Letting players take turns playing risky
struct an equilibrium without a last experimenter, that is, one where players switch actions infinitely often
over a finite time interval; cf. Section 6.2 of Keller et al. (2005). As infinite switching is not required for such a
Pareto improvement in the Poisson model, we do not consider it here and instead restrict players to Markov
strategies with finitely many discontinuities.
by giving him a higher continuation value after a success than his opponent. This yields
equilibria with progressively increasing payoff asymmetry between the players and in
the limit replicates the most inequitable equilibria of the exponential model.
We find that rewarding the last experimenter raises average payoffs at relatively pes-
simistic beliefs (because the last experimenter is willing to play risky over a larger range
of beliefs), but lowers them at optimistic beliefs (because the intensity of experimenta-
tion drops from 2 to 1 earlier when the difference between the players’ equilibrium pay-
offs increases). This means, in particular, that although we can approach the best two-
player equilibria of the exponential model arbitrarily closely with asymmetric equilibria
of the Poisson model, the latter do not constitute uniformly best equilibria themselves.
More generally, to the extent that we can always reward the last experimenter some-
what more, this suggests the conjecture that the Poisson model admits no uniformly
best MPE.3
In the alternation phase of some of the examples we calculate, we observe an an-
ticipation effect that is already present in the infinite-switching equilibria of Keller et
al. (2005): the value function of a lone experimenter decreases in the current belief over
some range. The intuition for this effect is that, conditional on no impending success
on his own risky arm, the player will soon be able to enjoy a free ride, and the lower the
current belief, the sooner this time will come.
For each type of equilibrium that we construct, we provide representations of the
players’ payoff functions that are explicit up to some constant of integration. While
these representations cannot be used to derive comparative statics results, say, they
greatly simplify the computation of numerical examples. Asymmetric equilibria with
common continuation values after a success just require a one-dimensional search for
an implicitly defined constant of integration. In the two-player equilibria that reward
the last experimenter, we have to solve for two of these constants. Either task is easily
carried out in a spreadsheet.
The Poisson model is a natural analog in continuous time of the two-outcome bandit
model in Rothschild (1974), the first paper to use the bandit framework in economics;
see Bergemann and Välimäki (2008) for a survey of the ensuing literature. Through its
focus on bandit learning as a canonical model of strategic experimentation in teams, our
paper is most closely related to Bolton and Harris (1999, 2000) and Keller et al. (2005),
sharing with them the assumption that the players face risky arms of a common type.
Klein and Rady (2008) and Klein (2010), by contrast, consider two players who face risky
arms of opposite types, one good and one bad. Strulovici (2010) studies majority voting
in a collective decision problem where the type of the risky arm varies across individuals.
Our paper is further related to a strand of the industrial organization literature that
studies R&D investment games under learning. Malueg and Tsutsui (1997) investigate a
model of a patent race with learning where the arrival time of an innovation is exponen-
tially distributed given the stock of knowledge, implying the same deterministic belief
3 This does not preclude the possibility that for each belief, there exists a Markov perfect equilibrium
that maximizes average payoffs when the game is started at this particular belief. An investigation of this
possibility and of the upper envelope of average equilibrium payoffs is beyond the scope of this paper.
revision prior to the innovation as our model exhibits in between lump sums. Build-
ing on the exponential bandit framework of Keller et al. (2005), Besanko and Wu (2008)
study the effects of post-innovation market structure on cooperative and competitive R&D investments. Décamps and Mariotti (2004), Hopenhayn and Squintani (2008), and Moscarini and Squintani (2007) all analyze models where news arrives
in the form of the increments of a (compound) Poisson process; as they consider stop-
ping games with private information, however, the resulting strategic interactions are
very different from those in our model.
The remainder of the paper is organized as follows. Section 2 sets up the Poisson
bandit model. Section 3 establishes the efficient benchmark. Section 4 introduces the
strategic problem, establishes the encouragement effect, and proves the impossibility
of cutoff equilibria. Section 5 presents the unique symmetric MPE. Section 6 constructs
asymmetric equilibria in which players take turns playing the risky arm before all ex-
perimentation stops. Section 7 studies two-player equilibria that reward the last exper-
imenter. Section 8 contains concluding remarks. Proofs for all of the corollaries and for
Proposition 7 are relegated to the Appendix.
2. Poisson bandits
The setup of the model is similar to that of Keller et al. (2005), the principal difference
being that here a bad risky arm yields positive payoffs (as opposed to zero), which means
that a success does not reveal the risky arm to be good. For mathematical details on Pois-
son bandits, see Presman (1990) or Presman and Sonin (1990); for the optimal control of
piecewise deterministic processes more broadly, see Davis (1993).
Time t ∈ [0, ∞) is continuous, and the discount rate is r > 0. There are N ≥ 1 players, each endowed with one unit of a perfectly divisible resource per unit of time and each facing a two-armed bandit problem. Lump-sum rewards on the risky arm R are independent draws from a time-invariant distribution on R \ {0} with a known mean h > 0. If a player allocates the fraction kt ∈ [0, 1] of her resource to R over an interval of time [t, t + dt[ and, consequently, the fraction 1 − kt to the safe arm S, then she receives the expected payoff (1 − kt)s dt from S, where s > 0 is a constant known to all players. The probability that she receives a lump-sum payoff from R at some point in the interval is kt λθ dt, where θ = 1 if R is good, θ = 0 if R is bad, and λ1 > λ0 > 0 are constants known to all players. Therefore, the overall expected payoff increment conditional on θ is [(1 − kt)s + kt λθ h] dt. We assume that λ0 h < s < λ1 h, so each player prefers R to S if R is good and prefers S to R if R is bad.
However, players do not know whether the risky arm is good or bad; they start with
a common prior belief about θ. Thereafter, all players observe each other’s actions and
outcomes, so they hold common posterior beliefs throughout time. With pt denoting
the subjective probability at time t that players assign to the risky arm being good, a
player's expected payoff increment conditional on all available information is [(1 − kt)s + kt λ(pt)h] dt with

λ(p) = pλ1 + (1 − p)λ0.
Given a player’s actions {kt }t≥0 such that kt is measurable with respect to the informa-
tion available at time t, her total expected discounted payoff, expressed in per-period
units, is

E[ ∫0∞ r e−rt [(1 − kt)s + kt λ(pt)h] dt ],
where the expectation is over the stochastic processes {kt } and {pt }. We note that a
player’s payoff depends on others’ actions only through their effect on the evolution of
beliefs, which constitute a natural state variable.
To derive the law of motion of beliefs, suppose that over the interval of time [t, t + Δt[, player n allocates the constant fraction knt of her resource to her risky arm. The sum Kt = k1t + · · · + kNt measures how much of the overall resource is allocated to risky arms; we call this number the intensity of experimentation. Conditional on the type of the risky arm, the arrival of lump sums is independent across players. If the risky arms are good, the probability of none of the players receiving a lump-sum payoff is e−Kt λ1 Δt, and if they are bad, this probability is e−Kt λ0 Δt. Therefore, given no lump-sum payoff arriving in [t, t + Δt[, the belief at the end of that time period is

pt+Δt = pt e−Kt λ1 Δt / [(1 − pt)e−Kt λ0 Δt + pt e−Kt λ1 Δt]
by Bayes' rule. As long as no lump sum arrives, the belief thus evolves smoothly with infinitesimal increment dpt = −Kt λpt (1 − pt) dt, where λ = λ1 − λ0. However, if any of the players receives a lump sum at time t, the belief jumps up from pt− (the limit of beliefs held before the arrival of the lump sum) to j(pt−), where we write

j(p) = λ1 p / λ(p)

for the function that describes beliefs after a success on a risky arm.
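The two components of the belief dynamics can be sketched numerically. The parameter values and function names below are our own illustrative choices, not the paper's:

```python
import math

lam1, lam0 = 2.0, 0.5   # arrival rates on a good / bad risky arm (assumed values)

def no_news_update(p, K, dt):
    """Posterior after an interval of length dt with intensity K and no success (Bayes' rule)."""
    good = p * math.exp(-K * lam1 * dt)
    bad = (1 - p) * math.exp(-K * lam0 * dt)
    return good / (good + bad)

def jump(p):
    """Posterior j(p) immediately after a success on a risky arm."""
    return lam1 * p / (lam1 * p + lam0 * (1 - p))
```

For small dt the no-news update is approximately p − K(λ1 − λ0)p(1 − p) dt, matching the drift term above, while a success moves the belief up; with these rates, jump(0.5) = 0.8.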
We restrict players to Markovian strategies kn : [0, 1] → [0, 1] with the left limit belief pt− as the state variable, so that the action player n takes at time t is kn(pt−).4 We impose the following restrictions on these strategies: (i) kn is left-continuous; (ii) there is a finite partition of [0, 1] into intervals of positive length on each of which kn is Lipschitz-continuous. By standard results, each profile (k1, k2, …, kN) of such strategies induces a well defined law of motion for players' common beliefs and well defined payoff functions. A simple strategy is one that takes values in {0, 1} only, meaning that the player uses one arm exclusively at any given point in time. Finally, a strategy kn is a cutoff strategy if there is a belief p̂ such that kn(p) = 1 for all p > p̂ and kn(p) = 0 otherwise.
4 By definition, p0− = p0 . Note that pt− = pt at almost all t. Working with pt− instead of pt merely
enforces the informational restriction that the action taken at time t cannot be conditioned on the arrival
of a lump sum at that time.
As a benchmark, a myopic agent simply weighs the short-run payoff from playing the safe arm, s, against what he expects from playing the risky arm, λ(p)h. So he uses the cutoff belief

pm = (s − λ0 h) / (λh),

playing R for p > pm and S for p ≤ pm.
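As a quick numerical sanity check (with assumed parameter values of our own), the myopic cutoff indeed equates the two expected flow payoffs:

```python
lam1, lam0, h, s = 2.0, 0.5, 1.0, 1.0   # assumed values satisfying lam0*h < s < lam1*h

def lam(p):
    """Expected arrival rate of lump sums given belief p."""
    return p * lam1 + (1 - p) * lam0

# Myopic cutoff: the belief at which the expected risky flow payoff lam(p)*h equals s.
p_m = (s - lam0 * h) / ((lam1 - lam0) * h)
```

With these numbers p_m = 1/3, and lam(p_m) * h = s as required.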
Consider N players jointly maximizing their average expected payoff. By the same ar-
guments as in Keller et al. (2005), the value function for the cooperative, expressed as
average payoff per agent, satisfies the Bellman equation

u(p) = s + max_{K ∈ [0, N]} K[b(p, u) − c(p)/N],

where

c(p) = s − λ(p)h

is the opportunity cost of playing R and

b(p, u) = λ(p)[u(j(p)) − u(p)]/r − λp(1 − p)u′(p)/r

is the expected benefit of playing R. The latter has two parts: a discrete improvement in the overall payoff after a success and a marginal decrease otherwise.5
If the shared opportunity cost of playing R exceeds the full expected benefit, the
optimal choice is K = 0 (all agents use S exclusively) and u(p) = s. Otherwise, K = N is
optimal (all agents use R exclusively) and u satisfies the first-order ordinary differential-
difference equation (henceforth ODDE)
λp(1 − p)u′(p) − λ(p)[u(j(p)) − u(p)] + (r/N)u(p) = (r/N)λ(p)h. (1)
A particular solution to this equation is u(p) = λ(p)h, the expected per capita payoff
from all agents using the risky arm forever.
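One can check this particular solution numerically. The sketch below (assumed parameter values; dlam stands for λ = λ1 − λ0) evaluates the residual of (1) at the candidate u(p) = λ(p)h:

```python
r, N = 0.1, 2
lam1, lam0, h = 2.0, 0.5, 1.0            # assumed values
dlam = lam1 - lam0

def lam(p): return p * lam1 + (1 - p) * lam0
def j(p): return lam1 * p / lam(p)       # belief after a success

def odde_residual(u, u_prime, p):
    """Left side minus right side of the ODDE (1) for a candidate solution u."""
    return (dlam * p * (1 - p) * u_prime(p)
            - lam(p) * (u(j(p)) - u(p))
            + (r / N) * u(p)
            - (r / N) * lam(p) * h)

u = lambda p: lam(p) * h                 # candidate particular solution
u_prime = lambda p: dlam * h             # its derivative
```

The residual vanishes at every belief, confirming that λ(p)h solves (1).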
The option value of being able to change to the safe arm is then captured by the solution to the homogeneous equation, for which we try u0(p) = (1 − p)Ω(p)μ for some μ > 0 to be determined,6 where

Ω(p) = (1 − p)/p

is the odds ratio.
5 Infinitesimal changes of the belief are always downward, so it is, in fact, the left-hand derivative of the
value function that matters here. This observation will turn out to be of relevance in asymmetric equilibria
of the strategic experimentation game.
Inserting these into the homogeneous equation and simplifying leads to the requirement that

r/N + λ0 − μλ = λ0 (λ0/λ1)μ. (2)
As a function of μ, the left-hand side of (2) is a negatively sloped straight line that cuts the
vertical axis at r/N + λ0 . The right-hand side is a decreasing exponential function which
tends to 0 as μ → +∞, tends to ∞ as μ → −∞, and cuts the vertical axis at λ0 . Thus
the above equation in μ has two solutions, one positive and one negative; we write μN
for the positive solution, which obviously lies between r/(Nλ) (the value of μ where the
left-hand side of (2) equals λ0 ) and r/(Nλ) + λ0 /λ (the value of μ where it equals 0). As
the left-hand side of (2) rises with r/N, we also see that μN is increasing in the discount
rate and decreasing in the number of agents.
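Equation (2) has no closed form for μN, but the bracketing just described makes a bisection search immediate. A sketch with assumed parameter values:

```python
r, N = 0.1, 2
lam1, lam0 = 2.0, 0.5                    # assumed arrival rates
dlam = lam1 - lam0

def gap(mu):
    """Left-hand side minus right-hand side of equation (2)."""
    return r / N + lam0 - mu * dlam - lam0 * (lam0 / lam1) ** mu

# mu_N lies between r/(N*dlam) (where the line equals lam0) and
# r/(N*dlam) + lam0/dlam (where the line equals 0); gap() changes sign in between.
lo, hi = r / (N * dlam), r / (N * dlam) + lam0 / dlam
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if gap(mid) > 0 else (lo, mid)
mu_N = 0.5 * (lo + hi)
```

A spreadsheet search, as the authors suggest for the equilibrium constants, would work equally well here.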
The solution to the ODDE for K = N is thus of the form

VN(p) = λ(p)h + C(1 − p)Ω(p)μN (3)

with C a constant of integration. The optimal cutoff belief is

p∗N = μN(s − λ0 h) / [(μN + 1)(λ1 h − s) + μN(s − λ0 h)] (4)
such that below the cutoff it is optimal for all to play S exclusively and above the cutoff it
is optimal for all to play R exclusively. The value function VN∗ for the N-agent cooperative
is given by
VN∗(p) = λ(p)h + c(p∗N) [(1 − p)/(1 − p∗N)] [Ω(p)/Ω(p∗N)]μN (5)

when p > p∗N and by VN∗(p) = s otherwise.
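Continuing with the same assumed parameter values, value matching and smooth pasting of VN∗ at the cutoff (4) can be verified directly:

```python
r, N = 0.1, 2
lam1, lam0, h, s = 2.0, 0.5, 1.0, 1.0    # assumed values
dlam = lam1 - lam0

def lam(p): return p * lam1 + (1 - p) * lam0
def c(p): return s - lam(p) * h           # opportunity cost of playing R
def Omega(p): return (1 - p) / p          # odds ratio

# Solve equation (2) for mu_N by bisection.
def gap(mu): return r / N + lam0 - mu * dlam - lam0 * (lam0 / lam1) ** mu
lo, hi = r / (N * dlam), r / (N * dlam) + lam0 / dlam
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if gap(mid) > 0 else (lo, mid)
mu_N = 0.5 * (lo + hi)

# Cutoff (4) and value function (5) above the cutoff.
p_star = mu_N * (s - lam0 * h) / ((mu_N + 1) * (lam1 * h - s) + mu_N * (s - lam0 * h))

def V(p):
    return lam(p) * h + c(p_star) * (1 - p) / (1 - p_star) * (Omega(p) / Omega(p_star)) ** mu_N
```

Value matching V(p_star) = s holds by construction; smooth pasting can be checked with a small finite difference at p_star.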
6 This guess can be obtained by extrapolation from the limiting case λ0 = 0 studied in Keller et al. (2005). In this case, j(p) = 1 and u(j(p)) = λ1 h, so (1) becomes a linear differential equation; the above function u0 is easily seen to solve the corresponding homogeneous equation for μ = r/(Nλ1). A more systematic approach is to write the solution of the homogeneous equation as 1 − p times some unknown function, reflecting the fact that the option to switch from R to S is valuable only if the risky arm is bad. After a change of the independent variable from p to ln Ω(p), this unknown function then solves a linear ODDE with constant coefficients and constant delay to which results from Bellman and Cooke (1963) can be applied.
7 The planner’s solution in Bolton and Harris (1999) has the same structure. Only the expression for the
expected current payoff from a risky arm and the exponent of the odds ratio differ across setups. Cohen and
Solan (2009) show that this continues to be true when the risky arm generates payoffs according to a Lévy
process (that is, a continuous-time process with stationary independent increments) with a binary prior on
its characteristics.
Proof. The expression for p∗N and the constant of integration in (5) are obtained by imposing VN∗(p∗N) = s (value matching) and (VN∗)′(p∗N) = 0 (smooth pasting). Then b(p, VN∗) falls short of c(p)/N to the left of p∗N, coincides with it at p∗N, and exceeds it to
the right of p∗N . So VN∗ solves the Bellman equation, with the maximum being achieved
at the intensity of experimentation stated in the proposition.
The above proposition determines the efficient strategies. As in Bolton and Harris
(1999) and Keller et al. (2005), it is efficient to use a common cutoff strategy; the cutoff
increases in s and μN (and hence in r/N). The efficient intensity of experimentation
exhibits a bang-bang feature, being maximal when the current belief is above p∗N and
minimal when it is below.
on [0 1], with the second term on the right-hand side measuring the benefit of the in-
formation generated by the other players.
A strategy k∗n for player n is a best response against his opponents’ strategies if and
only if the resulting payoff function un solves the Bellman equation
on [0 1], and k∗n (p) achieves the maximum on the right-hand side at each belief p. It is
straightforward to show that if player n plays a best response, his benefit of experimen-
tation b(p un ) is nonnegative at all beliefs, and his payoff function un is nondecreasing
in the other players’ experimentation schedule K¬n . Standard results further imply that
a best-response payoff function un is once continuously differentiable at any point of
continuity of K¬n .
At the boundaries of the unit interval, the obvious optimal actions are k∗n (0) = 0
and k∗n (1) = 1, which implies un (0) = s and un (1) = λ1 h for the player’s payoff function.
More generally, player n’s best response is obtained by comparing the opportunity cost
of playing R with the expected private benefit. If c(p) > b(p un ), then k∗n (p) = 0, and
the Bellman equation implies un (p) = s + K¬n (p)b(p un ) < s + K¬n (p)c(p). If c(p) =
b(p un ), then k∗n (p) is arbitrary in [0 1], and un (p) = s + K¬n (p)c(p). Finally, if c(p) <
b(p un ), then k∗n (p) = 1, and un (p) = s + (K¬n (p) + 1)b(p un ) − c(p) > s + K¬n (p)c(p).
Thus, exactly as in Keller et al. (2005), player n’s best response to a given intensity of
experimentation K¬n depends on whether in the (p u) plane, the point (p un (p)) lies
below, on, or above the line

DK¬n = {(p, u) : u = s + K¬n c(p)}.

For K¬n > 0 this is a downward sloping diagonal that cuts the safe payoff line u = s at the myopic cutoff pm; for K¬n = 0, it coincides with the safe payoff line.
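The best-response rule can be stated as a three-way comparison with this diagonal. The function below is an illustrative sketch (names and parameter values are ours, not the paper's):

```python
s, lam1, lam0, h = 1.0, 2.0, 0.5, 1.0    # assumed values

def c(p): return s - (p * lam1 + (1 - p) * lam0) * h

def best_response(p, u, K_other, tol=1e-9):
    """Optimal k for a player whose payoff at belief p is u, given opponents' intensity K_other.

    Returns 0.0 below the diagonal u = s + K_other * c(p) (free-ride on S),
    1.0 above it (play R exclusively), and None on it (any k in [0, 1] is optimal)."""
    diagonal = s + K_other * c(p)
    if u < diagonal - tol:
        return 0.0
    if u > diagonal + tol:
        return 1.0
    return None
```

At an optimistic belief such as p = 0.9, where c(p) < 0, any payoff above s lies above the diagonal and the best response is to play R; at a pessimistic belief such as p = 0.1, a payoff of s lies below it and the player free-rides.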
The following two observations also carry over verbatim from Keller et al. (2005).
First, no profile of Markov strategies can generate an average payoff that exceeds VN∗ ,
and the payoff of a player using a best response to her opponents’ strategies cannot
fall below V1∗ . The upper bound follows immediately from the fact that the cooperative
solution maximizes the average payoff. The intuition for the lower bound is that an
agent can only benefit from the information generated by others. Second, all Markov
perfect equilibria are inefficient. Along the efficient experimentation path, the benefit
of an additional experiment tends to 1/N of its opportunity cost as p approaches p∗N .
A self-interested player compares the benefit of an additional experiment with the full
opportunity cost and so has an incentive to deviate from the efficient path by using S
instead of R.
It is also easy to see that in any Markov perfect equilibrium, the set of beliefs at which
the intensity of experimentation is positive must be an interval, and that it must contain
the half-open interval ]p∗1 1], where p∗1 is the single-agent cutoff. The interesting ques-
tion is whether experimentation continues below p∗1 , i.e., whether there is an encour-
agement effect. Bolton and Harris (1999) show that the encouragement effect is present
in the symmetric Markov perfect equilibrium of their model. The next result shows that
all MPE of our model exhibit the encouragement effect.
Proof. Suppose to the contrary that all players play S at all beliefs p ≤ p∗1. Then each player's payoff function satisfies un(p∗1) = s with the left-hand derivative un′(p∗1−) = 0. For S to be optimal, we must have b(p∗1, un) ≤ c(p∗1) = b(p∗1, V1∗), and hence un(j(p∗1)) ≤ V1∗(j(p∗1)), which must in fact hold as an equality. Thus, the difference un − V1∗ assumes its minimum (of 0) at j(p∗1), which implies un′(j(p∗1)−) ≤ (V1∗)′(j(p∗1)−). As un(j2(p∗1)) ≥ V1∗(j2(p∗1)), this implies b(j(p∗1), un) ≥ b(j(p∗1), V1∗) and hence b(j(p∗1), un) > c(j(p∗1)). So all players must use R at the belief j(p∗1). By the ODDE for V1∗ and the explicit solution in Proposition 1, we have b(j(p∗1), V1∗) = V1∗(j(p∗1)) − s + c(j(p∗1)) = V1∗(j(p∗1)) − λ(j(p∗1))h > 0. Each player's Bellman equation now yields
The idea behind the proof is that the only way that all experimentation could stop at
p∗1 is for the “jump benefit” to be the same for each of the N players as for a lone agent,
given the same opportunity cost and the same “slide disbenefit”; but this would imply
that un and V1∗ matched in value not only at p∗1 but also at j(p∗1 ). This is not possible,
since at j(p∗1 ) the benefit of a further jump up is no less and the disbenefit of a slide
down is no worse for player n than for a lone agent, and if a lone agent has an incentive to experiment, then so does each of the N players, with the positive externality resulting in a higher value at j(p∗1).
Our second general result on Markov perfect equilibria concerns the nonexistence
of equilibria where all players use cutoff strategies.
Proof. Suppose to the contrary that there is an MPE where all players use a cutoff strategy. For n = 1, …, N, let pn denote the belief at which player n switches from using R exclusively to using S exclusively. Clearly, pn ≤ pm for all n. Without loss of generality, we can assume that p1 ≤ p2 ≤ · · · ≤ pN−1 ≤ pN. Moreover, we must have p1 < pm, since
each player would have an incentive to deviate to the optimal strategy of a single player
otherwise.
Suppose that p1 = p2 . Immediately to the right of this cutoff, both u1 and u2 must
then lie below D1 , so players 1 and 2 playing R are not best responses. This proves that
p1 < p 2 .
Now, u2 must lie below D1 immediately to the left of p2 (as player 2 finds it optimal to
free-ride on one opponent who plays R) and above D1 immediately to the right of p2 (as
player 2 finds it optimal to join in with at least one opponent who plays R), so u2 crosses
D1 at p2 . (In fact, one can iterate this argument to establish that all cutoffs are different,
and that un crosses Dn−1 at pn .)
Since a player’s payoff function is weakly increasing in the intensity of experimenta-
tion provided by the other players, we have u1 ≤ u2 , and so u1 is either below or exactly
on D1 at p2. In the first case, there is an interval ]p2, p2 + ε[ where player 1 (who is assumed to play R) is not responding optimally to the other players' combined intensity of experimentation K¬1 = 1. In the second case, u1 = u2 on [p2, 1] and u1′(p2−) ≥ u2′(p2−), hence b(p2, u1) ≤ b(p2, u2). But then, u2(p2) = s + b(p2, u2) > s + b(p2, u1) − c(p2) = u1(p2), a contradiction.
5. Symmetric equilibrium
Otherwise, either all players play S exclusively and the common payoff is u(p) = s or all
players play R exclusively and the common payoff function u satisfies (1), hence is of the
form VN given in (3).
In the (p u) plane, the region where all players use the risky arm exclusively and the
region where they use both arms simultaneously are separated by the diagonal DN−1 .
Given the post-jump value u(j(p)), we have smooth pasting of the solutions to (1) and
(6) along DN−1 . Smooth pasting also occurs at the boundary of the region where all play-
ers use S exclusively with the region where they use both arms. In other words, u must be
of class C1. To see this, suppose we had a symmetric equilibrium with a payoff function that hits the level s at the belief p̃ with slope u′(p̃+) > 0. Then, at beliefs immediately to the right of p̃, we would have b(p, u) = c(p), or

λ(p)[u(j(p)) − u(p)]/r = c(p) + λp(1 − p)u′(p)/r,

implying

λ(p̃)[u(j(p̃)) − s]/r = c(p̃) + λp̃(1 − p̃)u′(p̃+)/r > c(p̃)

by continuity. Immediately to the left of p̃, continuity of u(j(p)) and the fact that u′(p) = 0 would then imply b(p, u) = λ(p)[u(j(p)) − s]/r > c(p), so there would be an incentive to deviate from S to R.
for p̃N < p < p†N, and k(p) = 1 for p ≥ p†N. The payoff function WN is increasing on [p̃N, 1] and the strategy k is increasing on [p̃N, p†N].
Proof. We first show that there is at most one symmetric MPE. Suppose that we have
two symmetric equilibria with different payoff functions u and û, respectively, both of
which must be of class C1. Let u − û assume a negative global minimum at the belief p, which by necessity must lie in the open unit interval. At this belief, u′(p) = û′(p) and u(j(p)) − û(j(p)) ≥ u(p) − û(p), so b(p, u) ≥ b(p, û). We cannot have both u(p)
and û(p) above DN−1 , since in this region both u and û are of the form (3) and the
difference u − û is increasing to the right of DN−1 . Further, if û(p) is above DN−1
and u(p) is on or below, then b(p û) > c(p) = b(p u) in contradiction to what we de-
rived before. Consequently, we must have both û(p) and u(p) on or below DN−1 , so
b(p û) = c(p) = b(p u). This in turn yields u(j(p)) − û(j(p)) = u(p) − û(p), so the dif-
ference u − û is also at its minimum at the belief j(p). Iterating the argument until we
get to the right of pm (and hence to the right of DN−1 ), we obtain another contradiction,
which proves that u ≥ û. By the same arguments, û − u cannot assume a negative global
minimum either, and so u = û.
Next, we sketch the construction of the symmetric equilibrium; for details, see the
Appendix. Varying the point of intersection with the diagonal DN−1 , one first constructs
a family of candidate value functions that solve the ODDE (1) (N players using R ex-
clusively) above DN−1 and the ODDE (6) (indifference between R and S) below. Using
an intermediate-value argument, we then establish the existence of one such function
that reaches the level s with zero slope as we move down from p = pm to lower beliefs.
This function is easily seen to solve each player’s Bellman equation. Finally, the identity
un (p) = s + K¬n (p)c(p) uniquely determines the common intensity of experimentation
in the range of beliefs where the value function lies below DN−1 but above the level s.
on Ji ∩ {p | WN(p) > s} for some constants C(i−η) (η = 0, …, i − 1), chosen to ensure continuity of WN. The constant C(0) is determined by the requirement that (p†N, WN(p†N)) ∈ DN−1.
The proof (given in the Appendix) shows how the constants C (i) can be calculated
recursively given C (0) .
Figure 2 in Section 6 below depicts the intensity of experimentation in the symmetric
equilibrium with two players (rightmost solid curve). The dotted step function is the
efficient intensity.
The symmetric equilibrium of the Poisson model shares the main features with its
counterpart in the Brownian model of Bolton and Harris (1999). First, because of the in-
centive to free-ride, experimentation stops for good inefficiently early (the lower thresh-
old p̃N is above the cooperative cutoff p∗N ), and the intensity of experimentation is ineffi-
ciently low at any belief between p∗N and p†N . Second, there is the encouragement effect
(p̃N is below the single-agent cutoff p∗1 ). Third, both the incentive to free-ride and the
8 The proof makes it obvious how to modify this result in the knife-edge case where μN = λ0 /λ.
encouragement effect become stronger as the number of players increases.9 Fourth, the
acquisition of information is slowed down so severely near p̃N that the players’ beliefs
cannot reach this threshold in finite time.
Corollary 2. Starting from a prior belief above p̃N , the players’ common posterior belief
never reaches this threshold in the symmetric Markov perfect equilibrium.
This result strongly suggests that asymmetric equilibria where the rate of information acquisition stays bounded away from zero before all experimentation ceases ought to be more efficient than the symmetric equilibrium. The next section confirms this.
6. Asymmetric equilibria
Our construction of asymmetric Markov perfect equilibria rests on two ideas. The first
is to give the players a common continuation value after any success on a risky arm; this
allows us to construct the players’ average payoff function before assigning individual
strategies. The second is to let them alternate between the roles of experimenter and
free-rider before all experimentation stops; this allows us to achieve an overall intensity
of experimentation higher than in the symmetric equilibrium, yielding higher equilib-
rium payoffs.
In fact, for points (p, u) below the diagonal D1−1/N, the common action that keeps all players indifferent between R and S, and gives them u as the common continuation value, k = (u − s)/[(N − 1)c(p)], implies an intensity of experimentation K = Nk < 1.
In contrast, the equilibria we construct have K = 1 over some range of beliefs where the
graph of the average payoff function lies below D1−1/N . We achieve this by partitioning
the range in question into a finite number of intervals, on each of which exactly one
player plays risky.
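The arithmetic of this claim is easy to check numerically. The sketch below takes the diagonal D1−1/N to be the locus u = s + (1 − 1/N)c(p) and the opportunity cost to be c(p) = s − λ(p)h; both are notational assumptions about objects defined earlier in the paper, not restated in this excerpt.

```python
# Numerical check: below the diagonal D_{1-1/N} (taken here to be the locus
# u = s + (1 - 1/N) c(p), an assumption consistent with how the diagonals D_a
# are used later in the text), the indifference allocation
# k = (u - s) / [(N - 1) c(p)] yields an intensity K = N k < 1.

N = 3
s, h, lam0, lam1 = 1.5, 2.0, 0.5, 1.5

def lam(p):            # expected arrival rate of lump sums
    return p * lam1 + (1 - p) * lam0

def c(p):              # assumed opportunity cost of the risky arm: s - lambda(p) h
    return s - lam(p) * h

p = 0.1                               # a belief with c(p) > 0
assert c(p) > 0
u = s + 0.5 * (1 - 1 / N) * c(p)      # a point strictly below D_{1-1/N}
k = (u - s) / ((N - 1) * c(p))
K = N * k
assert K < 1
```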
If the last of these “lone experimenters” stops using the risky arm at the belief p̄,
his value function u satisfies λ(p̄)[u(j(p̄)) − s]/r = c(p̄) as well as the left-derivative condition u′(p̄) = 0.
When all players have a common continuation value after a success on a risky arm,
this equation also holds for the players’ average payoff function ū, and so λ(p̄)[ū(j(p̄)) − s]/r = c(p̄). Varying p̄, we can trace out the locus D̄ of all possible post-jump points (j(p̄), ū(j(p̄))) in the (p, u) plane that satisfy this condition:

D̄ = {(p, u) ∈ [0, 1] × R+ | λ(j −1 (p))[u − s]/r = c(j −1 (p))}.
Using the fact that λ(j −1 (p)) = λ0 λ1 /[pλ0 + (1 − p)λ1 ], it is straightforward to show
that D̄ is a downward sloping straight line through the points (0, s + r[s − λ0 h]/λ0 ) and (j(pm ), s).
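The claim that D̄ is a straight line through these two points can be verified numerically. The sketch below assumes the model's standard primitives, λ(p) = pλ1 + (1 − p)λ0, j(p) = pλ1/λ(p), c(p) = s − λ(p)h, and the myopic cutoff pm solving λ(pm)h = s; these are not restated in this excerpt, so treat them as assumptions.

```python
# Sketch verifying the two claims about the locus D-bar: solving
# lambda(j^{-1}(p)) [u - s] / r = c(j^{-1}(p)) for u gives a function of p
# that is affine and passes through (0, s + r[s - lam0*h]/lam0) and (j(p_m), s).
# ASSUMED primitives: lambda(p) = p*lam1 + (1-p)*lam0, j(p) = p*lam1/lambda(p),
# c(p) = s - lambda(p)*h, and lambda(p_m) h = s.

r, s, h, lam0, lam1 = 1.0, 1.5, 2.0, 0.5, 1.5

lam = lambda p: p * lam1 + (1 - p) * lam0
j = lambda p: p * lam1 / lam(p)

def u_bar(p):
    """Height of D-bar at p, using lambda(j^{-1}(p)) = lam0*lam1/(p*lam0 + (1-p)*lam1)."""
    lam_q = lam0 * lam1 / (p * lam0 + (1 - p) * lam1)
    return s + r * (s - lam_q * h) / lam_q   # r * c(q)/lambda(q) with q = j^{-1}(p)

# affine: second differences over equally spaced points vanish
vals = [u_bar(p) for p in (0.0, 0.25, 0.5)]
assert abs((vals[2] - vals[1]) - (vals[1] - vals[0])) < 1e-12

# endpoints quoted in the text
p_m = (s / h - lam0) / (lam1 - lam0)     # solves lambda(p_m) h = s
assert abs(u_bar(0.0) - (s + r * (s - lam0 * h) / lam0)) < 1e-12
assert abs(u_bar(j(p_m)) - s) < 1e-12
```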
To ensure both a common continuation value after any success and an increase in
the intensity of experimentation relative to the symmetric MPE, we start our construction of the average equilibrium payoff function at some point (p♭ , u♭ ) on the lower envelope D̄ ∧ D1−1/N of the diagonals D̄ and D1−1/N . This lower envelope coincides with D1−1/N if and only if r/λ0 ≥ 1 − 1/N, so D̄ is relevant only for sufficiently high λ0 , that is, for sufficiently small jumps of beliefs after successes. To the right of p♭ , we proceed as in the construction of the symmetric MPE. To the left of p♭ , we solve for the average payoff function when one player out of N is playing risky. Varying p♭ , we then ensure that the average payoff hits the level s at a belief p̲ , where the last experimenter is indeed indifferent between playing risky and playing safe. If the point (p♭ , u♭ ) thus determined lies below D̄ (and hence on D1−1/N ), we have j(p̲) > p♭ ; if this point lies on D̄ , we have j(p̲) = p♭ . In either case, a success at any belief to the right of p̲ makes the belief jump to the right of p♭ , where the equilibrium involves symmetric actions and continuation payoffs that coincide with the average. Between p̲ and p♭ , moreover, the graph of the average payoff function lies below D1−1/N , and so an intensity of experimentation equal to 1 is indeed more than would be compatible with symmetric behavior.
9 As N increases, each player obtains a higher payoff at all beliefs where the risky arm is used some of the time, and p̃N falls. The diagonal DN−1 rotates clockwise, tending to increase p†N , but since the payoff function shifts upward, the overall effect on p†N is ambiguous.
For N = 2, Figure 1 illustrates the payoff functions that can arise in the equilibria we construct and gives the corresponding intensity of experimentation in various regions of the (p, u) plane. The faint straight lines ending in (pm , s) are the diagonals D1 and D1/2 ; the faint straight line ending on D1/2 is the part of D̄ ∧ D1/2 that lies below D1/2 ; the solid kinked line is the myopic payoff. The solid curves are the graphs of the players’ payoff functions. The equilibrium intensity of experimentation varies along the graph of the average equilibrium payoff function. The intensity is 2 when the graph is above D1 , between 1 and 2 when the graph lies between D1/2 and D1 , etc. The intensity of experimentation is continuous in beliefs at p♭ if the graph crosses D̄ ∧ D1/2 on D1/2 , as in the figure. If the graph crosses D̄ ∧ D1/2 below D1/2 , the intensity jumps at the belief p♭ .
Figure 2 compares the intensity of experimentation with that in the symmetric two-
player equilibrium. The dotted step function is the efficient intensity; the rightmost
solid curve is the intensity in the symmetric MPE.
For arbitrary N, we have the following result.
Proof. We just sketch the construction of the equilibrium here; details can be found in
the Appendix. First, we construct the players’ average payoff function ū in the purported
equilibria, using an approach similar to the proof of Proposition 4. This function is increasing on [p̲N , 1]. Its graph crosses D̄ ∧ D1−1/N at p♭N and DN−1 at p‡N . It has a kink at p♭N with ū′(p♭N −) > ū′(p♭N +) if and only if the intersection with D̄ ∧ D1−1/N is below D1−1/N . It satisfies ū(p) = s + b(p, ū) − c(p)/N between p̲N and p♭N , solves the indifference ODDE (6) between p♭N and p‡N , and is of the form (3) above p‡N . The average jump benefit λ(p̲N )[ū(j(p̲N )) − s]/r exactly equals the opportunity cost c(p̲N ). As all players’ payoff functions have a zero left-hand derivative at p̲N and a common value of ū(j(p̲N )) at j(p̲N ), each player will, therefore, be indifferent between R and S at p̲N .
Second, we construct the players’ payoff functions and strategies. To this end, we split the interval ]p̲N , p♭N ] into finitely many subintervals ]pi , pri ] and in turn partition each of them into N intervals I1,i , . . . , IN,i . We let player n use R on all intervals In,i and use S on ]p̲N , p♭N ] \ ⋃i In,i . Using intermediate-value arguments, we can choose the intervals I1,i , . . . , IN,i such that each player’s payoff function coincides with ū at the boundaries of each subinterval ]pi , pri ]. By increasing the number and reducing the size of these subintervals, moreover, we can ensure that the vertical distance |un − ū| remains below some given real number δ > 0 for all n.
Third, we verify that for sufficiently small δ, that is, for sufficiently frequent alternation between the roles of free-rider and experimenter on ]p̲N , p♭N ], the strategies we have constructed are mutually best responses.
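The splitting step can be sketched in code. The equal-length pieces below are a simplification: the proof instead chooses the cut points by intermediate-value arguments so that each payoff function matches ū at the subinterval boundaries.

```python
# Illustrative sketch of the alternation scheme in the proof: split an
# interval of beliefs into subintervals, cut each subinterval into N pieces,
# and let exactly one player play R on each piece. Equal-length pieces are a
# simplification of the intermediate-value construction in the appendix.

def role_schedule(p_lo, p_hi, n_sub, N):
    """Return a list of (left, right, experimenter) triples covering ]p_lo, p_hi]."""
    schedule = []
    sub = (p_hi - p_lo) / n_sub
    for i in range(n_sub):
        lo = p_lo + i * sub
        piece = sub / N
        for n in range(N):
            schedule.append((lo + n * piece, lo + (n + 1) * piece, n))
    return schedule

sched = role_schedule(0.2, 0.5, n_sub=4, N=3)
assert len(sched) == 12
# exactly one experimenter on each piece, and every player takes a turn
assert {exp for (_, _, exp) in sched} == {0, 1, 2}
# the pieces tile the interval without gaps
assert abs(sched[0][0] - 0.2) < 1e-12 and abs(sched[-1][1] - 0.5) < 1e-12
```

Shrinking the subintervals increases the frequency of role swaps, which is exactly what drives |un − ū| below the bound δ in the proof.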
As to the comparison of the average payoff function ū with that of the symmetric equilibrium, WN , suppose that ū − WN assumes a negative global minimum at the belief p in the open unit interval. Note that ū must be differentiable there, since a kink with ū′(p♭N −) > ū′(p♭N +) is incompatible with even a local minimum of ū − WN at p♭N . At the belief p, therefore, WN′ (p) = ū′(p) and ū(j(p)) − WN (j(p)) ≥ ū(p) − WN (p), so b(p, ū) ≥ b(p, WN ). We cannot have both WN (p) and ū(p) above D1 , since in this region both WN and ū are of the form (3) and so the difference ū − WN is increasing. Further, if WN (p) is above D1 and ū(p) is on or below that diagonal, then b(p, WN ) > c(p) ≥ b(p, ū), in contradiction to what we derived before (to the left of p♭N , ū(p) = s + b(p, ū) − c(p)/N < s + (1 − 1/N)c(p) and hence b(p, ū) < c(p)). Consequently, we must have both WN (p) and ū(p) on or below D1 , which translates into b(p, WN ) = c(p) ≥ b(p, ū) and hence, by what we saw above, b(p, WN ) = b(p, ū). This in turn yields ū(j(p)) − WN (j(p)) = ū(p) − WN (p), so the difference ū − WN is also at its minimum at the belief j(p). Iterating the argument until we get to the right of pm (and hence to the right of D1 ), we again obtain a contradiction. This establishes ū ≥ WN . Now, if we had ū(p) = WN (p) at some p ∈ ]p̲N , 1[, this would again imply b(p, ū) = b(p, WN ) at a global minimum of ū − WN and, by the iterative argument just given, lead to another contradiction. This proves that ū > WN on ]p̲N , 1[. In particular, p̲N ≤ p̃N , the belief at which all experimentation stops in the symmetric MPE. The equality p̲N = p̃N would entail λ(p̲N )[ū(j(p̲N )) − s]/r = c(p̲N ) = λ(p̲N )[WN (j(p̲N )) − s]/r and hence ū(j(p̲N )) = WN (j(p̲N )), which we have already shown to be impossible.
The gain in average payoffs relative to the symmetric equilibrium stems from the fact
that, owing to the alternation between the roles of lone experimenter and free-rider, the
intensity of experimentation is bounded away from zero immediately above the belief
where all experimentation stops. In the symmetric equilibrium, a player who deviates
to the safe action slows down the gradual slide of beliefs toward more pessimism; as
the opponents’ strategies are increasing functions of the level of optimism, the devia-
tion causes them to experiment more than they would on the equilibrium path. When
players use beliefs to coordinate their alternation between experimentation and free-
riding, by contrast, a deviation from the risky to the safe action freezes the belief in its
current state and delays the time at which another player takes over the burden of exper-
imentation. Deviations are thus more attractive under symmetric strategies than under
alternation. This explains why the equilibrium intensity under the latter can be higher.
For beliefs above p♭N , the players’ common payoff function permits an explicit representation of the form given in Corollary 1. For beliefs between p̲N and p♭N , we have the following result.
Corollary 3. For p ∈ ]p̲N , p♭N ], let ι be the smallest integer such that j ι+1 (p) ≥ p‡N , i.e., ι + 1 consecutive successes would result in all the players playing R exclusively. Then, with kn = 1 for an experimenter and kn = 0 for a free-rider, the payoff functions are

$$\begin{aligned}
u_n(p) ={}& \lambda(p)h + (\iota + k_n - 1)\left[\frac{r}{r+\lambda_1}(\lambda_1 h - s)\,p - \frac{r}{r+\lambda_0}(s - \lambda_0 h)(1 - p)\right]\\
&+ C^{(0)}\left[\frac{\lambda_0(\lambda_0/\lambda_1)^{\mu_N}}{\lambda_0 - \mu_N\lambda}\right]^{\iota}\frac{\lambda_0(\lambda_0/\lambda_1)^{\mu_N}}{r + \lambda_0 - \mu_N\lambda}\,(1 - p)\,\Omega(p)^{\mu_N}\\
&+ \frac{1}{r}\sum_{\eta=0}^{\iota-1}\frac{C^{(\iota-\eta)}\,\lambda_0(\lambda_0/\lambda_1)^{\lambda_0/\lambda}}{\eta!}\left[-\frac{\lambda_0(\lambda_0/\lambda_1)^{\lambda_0/\lambda}}{\lambda}\right]^{\eta}\sum_{\gamma=0}^{\eta}\left(\frac{\lambda}{r}\right)^{\eta-\gamma}\frac{\eta!}{\gamma!}\,\ln\!\left[(\lambda_0/\lambda_1)^{\gamma}\,\Omega(p)\right](1 - p)\,\Omega(p)^{\lambda_0/\lambda}.
\end{aligned}$$
With sufficiently frequent turns between the roles of experimenter and free-rider,
the players’ payoff functions in the equilibria of Proposition 5 become arbitrarily close
to the average payoff function. This leads to a Pareto improvement over the symmetric
equilibrium.
Proposition 6 (Pareto improvement over the symmetric MPE). The N-player experimentation game admits Markov perfect equilibria as in Proposition 5 in which each player’s payoff exceeds the symmetric equilibrium payoff on ]p̲N , 1[.
Proof. Let δ = (1/2) min{ū(p) − WN (p) : p̃N ≤ p ≤ p♭N }, where ū is the average payoff function associated with the equilibria of Proposition 5 and WN is the players’ common payoff function in the symmetric equilibrium. Choose the subintervals ]pi , pri ] such that |un − ū| is bounded above by δ for all n. Then un > s = WN on ]p̲N , p̃N ], un ≥ ū − δ > WN on ]p̃N , p♭N ], and un = ū > WN on ]p♭N , 1].
It deserves to be stressed that it is the encouragement effect that permits Pareto im-
provements over the symmetric equilibrium. Without it, the last experimenter quits at
the same belief (the single-agent cutoff ) at which all players stop experimenting in the
symmetric equilibrium; bearing all the costs of experimentation on his own, the last ex-
perimenter is then necessarily worse off than under symmetry immediately to the right
of this cutoff.
There clearly is scope for further improvements in players’ equilibrium payoffs, over
and above those embodied in the equilibria of Proposition 5. As we move down from the
diagonal DN−2+1/N to DN−2 , the intensity of experimentation in these equilibria gradually falls from N − 1 to N(N − 2)/(N − 1). Using exactly the same approach as on the interval ]p̲N , p♭N ] above, we could instead let players take turns between the roles of experimenter and free-rider such that the intensity of experimentation remains constant at level N − 1 in between these diagonals.
This raises the question as to whether there exists a best Markov perfect equilibrium,
in a sense to be made precise. To shed some light on this question, we focus on the two-
player case from now on and take the corresponding results for the exponential model
as our starting point.
First, the exponential model admits uniformly best and worst two-player equilibria
that achieve the maximal (resp. minimal) intensity of experimentation and the maximal
(resp. minimal) average payoff compatible with Markov perfection and finite switch-
ing, irrespective of the players’ initial belief. The worst equilibrium is the symmetric
MPE, whereas the best equilibria have the same structure as the two-player equilibria of
Proposition 5, with the intensity of experimentation being the limit as λ0 tends to 0 of
the intensity described in that proposition.10
Second, the exponential model admits most inequitable equilibria that give the two
players extremal individual payoffs. These equilibria involve simple strategies, so that
each player uses one arm exclusively at any given belief.11 Here there are three thresh-
olds p̄ = p∗1 < p̂s < p̂2 such that both players play risky above p̂2 , one of the players
plays risky between p̂s and p̂2 , the other player plays risky between p̄ and p̂s , and both
play safe below p̄. The payoffs achieved by the last experimenter and the last free-rider
in these equilibria constitute a uniform lower and upper bound, respectively, on the
individual equilibrium payoffs that are compatible with Markov perfection and finite
switching.
Our next aim is to investigate whether these most inequitable equilibria can also be
obtained as the limit of two-player equilibria of the Poisson model for vanishing λ0 > 0.
To this end, we first demonstrate the existence of a benchmark simple MPE that gives
the players common continuation payoffs after any success on a risky arm. In a nu-
merical example, we then progressively diverge from this benchmark by rewarding the
last experimenter, that is, by giving him a higher continuation value after a success than
his opponent. This yields equilibria with progressively increasing payoff asymmetry be-
tween the players. It also suggests that we cannot expect a uniformly best Markov perfect
equilibrium to exist in the Poisson model—not even in the two-player case.
equilibrium of the exponential model for N > 2, existence and structure of a best MPE are open questions
in this case.
11 See Proposition 6.1 of Keller et al. (2005).
them. More precisely, as any two-player MPE must have an average payoff function in
between the single-agent optimum V1∗ and the cooperative solution V2∗ , the infimum
of the set {p | K(p) > 0} must be at least p∗2 . If j(p∗2 ) ≥ pm , a success on any risky arm
will make players optimistic enough for all of them to revert to exclusive use of the risky
arm, and the players’ post-jump equilibrium payoffs as well as their average will be of
the form V2 (j(p)) with V2 as given in (3).
A necessary and sufficient condition for j(p∗2 ) ≥ pm is that μ2 /(μ2 + 1) ≥ λ0 /λ1 or, equivalently, that μ2 ≥ λ0 /λ. Using (2) with N = 2, this holds if and only if

$$\lambda_0 \left(\frac{\lambda_0}{\lambda_1}\right)^{\lambda_0/\lambda} \le \frac{r}{2}. \tag{7}$$

Clearly, since λ0 /λ1 < 1 and λ0 /λ > 0, a sufficient condition for (7) is that λ0 ≤ r/2.
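Since equation (2) is not reproduced in this excerpt, the following sketch assumes a defining equation for μN (chosen to be consistent with the knife-edge case μN = λ0/λ mentioned in footnote 8) and checks numerically that μ2 ≥ λ0/λ coincides with condition (7) under that assumption.

```python
# Numerical sketch of the equivalence behind condition (7). Equation (2) is
# not reproduced in this excerpt; here we ASSUME mu_N solves
#     N * [mu*lam + lam0*((lam0/lam1)**mu - 1)] = r,   with lam = lam1 - lam0,
# a form consistent with the knife-edge case mu_N = lam0/lam of footnote 8.
# Under that assumption, mu_2 >= lam0/lam holds exactly when
# lam0 * (lam0/lam1)**(lam0/lam) <= r/2, which is condition (7).

def mu_N(r, lam0, lam1, N, lo=0.0, hi=50.0, tol=1e-12):
    """Bisection for the positive root of the assumed defining equation."""
    lam = lam1 - lam0
    f = lambda m: N * (m * lam + lam0 * ((lam0 / lam1) ** m - 1.0)) - r
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

for r, lam0, lam1 in [(1.0, 0.5, 1.5), (1.0, 0.9, 1.0), (0.2, 0.5, 1.5)]:
    lam = lam1 - lam0
    mu2 = mu_N(r, lam0, lam1, N=2)
    cond7 = lam0 * (lam0 / lam1) ** (lam0 / lam) <= r / 2
    assert (mu2 >= lam0 / lam - 1e-9) == cond7
```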
When (7) holds, the same approach as in the previous section allows us to construct
simple equilibria with symmetric post-jump continuation values. Figure 3 shows the
best response correspondence for N = 2 and illustrates the simplest possible configura-
tion of payoff functions that can arise in the type of equilibrium we construct.
Proposition 7 (Simple MPE for N = 2). Under condition (7), the two-player experimentation game admits simple Markov perfect equilibria with the following features. There are two thresholds, p̄ and p̂, with p∗2 < p̄ < p̂ < pm , such that on ]p̂, 1], both players play R and their payoff functions coincide; on ]p̄, p̂], the intensity of experimentation equals 1, and there is at least one belief in the interior of this interval where both players change action; on [0, p̄], they both play S.
We cannot rule out the possibility that the payoff of the last free-rider is nonmonotonic immediately to the right of a switch point (a belief where both players change action).12 However, Figure 3 illustrates an equilibrium that does have a monotonic payoff function for the last free-rider (the higher of the two payoff functions). This case arises, for example, with the parameter values r = 1, s = 1.5, h = 2, λ0 = 0.5, and λ1 = 1.5. With these values, the lower threshold p̄ is smaller than the belief p̃2 at which all experimentation stops in the symmetric equilibrium, and the average payoff function is greater than the common payoff function in the symmetric equilibrium; this improvement stems again from the fact that the intensity of experimentation remains constant at the level 1 just below D1/2 , whereas the symmetric equilibrium intensity falls below 1 as soon as D1/2 is crossed.13
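A quick sanity check of these parameter values, again assuming the opportunity cost c(p) = s − λ(p)h and the myopic cutoff pm solving λ(pm)h = s (both notational assumptions, as they are not restated in this excerpt):

```python
# Sanity check of the numerical example above (r = 1, s = 1.5, h = 2,
# lam0 = 0.5, lam1 = 1.5), under the assumed primitives c(p) = s - lambda(p) h
# and lambda(p_m) h = s.

r, s, h, lam0, lam1 = 1.0, 1.5, 2.0, 0.5, 1.5
lam = lam1 - lam0

# the safe flow lies strictly between the bad and good risky flows,
# so the experimentation problem is non-trivial
assert lam0 * h < s < lam1 * h

# condition (7) holds for these parameters (lam0 <= r/2 is already sufficient)
assert lam0 * (lam0 / lam1) ** (lam0 / lam) <= r / 2

# myopic cutoff
p_m = (s / h - lam0) / lam
assert abs(p_m - 0.25) < 1e-12
```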
Using condition (7), we can give the following explicit representations for the two
players’ payoff functions in the equilibria of Proposition 7.
Corollary 4. On ]p̂, 1], where both players experiment, their common payoff function is

$$u(p) = \lambda(p)h + C^{(0)}(1 - p)\,\Omega(p)^{\mu_2},$$

the constant C (0) being given by
On ]p̄, p̂], where one player experiments and the other free-rides, and with kn = 1 for an experimenter and kn = 0 for a free-rider, the payoff functions are

$$u_n(p) = \lambda(p)h + (k_n - 1)\left[\frac{r}{r+\lambda_1}(\lambda_1 h - s)\,p - \frac{r}{r+\lambda_0}(s - \lambda_0 h)(1 - p)\right] + C^{(0)}\,\frac{\lambda_0(\lambda_0/\lambda_1)^{\mu_2}}{r + \lambda_0 - \mu_2\lambda}\,(1 - p)\,\Omega(p)^{\mu_2} + C_n^{(1)}(1 - p)\,\Omega(p)^{(r+\lambda_0)/\lambda},$$

with appropriately chosen constants of integration Cn(1) .
Maintaining assumption (7), suppose now that we give the last experimenter a
higher payoff at j(p̄) than the other player. This has two effects. On the one hand, we
can no longer achieve the maximal intensity of K = 2 immediately to the right of the
belief at which the graph of the average payoff function crosses D1 , since there the last
free-rider’s payoff function is necessarily below D1 , implying that his best response is to
play safe; this lowers average payoffs. On the other hand, the last experimenter is now
willing to continue playing R somewhat to the left of p̄, which increases average payoffs.
We explore this trade-off numerically.
We refer to the last experimenter as player 1 and the last free-rider as player 2. We
continue to let p̄ denote the belief where all experimentation stops, and let ps denote
the switch point where both players change actions; however, as their payoff functions
cross D1 at different points, we let p̂1 and p̂2 denote the corresponding beliefs.
12 For more on this nonmonotonicity, see the remarks about anticipation near the end of this section.
13 However, it is not the case that each player is individually better off than in the symmetric MPE; in fact,
the last experimenter (the player with the lower of the two payoff functions) is worse off in a neighborhood
of the switch point. Subsequently, we present simple equilibria for the above parameter values that are
better than the symmetric one for both players.
Construction of equilibria
The strategies for the players are the following: on ]p̂2 , 1], both players play R; on ]p̂1 , p̂2 ], player 1 plays R and player 2 plays S; on ]ps , p̂1 ], player 1 plays S and player 2 plays R; on ]p̄, ps ], player 1 plays R and player 2 plays S; on [0, p̄], both players play S.14
We need to determine p̄ < ps < p̂1 < p̂2 , and to build continuous functions u1 and u2 that (a) connect the points (0, s) and (1, λ1 h) in the (p, u) plane, that (b) satisfy the appropriate ODDEs, and that (c) have the following properties: u1 is above D1 on ]p̂1 , 1], below D1 but above s on ]p̄, p̂1 ], at s on [0, p̄], and is smooth at p̄; u2 is above D1 on ]p̂2 , 1], below D1 but above s on ]p̄, p̂2 ], and at s on [0, p̄].
Relative to the upper threshold p̂ and the average payoff function ū from Proposition 7, choose a point (p̂2 , ǔ) in [0, 1] × [s, λ1 h] with p̂2 to the right of p̂ and ǔ above ū(p̂2 ). First, we construct player 1’s payoff function piecewise.
On ]p̂2 , 1], both players play R, so u1 is of the form given in equation (3) with N = 2, the constant of integration being chosen so that u1 (p̂2 ) = ǔ. The belief p̄ where player 1 quits can now be determined from equation (1) with N = 1, using value matching (u1 (p̄) = s) and smooth pasting (u1′(p̄) = 0), and knowing the form of u1 (j(p̄)) since j(p̄) > p̂2 . Just to the left of p̂2 , player 1 is the only one playing R, so u1 is of the form given in Corollary 4, the constant of integration being chosen to ensure continuity at p̂2 ; p̂1 is the belief to the left of p̂2 where u1 crosses D1 . On an interval to the left of that, player 1 is free-riding, u1 is again of the form given in Corollary 4, and the constant of integration is chosen to ensure continuity at p̂1 . Further, on an interval to the right of p̄, player 1 is again the lone experimenter and now the constant of integration is chosen to ensure that u1 (p̄) = s. Player 1’s switch point is where the graph of u1 coming down and to the left from (p̂1 , u1 (p̂1 )) intersects the curve going up and to the right from (p̄, s).
Player 2’s payoff function is also constructed piecewise. On ]p̂2 , 1], u2 is also of the form given in equation (3) with N = 2, the constant of integration being chosen so that u2 (p̂2 ) is on D1 . Between p̄ and p̂2 , u2 is of the form given in Corollary 4: player 2 is free-riding on the interval ]p̂1 , p̂2 ] and we ensure continuity at p̂2 ; on an interval to the left of that, player 2 is the lone experimenter and we ensure continuity at p̂1 ; on an interval to the right of p̄, player 2 free-rides and the constant of integration is chosen to ensure that u2 (p̄) = s. Player 2’s switch point is where the graph of u2 coming down and to the left from (p̂1 , u2 (p̂1 )) intersects the curve going up and to the right from (p̄, s).
For this to be an equilibrium, we need to have the players switching at the same belief: this involves adjusting (p̂2 , ǔ) and iterating until the switch points coincide.
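The matching step is a one-dimensional root-finding problem. The sketch below uses toy stand-ins for the two switch-point maps, only to make the iteration concrete; in the actual construction these maps come from solving the piecewise ODDEs described above.

```python
# Schematic of the final matching step: adjust a free parameter (a scalar
# standing in for the choice of (p_hat_2, u_check)) until the two players'
# switch points coincide. The functions switch1 and switch2 are TOY
# STAND-INS, chosen only to make the bisection runnable; the real maps are
# delivered by the piecewise ODDE construction.

def switch1(theta):   # hypothetical: player 1's switch point, decreasing in theta
    return 0.30 - 0.10 * theta

def switch2(theta):   # hypothetical: player 2's switch point, increasing in theta
    return 0.20 + 0.15 * theta

def match_switch_points(lo=0.0, hi=1.0, tol=1e-10):
    """Bisect on theta until switch1(theta) = switch2(theta)."""
    g = lambda t: switch1(t) - switch2(t)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if g(lo) * g(mid) <= 0 else (mid, hi)
    return 0.5 * (lo + hi)

theta = match_switch_points()
assert abs(switch1(theta) - switch2(theta)) < 1e-8
```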
14 Note that this strategy pair requires the players to swap roles once more than in the most inequitable MPE of the exponential model. There, the last experimenter plays risky on the largest possible interval of beliefs ]p̄, p̂s ], which makes his payoff lower than his opponent’s at all beliefs in ]p̄, 1[. This is impossible here. In fact, optimal behavior of player 1 requires c(p̄) = b(p̄, u1 ) = λ(p̄)[u1 (j(p̄)) − s]/r as well as the left derivative u1′(p̄) = 0. If player 2’s payoff were higher than player 1’s at j(p̄), we would have b(p̄, u2 ) = λ(p̄)[u2 (j(p̄)) − s]/r > c(p̄), and player 2 would act suboptimally on [p̄ − ε, p̄] for some ε > 0. To reward player 1, therefore, we need to shorten the interval ]p̄, ps ] on which he acts as the last experimenter and let him free-ride to the right of it. This in turn requires a further switch in actions at the belief p̂1 where player 1’s payoff function crosses D1 .
Findings
Using the same parameter values we referred to in the discussion of Figure 3 after Proposition 7 (namely, r = 1, s = 1.5, h = 2, λ0 = 0.5, λ1 = 1.5), we numerically solved for six equilibria as well as the simple one with common payoffs above D1 (the base case), giving the last experimenter progressively higher payoffs above D1 . Figure 4 illustrates the
players’ payoffs in the base case and in three of these equilibria, the tick labels on the belief axis being for the base case (which exhibits the lowest equilibrium payoffs for the last experimenter and highest for the last free-rider).
Figure 4. Equilibrium payoffs of the last experimenter (upper panel) and the last free-rider (lower panel). Open circles mark points corresponding to where the other player’s value function crosses D1 .
We find that as we improve player 1’s post-jump payoff, the fall in p̄ is about 5 times
smaller than the shifts in p̂1 and p̂2 , with p̂1 moving to the left and p̂2 moving to the right.
(The drop in the switch point is more dramatic, being about 25 times that of p̄.) The net
effect is that the interval of beliefs where exactly one player is experimenting widens,
and although the average payoff is higher on an interval to the right of p̄, it dips below
that of the base case very close to p̂1 and remains there at all higher beliefs. Player 1 is
progressively better off than in the base case at all beliefs to the right of p̄, but player 2 is
progressively worse off at beliefs greater than approximately ps , and the absolute differ-
ences between the average payoff in the base case and those in the other six equilibria
become more pronounced as the asymmetry increases. Indeed, in the two most asym-
metric of the equilibria that we calculated, player 2’s payoff function is below the payoff
function in the symmetric equilibrium in a neighborhood of p̂1 , whereas the payoffs in
the other four intermediate equilibria are Pareto improvements on the symmetric equi-
librium. Moreover, in the most asymmetric of these equilibria, player 2’s payoff function
is decreasing in an interval immediately to the right of the switch point (although this is
hard to discern visually).
Put another way, in this very asymmetric equilibrium, as the players approach the
switch point from the right, where player 2 is experimenting alone, beliefs are becoming
more pessimistic, yet player 2’s payoff is going up. If we put this down to the fact that,
conditional on no impending success, player 2 will soon be able to enjoy a free ride, then
we can call this an anticipation effect.
More broadly, these findings suggest that by simultaneously letting ps tend to p̄ and
λ0 tend to 0, it is possible to construct a sequence of equilibria that converge to the most
inequitable two-player equilibrium of the exponential model. Moreover, as the idea of
rewarding the last experimenter can also be applied to the equilibria of Proposition 5,
they do not constitute uniformly best equilibria of the Poisson model—not even when
N = 2. By the same token, one is led to conjecture that in marked contrast to the expo-
nential model, the Poisson model does not admit any uniformly best MPE.
8. Concluding remarks
The asymmetric equilibria that we constructed in the Poisson framework raise the ques-
tion of whether similar equilibria exist in the Brownian model of Bolton and Harris
(1999). The elementary constructive method that we used here is likely to apply to the
Brownian case as well. Our proof of the result that there exist no equilibria in cutoff
strategies should also carry over. We intend to explore this in future work.
Our model can easily be adapted to situations where an event is bad news: a break-
down rather than a breakthrough. For example, we can interpret s as the expected flow
cost of keeping the current safe machine running. Players have access to new risky ma-
chines that break down and thereby cause lump-sum costs at exponentially distributed
times: a high failure rate of λ1 favors the old machine; a low rate of λ0 < λ1 favors the
new one. The aim is to minimize the expected sum of discounted costs of breakdowns.
The longer the machines do not fail, the more optimistic the players become about their
reliability, but whenever one does fail, the belief jumps to more pessimistic levels. Given
enough pessimism, another failure will be the last straw. Thus, the continuous part of
the belief dynamics always keeps the state variable in the continuation region where
at least some player uses the new machine, whereas the discontinuous part can cause
the state variable to jump into the stopping region. Despite the superficial symmetry
between the good-news and the bad-news versions of the model, therefore, the formal
analysis of the single-agent optimum, the efficient benchmark, and best responses is
rather different from that in the present paper. We defer such analysis to a separate
paper.
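The belief updating described here can be sketched directly. The discrete-time Bayes step below is a standard approximation stated as an assumption, and the specific rates are illustrative.

```python
# Sketch of the bad-news belief dynamics described above: failures
# (breakdowns) make the players more pessimistic about a new machine, while
# failure-free spells make them more optimistic. Here p is the probability
# that the new machine is good, i.e. has the LOW failure rate lam0 < lam1.
# The updates follow Bayes' rule for a Poisson observation over a short
# interval dt (a standard discretization, assumed rather than quoted).

lam0, lam1 = 0.5, 1.5   # illustrative failure rates

def update(p, failed, dt=0.01):
    """One Bayes step: posterior probability that the machine is good (rate lam0)."""
    if failed:
        # likelihoods of a breakdown are proportional to the failure rates
        return p * lam0 / (p * lam0 + (1 - p) * lam1)
    like0, like1 = 1 - lam0 * dt, 1 - lam1 * dt   # no failure in [t, t+dt)
    return p * like0 / (p * like0 + (1 - p) * like1)

p = 0.5
p_after_failure = update(p, failed=True)
p_after_quiet = update(p, failed=False)
assert p_after_failure < p < p_after_quiet   # failures are bad news, quiet spells good news
```

This is the mirror image of the good-news model in the text: the continuous part of the dynamics now moves the belief up, while the jumps move it down.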
By constructing simple asymmetric equilibria, our work also prepares the ground
for an analysis of strategic experimentation by asymmetric players who might differ, for
example, with respect to their innate abilities to achieve breakthroughs, the average size
of lump-sum payoffs, or their safe payoffs. This is again left to future work.
Appendix
Proof of Proposition 4 (Details). Let p̂N,N−1 denote the belief where the graph of VN∗ cuts DN−1 and let p̂1,N−1 denote the belief where the graph of V1∗ cuts DN−1 . By continuity, there is an open interval I ⊃ [p̂N,N−1 , p̂1,N−1 ] such that for all p̂ ∈ I, the unique solution to (1) that crosses DN−1 at the belief p̂ has positive slope there.
Fix a belief p̂ ∈ I and let (p̂, û) be the corresponding point on the diagonal DN−1 . On [p̂, 1], we define u(0) as the unique solution to (1) that assumes the value û at belief p̂. Now consider the ordinary differential equation (ODE)
Standard results imply that this ODE has a unique solution u(1) on [j −1 (p̂), p̂] with u(1) (p̂) = u(0) (p̂) and, by construction, (u(1) )′ (p̂) = (u(0) )′ (p̂).
Iterating this step, we construct functions u(i+1) defined on [j −(i+1) (p̂), j −i (p̂)] for i = 1, 2, 3, . . . by choosing u(i+1) as the unique solution of the ODE
subject to the condition u(i+1) (j −i (p̂)) = u(i) (j −i (p̂)). Setting up̂ (p) = u(i) (p) whenever j −(i+1) (p̂) ≤ p < j −i (p̂), we thus obtain a function up̂ of class C 1 on ]0, 1] that solves (6) to the left of p̂ and solves (1) to the right of p̂. Standard results imply that up̂ depends in a continuous fashion on p̂. In particular, M(p̂), the minimum of up̂ on [p∗N , pm ], is continuous in p̂.
For p̂ ∈ I with p̂ < p̂N,N−1 , the function up̂ lies above VN∗ on at least [p̂, 1[. If up̂ and VN∗ assumed the same value at some belief p ∈ [p∗N , p̂[, then the restriction of up̂ − VN∗ to [p, 1] would have a positive global maximum at some belief pr ∈ ]p, 1[. In fact, we would have pr ∈ ]p, p̂[, since up̂ − VN∗ , being the difference of two functions of the form (3), has a negative first derivative on [p̂, 1[. As (up̂ )′ (pr ) = (VN∗ )′ (pr ) and up̂ (j(pr )) − VN∗ (j(pr )) ≤ up̂ (pr ) − VN∗ (pr ), we would thus have b(pr , VN∗ ) ≥ b(pr , up̂ ) = c(pr ), hence VN∗ (pr ) = s + Nb(pr , VN∗ ) − c(pr ) ≥ s + (N − 1)c(pr ), which is inconsistent with the fact that VN∗ is below DN−1 at pr . Consequently, up̂ lies above VN∗ on [p∗N , 1[.
By continuity, ûN , the function up̂ obtained for p̂ = p̂N,N−1 , lies weakly above VN∗ on [p∗N , 1]. While the two functions are identical on [p̂N,N−1 , 1] by construction, they cannot be identical on the whole of [p∗N , p̂N,N−1 [ as VN∗ does not solve (A.1) immediately to the left of p̂N,N−1 , for example. Arguing exactly as in the previous paragraph, we see that the restriction of ûN − VN∗ to [p∗N , 1] must assume its positive global maximum at p∗N . This establishes ûN (p∗N ) > VN∗ (p∗N ) = s. As VN∗ (p) > s for p > p∗N , we thus have ûN > s on [p∗N , 1], hence M(p̂N,N−1 ) > s.
For p̂ ∈ I with p̂ > p̂1,N−1 , the function up̂ lies below V1∗ in a neighborhood of p̂. If up̂ and V1∗ assumed the same value at some belief p ∈ [p∗1 , p̂[, then the restriction of V1∗ − up̂ to [p, 1] would have a positive global maximum at a belief pr ∈ ]p, 1[. As (V1∗ )′ (pr ) = (up̂ )′ (pr ) and V1∗ (j(pr )) − up̂ (j(pr )) ≤ V1∗ (pr ) − up̂ (pr ), we would thus have b(pr , up̂ ) ≥ b(pr , V1∗ ). As s < V1∗ (pr ) = s + b(pr , V1∗ ) − c(pr ), this would imply b(pr , up̂ ) > c(pr ) and pr > p̂. But then up̂ (pr ) = s + Nb(pr , up̂ ) − c(pr ) > s + b(pr , V1∗ ) − c(pr ) = V1∗ (pr ), which is a contradiction. Consequently, up̂ lies below V1∗ on [p∗1 , p̂].
By continuity, û1N , the function up̂ obtained for p̂ = p̂1,N−1 , lies weakly below V1∗ on [p∗1 , p̂1,N−1 ]. While the two functions are identical at p̂1,N−1 by construction, they cannot be identical on the whole of [p∗1 , p̂1,N−1 [. Arguing exactly as in the previous paragraph, we see that the restriction of V1∗ − û1N to [p∗1 , 1] must assume its positive global maximum at p∗1 . In particular, û1N (p∗1 ) < V1∗ (p∗1 ) = s, hence M(p̂1,N−1 ) < s.
So there exists a p†N ∈ ]p̂N,N−1 , p̂1,N−1 [ such that M(p†N ) = s. With u† denoting the solution up̂ corresponding to p̂ = p†N , let p̃N be the highest belief in [p∗N , pm ] at which u† assumes the value s. By construction, p̃N < p†N < pm . Define the function WN by WN (p) = s on [0, p̃N ] and by WN (p) = u† (p) > s on ]p̃N , 1]. This is the common payoff function when all players use the strategy k described in the proposition. As a consequence, WN ≤ VN∗ and, in particular, p̃N ≥ p∗N .
If we had p̃N = p∗N , then WN (p∗N ) = s = VN∗ (p∗N ), WN (j(p∗N )) ≤ VN∗ (j(p∗N )), and WN′ (p∗N −) = 0 = (VN∗ )′ (p∗N ), implying b(p∗N , VN∗ ) ≥ b(p∗N , WN ). As b(p∗N , VN∗ ) = c(p∗N )/N, b(p∗N , WN ) = c(p∗N ), and c(p∗N ) > 0, this is a contradiction. So we have p∗N < p̃N < pm , hence WN′ (p̃N +) = (u† )′ (p̃N ) = 0 because the minimum of u† on [p∗N , pm ] is achieved at an interior point. Thus, the function WN is of class C 1 .
It is straightforward to check from the explicit representation of WN above DN−1
that this function is convex and increasing on [p†N, 1]. Suppose WN is not increasing
on [p̃N, p†N]. Then it must assume both a local maximum and a local minimum in the
interior of that interval, and there exist beliefs p′ < p′′ in ]p̃N, p†N[ such that WN′(p′) =
WN′(p′′) = 0, WN(p′) ≥ WN(p′′), and WN is weakly decreasing on [p′, p′′] and increasing
on [p′′, 1]. We now have b(p′, WN) = λ(p′)[WN(j(p′)) − WN(p′)]/r = c(p′) > 0, hence
WN(j(p′)) > WN(p′) and j(p′) > p′′. As a consequence, WN(j(p′′)) > WN(j(p′)) and
b(p′′, WN) = λ(p′′)[WN(j(p′′)) − WN(p′′)]/r > λ(p′)[WN(j(p′)) − WN(p′)]/r = c(p′) >
c(p′′), which is a contradiction. This establishes that WN is increasing on [p̃N, 1], and
k is increasing on [p̃N, p†N].
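The monotonicity arguments above rely on two facts about the post-success belief: good news makes the players more optimistic (j(p) > p) and j is increasing. A quick numeric sketch using the Bayesian jump update j(p) = λ1p/λ(p) with λ(p) = pλ1 + (1−p)λ0; the intensity values are illustrative assumptions:

```python
# Numeric check of two properties of the jump operator j(p) = λ1 p / λ(p)
# that the monotonicity arguments above use repeatedly.
lam1, lam0 = 2.0, 0.5   # illustrative intensities, lam1 > lam0 > 0

def lam(p):
    return p * lam1 + (1 - p) * lam0

def j(p):
    return lam1 * p / lam(p)

grid = [i / 1000 for i in range(1, 1000)]
assert all(j(p) > p for p in grid)                       # a success raises the belief
assert all(j(a) < j(b) for a, b in zip(grid, grid[1:]))  # j is strictly increasing
print("jump operator checks passed")
```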
We thus have b(p, WN) > c(p) on ]p†N, 1], b(p, WN) = c(p) on [p̃N, p†N] and, because
of the monotonicity of WN on [p̃N, 1], b(p, WN) < c(p) on [0, p̃N[. So all players using
the strategy k constitutes an equilibrium. Finally, p̃N < p∗1 by Proposition 2.
Uniqueness was already shown in the main text.
Proof of Corollary 1. With u^{(0)}(p) = λ1hp + λ0h(1−p) + C^{(0)}(1−p)Ω(p)^μ (see (3)),
we seek a sequence of functions u^{(i+1)}, i = 0, 1, …, defined recursively as solutions to
the ODE (A.2). Let α = λ0/λ and, for i ≥ 0, let

u^{(i)}(p) = d1^{(i)} p + d0^{(i)} (1−p) + m^{(i)} (1−p)Ω(p)^μ + (1−p)Ω(p)^α ∑_{η=0}^{i−1} l^{(i−η)} (ln[(λ0/λ1)^{η−1} Ω(p)])^η,

where d1^{(i)}, d0^{(i)}, m^{(i)}, and l^{(i−η)} are constants to be determined; we will show that the
functions u^{(i)} form just such a sequence. Clearly we need

d1^{(0)} = λ1h,   d0^{(0)} = λ0h,   and   m^{(0)} = C^{(0)},

with C^{(0)} being the constant that fixes payoffs above the diagonal where everyone
plays R. The final (summed) term in the equation defining u^{(i)} is vacuous for i = 0.
First note that

u^{(i)}(j(p)) = d1^{(i)} (λ1/λ(p)) p + d0^{(i)} (λ0/λ(p)) (1−p) + m^{(i)} (λ0/λ(p)) (λ0/λ1)^μ (1−p)Ω(p)^μ + (λ0/λ(p)) (λ0/λ1)^α (1−p)Ω(p)^α ∑_{η=0}^{i−1} l^{(i−η)} (ln[(λ0/λ1)^η Ω(p)])^η,

so that the right-hand side of (A.2) takes the form

G^{(i)}(p) = D1^{(i)} p + D0^{(i)} (1−p) + M^{(i)} (1−p)Ω(p)^μ + (1−p)Ω(p)^α ∑_{η=0}^{i−1} L^{(i−η)} (ln[(λ0/λ1)^η Ω(p)])^η,

where

D1^{(i)} = d1^{(i)} λ1 + r(λ1h − s),   D0^{(i)} = d0^{(i)} λ0 − r(s − λ0h),

and

M^{(i)} = m^{(i)} λ0 (λ0/λ1)^μ,   L^{(i−η)} = l^{(i−η)} λ0 (λ0/λ1)^α.
The homogeneous equation, λp(1−p)u′(p) + λ(p)u(p) = 0, has the solution

u0(p) = (1−p)Ω(p)^α.
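This can be confirmed symbolically. Writing λ = λ1 − λ0 and Ω(p) = (1−p)/p (our reading of the paper's notation), a sympy sketch with illustrative integer intensities:

```python
import sympy as sp

# Check that u0(p) = (1-p) Ω(p)^α, with α = λ0/λ and λ = λ1 - λ0, solves the
# homogeneous equation λ p(1-p) u'(p) + λ(p) u(p) = 0. Intensities are
# illustrative rationals so that sympy can simplify exactly.
p = sp.symbols('p', positive=True)
lam1, lam0 = sp.Integer(3), sp.Integer(1)
lam = lam1 - lam0
alpha = lam0 / lam
Omega = (1 - p) / p
u0 = (1 - p) * Omega**alpha

residual = lam * p * (1 - p) * sp.diff(u0, p) + (p * lam1 + (1 - p) * lam0) * u0
assert sp.simplify(residual) == 0
print("u0 solves the homogeneous equation")
```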
Using the method of variation of constants, we now write u(p) = a(p)u0(p). The ODE
thus transforms into the following equation for the first derivative of the unknown
function a:

λa′(p) = G^{(i)}(p)/(p(1−p)u0(p)) = D1^{(i)} Ω(p)^{−α}(1−p)^{−2} + D0^{(i)} Ω(p)^{−α+1}(1−p)^{−2} + M^{(i)} Ω(p)^{μ−α+1}(1−p)^{−2} + Ω(p)(1−p)^{−2} ∑_{η=0}^{i−1} L^{(i−η)} (ln[(λ0/λ1)^η Ω(p)])^η.
Make the substitution ω = Ω(p) and define A(ω) = a(p), so that a′(p) = −A′(ω)/p². Then

−λA′(ω) = D1^{(i)} ω^{−α−2} + D0^{(i)} ω^{−α−1} + M^{(i)} ω^{μ−α−1} + ω^{−1} ∑_{η=0}^{i−1} L^{(i−η)} (ln[(λ0/λ1)^η ω])^η,

so

A(ω) = (D1^{(i)}/λ1) ω^{−α−1} + (D0^{(i)}/λ0) ω^{−α} + (M^{(i)}/(λ0 − μλ)) ω^{μ−α} − ∑_{η=0}^{i−1} (L^{(i−η)}/((η+1)λ)) (ln[(λ0/λ1)^η ω])^{η+1} + C^{(i+1)}.
and

m^{(i)} = C^{(0)} [λ0(λ0/λ1)^μ / (λ0 − μλ)]^i.

After a little algebra, we find that the constants in the summation are given by

l^{(i−η)} = −(C^{(i−η)}/η!) [λ0(λ0/λ1)^α/λ]^η − (1/(η+1)!) [λ0(λ0/λ1)^α/λ]^{η+1} ln[(λ0/λ1)^η Ω(ĵ−i)] (1 − ĵ−i) Ω(ĵ−i)^α   for η = 0, …, i−1.
Proof of Corollary 2. Close to the right of p̃N, the dynamics of the belief p given no
success are

dp = −λ [N/(N−1)] [(WN(p) − s)/c(p)] p(1−p) dt.

(A success merely causes a delay before the belief decays to near p̃N again.) As WN is
of class C² to the right of p̃N with WN(p̃N) = s, WN′(p̃N) = 0, and WN′′(p̃N+) ≥ 0, we can
find a positive constant C such that

λ [N/(N−1)] [(WN(p) − s)/c(p)] p(1−p) < C(p − p̃N)²

in a neighborhood of p̃N. Starting from an initial belief p0 > p̃N in this neighborhood,
consider the dynamics dp = −C(p − p̃N)² dt. The solution with initial value p0,

pt = p̃N + 1/(Ct + (p0 − p̃N)^{−1}),

does not reach p̃N in finite time. Since the modified dynamics decrease faster than the
original ones, this result carries over to the true evolution of beliefs.
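The closed-form solution and its failure to reach p̃N in finite time can be verified numerically; the constants below are illustrative assumptions:

```python
# Sanity check: p_t = p̃ + 1/(C t + (p0 − p̃)^{-1}) solves dp = −C (p − p̃)^2 dt
# and stays strictly above p̃ for all finite t. Parameter values are illustrative.
C, p_tilde, p0 = 3.0, 0.2, 0.35

def p(t):
    return p_tilde + 1.0 / (C * t + 1.0 / (p0 - p_tilde))

# finite-difference check of the ODE at a few times
for t in [0.0, 1.0, 10.0, 100.0]:
    eps = 1e-6
    dpdt = (p(t + eps) - p(t - eps)) / (2 * eps)
    assert abs(dpdt + C * (p(t) - p_tilde) ** 2) < 1e-6

# the belief approaches but never reaches p̃ in finite time
assert all(p(t) > p_tilde for t in [10.0 ** k for k in range(8)])
print("closed-form solution verified")
```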
Proof of Proposition 5 (Details). Let p̂N,N−1 denote the belief where the graph of VN∗
cuts DN−1, and consider I = [p̂N,N−1 − ε, pm] with ε > 0 small enough that for all p̂ ∈ I,
the unique solution to (1) that crosses DN−1 at the belief p̂ has positive slope there.
Step 1: Construction of the average payoff function. Fix a belief p̂ ∈ I. On [p̂, 1], we
define up̂ as the unique solution to (1) that starts on DN−1 at p̂. Starting from this initial
condition, we then proceed iteratively as in the proof of Proposition 4, solving “forward”
towards lower beliefs and eventually to p∗N. Between DN−1 and D̄ ∧ D1−1/N, we solve the
indifference ODDE (6); below D̄ ∧ D1−1/N, we solve the ODDE

λp(1−p)u′(p) = λ(p)[u(j(p)) − u(p)] − r[u(p) − s] + (r/N)[λ(p)h − s].

In this manner, we obtain a continuous function up̂ on [p∗N, 1] such that (i) up̂(p) =
s + Nb(p, up̂) − c(p) and b(p, up̂) > c(p) on ]p̂, 1]; (ii) b(p, up̂) = c(p) at all beliefs
p ∈ ]p∗N, p̂] where the point (p, up̂(p)) lies on or below DN−1 and above D̄ ∧ D1−1/N;
(iii) up̂(p) = s + b(p, up̂) − c(p)/N and b(p, up̂) ≤ c(p) at all beliefs p ∈ ]p∗N, p̂] where
(p, up̂(p)) lies on or below D̄ ∧ D1−1/N.
Again proceeding as in the proof of Proposition 4, we establish the existence of a
p̂ ∈ ]p̂N,N−1, pm[ such that the corresponding function up̂ has an interior global minimum
equal to s at some belief p̆ ∈ ]p∗N, p̂[. As up̂′(p̆) = 0, we have λ(p̆)[up̂(j(p̆)) −
s]/r = b(p̆, up̂) = c(p̆)/N < c(p̆). For p̂ = pm, alternatively, the corresponding function
up̂ assumes the value s at pm. As its slope there is positive and c(pm) = 0, we have
λ(pm)[up̂(j(pm)) − s]/r > b(pm, up̂) = c(pm)/N = c(pm). By continuity of up̂ with respect
to p̂, there exists p‡N ∈ ]p̂N,N−1, pm[ such that up‡N, the function up̂ obtained for
p̂ = p‡N, has the following property: there is a belief pN ∈ ]p̆, p‡N[ such that up‡N(pN) = s,
up‡N(p) > s for p > pN, and λ(pN)[up‡N(j(pN)) − s]/r = c(pN).
We define a function ū on [0, 1] by taking ū = up‡N on [pN, 1] and ū = s everywhere
else. We want to establish that ū is increasing on [pN, 1]. The explicit representation
(3) makes this obvious on [p‡N, 1]. Moreover, the argument given in the proof of
Proposition 4 shows that ū is also increasing on [p̌N, p‡N], where p̌N is the rightmost
belief at which the graph of ū crosses D̄ ∧ D1−1/N. Suppose now that ū is not increasing
on [pN, p̌N]. Then there exist beliefs p′ < p′′ in ]pN, p̌N] such that ū′(p′−) ≥ 0,
ū′(p′′−) ≤ 0, and ū is weakly decreasing on [p′, p′′]. As j(p′′) > j(p′) > j(pN) ≥ p̌N,
we have ū(j(p′′)) > ū(j(p′)), hence ū(j(p′′)) − ū(p′′) > ū(j(p′)) − ū(p′) and b(p′′, ū) >
b(p′, ū). This implies ū(p′) = s + b(p′, ū) − c(p′)/N < s + b(p′′, ū) − c(p′′)/N = ū(p′′), a
contradiction.
Monotonicity immediately implies that ū is the average payoff function associated
with the following intensity of experimentation: K(p) = N for p ≥ p‡N; K(p) = N[ū(p) − s]/
[(N−1)c(p)] < N for p̌N < p < p‡N; K(p) = 1 for pN < p ≤ p̌N; K(p) = 0 for p ≤ pN.
Using the explicit form of the relevant ODDE to the left and right of p̌N, respectively, we
see that λp̌N(1−p̌N)[ū′(p̌N+) − ū′(p̌N−)] = r[ū(p̌N) − s − (1 − 1/N)c(p̌N)], so ū has
a kink at p̌N with ū′(p̌N−) > ū′(p̌N+) if and only if the intersection with D̄ ∧ D1−1/N is
below D1−1/N. This kink then corresponds to a jump in the intensity of experimentation
with K(p̌N−) = 1 > K(p̌N+). By construction, K always jumps at pN and ū always has a
kink there. At all other beliefs, K is continuous and ū is once continuously differentiable.
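The kink formula can be checked symbolically: to the left of the switch point, ū satisfies the payoff identity ū = s + b − c/N; to the right, the indifference condition b = c holds. Subtracting the two implied expressions for λp(1−p)ū′, with all quantities at the switch belief treated as free symbols (the bookkeeping below is ours):

```python
import sympy as sp

# Kink size at the regime-switch belief. With b(p,u) = (λ(p)[u(j(p)) − u] − λp(1−p)u')/r:
# left regime  u = s + b − c/N  gives  λp(1−p)u'(−) = λ(p)(uj − u) − r(u − s) − rc/N,
# right regime b = c            gives  λp(1−p)u'(+) = λ(p)(uj − u) − rc.
lam_p, u, uj, r, s, c, N = sp.symbols('lambda_p u u_j r s c N', positive=True)
slope_left = lam_p * (uj - u) - r * (u - s) - r * c / N
slope_right = lam_p * (uj - u) - r * c

kink = sp.simplify((slope_right - slope_left) - r * (u - s - (1 - 1 / N) * c))
assert kink == 0   # matches the displayed jump r[ū − s − (1 − 1/N)c]
print("kink identity verified")
```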
Step 2: Construction of the players’ payoff functions and strategies. We define

b̄(p, u) = {λ(p)[ū(j(p)) − u(p)] − λp(1−p)u′(p)}/r
for any left-differentiable real-valued function u on ]0, 1]. (This is the benefit of
experimentation when the value after a success is given by the payoff function ū.)
Fix any two beliefs pℓ < pr in [pN, p̌N] and consider the four functions uℓF, uℓE,
urF, and urE on [pℓ, pr] that are uniquely determined by the following properties:
uℓF(pℓ) = uℓE(pℓ) = ū(pℓ); urF(pr) = urE(pr) = ū(pr); on ]pℓ, pr], uℓF and urF solve the
free-rider ODE u(p) = s + b̄(p, u), while uℓE and urE solve the experimenter ODE u(p) =
s + b̄(p, u) − c(p). By construction, [(N−1)uℓF + uℓE]/N coincides with ū at pℓ and
solves the same ODE as ū on ]pℓ, pr], namely u(p) = s + b̄(p, u) − c(p)/N, so it must
coincide with ū on [pℓ, pr]. The same argument applies to [(N−1)urF + urE]/N. We can
thus conclude that (N−1)uℓF + uℓE = (N−1)urF + urE on [pℓ, pr].
Next, we have uℓF′(pℓ+) > ū′(pℓ+) since lim_{p↓pℓ} b̄(p, uℓF) = b̄(pℓ, ū) − c(pℓ)/N <
b̄(pℓ, ū) and lim_{p↓pℓ} [ū(j(p)) − uℓF(p)] = ū(j(pℓ)) − ū(pℓ). Thus, uℓF(p) > ū(p)
immediately to the right of pℓ. Now there cannot exist a belief p′ ∈ ]pℓ, pr] such that
uℓF(p′) = ū(p′) and uℓF′(p′) ≤ ū′(p′), because we would then have c(p′)/N = b̄(p′, ū) −
b̄(p′, uℓF) = −λp′(1−p′)[ū′(p′) − uℓF′(p′)]/r ≤ 0, a contradiction. This implies that
uℓF > ū on the entire interval ]pℓ, pr]. Analogous arguments establish that uℓE < ū on
]pℓ, pr] as well as urF < ū and urE > ū on [pℓ, pr[.
In particular, there exists a belief p ∈ ]pℓ, pr[ such that uℓE(p) = urF(p). Let p1 denote
the lowest such belief and define a continuous function u1(·|pℓ, pr) on [pℓ, pr] by
setting u1(·|pℓ, pr) = uℓE on [pℓ, p1] and u1(·|pℓ, pr) = urF on [p1, pr]. Using the identity
(N−1)uℓF + uℓE = (N−1)urF + urE, we see that urE(p1) − uℓF(p1) = (N−2)[uℓF(p1) −
urF(p1)]. If N = 2, we define a continuous function u2(·|pℓ, pr) on [pℓ, pr] by setting
u2(·|pℓ, pr) = uℓF on [pℓ, p1] and u2(·|pℓ, pr) = urE on [p1, pr].
If N > 2, we consider the function u[2] that coincides with uℓF at p1 and solves the
experimenter ODE u(p) = s + b̄(p, u) − c(p) on ]p1, pr]. As urE(p1) > uℓF(p1), there
is a belief p ∈ ]p1, pr[ such that u[2](p) = urF(p). Let p2 denote the lowest such belief
and define a continuous function u2(·|pℓ, pr) on [pℓ, pr] by setting u2(·|pℓ, pr) = uℓF on
[pℓ, p1], u2(·|pℓ, pr) = u[2] on [p1, p2], and u2(·|pℓ, pr) = urF on [p2, pr]. By the same
argument as above, (N−2)uℓF + u[2] + urF = (N−1)urF + urE on [p1, p2], which is
easily seen to imply urE(p2) − uℓF(p2) = (N−3)[uℓF(p2) − urF(p2)]. If N = 3, we define
a continuous function u3(·|pℓ, pr) on [pℓ, pr] by setting u3(·|pℓ, pr) = uℓF on [pℓ, p2]
and u3(·|pℓ, pr) = urE on [p2, pr].
If N > 3, we proceed as in the previous paragraph to determine a belief p3 ∈ ]p2, pr[
and a continuous function u3(·|pℓ, pr) that coincides with uℓF on [pℓ, p2],
solves the experimenter ODE on ]p2, p3], and coincides with urF on [p3, pr]. Performing
as many steps as necessary, we end up with beliefs p0 = pℓ < p1 < p2 < · · · < pN−1 <
pN = pr and continuous functions u1(·|pℓ, pr), …, uN(·|pℓ, pr) on [pℓ, pr] such that
un(·|pℓ, pr) coincides with uℓF on [pℓ, pn−1], solves the experimenter ODE on ]pn−1, pn],
and coincides with urF on [pn, pr]. By construction, the average of these N functions
coincides with ū.
Now consider a finite family of contiguous intervals ]pℓi, pri] whose union equals
]pN, p̌N]. For each of these intervals, let pni denote the corresponding belief pn as
determined in the previous paragraph. Define functions u1, …, uN on the unit interval by
setting un = s on [0, pN], un = un(·|pℓi, pri) on ]pℓi, pri], and un = ū on ]p̌N, 1]. For
Proof of Corollary 3. The proof follows the same lines as for Corollary 1. Here we
consider the case where exactly one of the N players is playing R and where a success
results in the players playing symmetrically as in that corollary. The relevant ODE is

λp(1−p)u′(p) + [r + λ(p)]u(p) = k r λ(p)h + (1 − k) r s + λ(p)w(j(p)),

with k = 0 for a free-rider and k = 1 for an experimenter, and where w is the function
u^{(ι)} derived in Corollary 1.
We obtain equations for un of the form

un(p) = d1^{(ι+1)}(kn) p + d0^{(ι+1)}(kn) (1−p) + m^{(ι+1)} (1−p)Ω(p)^{μN} + (1/r) ∑_{η=0}^{ι−1} l^{(ι−η)} λ0 (λ0/λ1)^{λ0/λ} ∑_{γ=0}^{η} (λ/r)^{η−γ} (η!/γ!) (ln[(λ0/λ1)^γ Ω(p)])^γ (1−p)Ω(p)^{λ0/λ},

where

d1^{(ι+1)}(k) = [λ1/(r + λ1)] [λ1h + (λ1h − s) ι r/λ1] + [r/(r + λ1)] [kλ1h + (1 − k)s],

d0^{(ι+1)}(k) = [λ0/(r + λ0)] [λ0h − (s − λ0h) ι r/λ0] + [r/(r + λ0)] [kλ0h + (1 − k)s],

and

m^{(ι+1)} = C^{(0)} [λ0(λ0/λ1)^μ / (λ0 − μλ)]^ι [λ0(λ0/λ1)^μ / (r + λ0 − μλ)]
and

l^{(ι−η)} = −(C^{(ι−η)}/η!) [λ0(λ0/λ1)^{λ0/λ}/λ]^η   for η = 0, …, ι−1.

This leads to the representation stated in the corollary.
Proof of Proposition 7. The proof proceeds in four steps, two of which are simpler
versions of the corresponding steps in the proof of Proposition 5. Let p̂21 denote the
belief where the graph of V2∗ cuts D1, and consider I = [p̂21 − ε, pm] with ε > 0 small
enough that for all p̂ ∈ I, the unique solution to (1) with N = 2 that crosses D1 at the
belief p̂ has positive slope there.
Step 1: Construction of the average payoff function. Fix a belief p̂ ∈ I. On [p̂, 1], we
define up̂ as the unique solution to (1) with N = 2 that starts on D1 at p̂. On [p∗2, p̂[, we
define up̂ as the unique solution to the ODE

λp(1−p)u′(p) = λ(p)[u(j(p)) − u(p)] − r[u(p) − s] + (r/2)[λ(p)h − s]

that ends on D1 at p̂. By construction, up̂ is continuous, up̂(p) = s + 2b(p, up̂) − c(p) on
]p̂, 1], and up̂(p) = s + b(p, up̂) − c(p)/2 on ]p∗2, p̂].
Proceeding as in the proof of Proposition 5, we establish the existence of a p̂ ∈
]p̂21, pm[ such that the corresponding function up̂ has the property that there is a belief
p̄ ∈ ]p∗2, p̂[ such that up̂(p̄) = s, up̂(p) > s for p > p̄, and λ(p̄)[up̂(j(p̄)) − s]/r = c(p̄).
We define a function ū on [0, 1] by taking ū equal to the function up̂ just determined
on [p̄, 1], and ū = s everywhere else. We want to establish that ū is increasing on [p̄, 1].
The explicit representation (3) makes this obvious on [p̂, 1]. Suppose now that ū is not
increasing on [p̄, p̂]. Then there exist beliefs p′ < p′′ in ]p̄, p̂] such that ū′(p′−) ≥ 0,
ū′(p′′−) ≤ 0, and ū is weakly decreasing on [p′, p′′]. As j(p′′) > j(p′) > pm, we have
ū(j(p′′)) > ū(j(p′)), hence ū(j(p′′)) − ū(p′′) > ū(j(p′)) − ū(p′) and b(p′′, ū) > b(p′, ū).
This implies ū(p′) = s + b(p′, ū) − c(p′)/2 < s + b(p′′, ū) − c(p′′)/2 = ū(p′′), a
contradiction.
Step 2: Construction of the players’ payoff functions and strategies. We define

b̄(p, u) = {λ(p)[ū(j(p)) − u(p)] − λp(1−p)u′(p)}/r.

Now consider a finite family of contiguous intervals ]pℓi, pri] whose union equals
]p̄, p̂]. For each of these intervals, let psi denote the corresponding belief ps as
determined above. Define functions u1 and u2 on the unit interval by setting un = s on [0, p̄],
un = un(·|pℓi, pri) on ]pℓi, pri], and un = ū on ]p̂, 1]. Define a simple strategy k1 by
setting k1(p) = 1 if and only if p lies in ]p̂, 1] or one of the intervals ]pℓi, psi], and a
simple strategy k2 by setting k2(p) = 1 if and only if p lies in ]p̂, 1] or one of the
intervals ]psi, pri]. Clearly, u1 and u2 are the payoff functions associated with the strategies
k1 and k2, and ū is the corresponding average payoff function. By construction, u1 is
differentiable at p̄ with u1′(p̄) = 0.
Step 3: Establishing that u2′(p̂) > −λh. Unlike u1, the function u2 is not necessarily
increasing on [p̄, p̂], so we do not know whether its graph lies below the diagonal D1 to
the left of p̂, which will be important for establishing the mutual best-response property
in Step 4 below. Our next aim, therefore, is to show that u2′(p̂) > −λh, implying that u2
stays below D1 to the immediate left of p̂.
We have u2(p̂) = s + b̄(p̂, u2) − c(p̂) = s + c(p̂), hence b̄(p̂, u2) = 2c(p̂) and λp̂ ×
(1−p̂)u2′(p̂) = λ(p̂)[ū(j(p̂)) − s − c(p̂)] − 2rc(p̂). As ū(p) = λ(p)h + 2c(p̂) [(1−p)/
(1−p̂)] [Ω(p)/Ω(p̂)]^{μ2} on [p̂, 1], equation (2) for N = 2 implies λ(p̂)ū(j(p̂)) = λ1²hp̂ +
λ0²h(1−p̂) + 2[r/2 + λ0 − μ2λ]c(p̂). Straightforward computations now reveal that
u2′(p̂) = λh − 2[μ2 + r/(2λ) + p̂] c(p̂)/[p̂(1−p̂)],

and that u2′(p̂) > −λh if and only if p̂ is larger than

p+ = (s − λ0h)[μ2 + r/(2λ)] / (λ1h − s + λh[μ2 + r/(2λ)]),

which is easily seen to lie between p∗2 and pm. As p̂ > p̂21 (the belief where the graph
of V2∗ cuts D1), a sufficient condition for u2′(p̂) > −λh is that p̂21 ≥ p+ or, equivalently,
V2∗(p+) ≤ s + c(p+). Further computations show that the latter is the case if and only if
(μ2 + r/(2λ) + 1)^{μ2+1} / [(μ2 + 1)^{μ2+1} (μ2 + r/(2λ))^{μ2}] ≤ 2.

Now, since r/(2λ) < μ2 and the function h(x) = (μ2 + x + 1)^{μ2+1} (μ2 + x)^{−μ2} is increasing
for x ≥ 0, the left-hand side of this inequality is bounded above by

(2μ2 + 1)^{μ2+1} / [(μ2 + 1)^{μ2+1} (2μ2)^{μ2}],

which is clearly smaller than 2.
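Two computations in Step 3 can be machine-checked; the bookkeeping below is ours, writing c(p) = s − λ(p)h for the opportunity cost of the risky arm and λ = λ1 − λ0, and using g for the function called h(x) in the text to avoid a clash with the lump sum h:

```python
import sympy as sp

# First, sympy confirms that p+ = (s − λ0 h)(μ2 + r/(2λ)) / (λ1 h − s + λ h (μ2 + r/(2λ)))
# is the belief at which λ h p(1−p) = (μ2 + r/(2λ) + p) c(p); the quadratic terms
# in p cancel, so the threshold is unique.
p, r, s, h_, l0, l1, m2 = sp.symbols('p r s h lambda0 lambda1 mu2', positive=True)
lam = l1 - l0
c = s - (p * l1 + (1 - p) * l0) * h_
x = m2 + r / (2 * lam)
eq = sp.expand(lam * h_ * p * (1 - p) - (x + p) * c)
p_plus = (s - l0 * h_) * x / (l1 * h_ - s + lam * h_ * x)
assert sp.simplify(eq.subs(p, p_plus)) == 0

# Second, a numeric check that g(x) = (μ2+x+1)^{μ2+1} (μ2+x)^{−μ2} is increasing
# and that the bound g(μ2)/(μ2+1)^{μ2+1} obtained at x = μ2 stays below 2.
def g(xv, mu2):
    return (mu2 + xv + 1) ** (mu2 + 1) * (mu2 + xv) ** (-mu2)

for mu2 in [0.05 * k for k in range(1, 200)]:
    vals = [g(0.05 * k, mu2) for k in range(1, 100)]
    assert all(a < b for a, b in zip(vals, vals[1:]))
    assert g(mu2, mu2) / (mu2 + 1) ** (mu2 + 1) < 2
print("Step 3 checks passed")
```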
Step 4: Ensuring mutual best responses. As u1 is increasing on [p̄, 1], player 1 is
easily seen to play a best response against k2, irrespective of the choice of intervals
]pℓi, pri]. First, u1 is above D1 on ]p̂, 1]. Second, it is above s and below D1 on ]p̄, p̂[.
Third, we have b(p̄, u1) = λ(p̄)[u1(j(p̄)) − s]/r = λ(p̄)[ū(j(p̄)) − s]/r = c(p̄), as the
left-hand derivative of u1 at p̄ is zero. Fourth, u1(j(p)) is at least weakly increasing and c(p)
is decreasing on [0, p̄[; therefore, b(p, u1) < c(p) on this interval.
Turning to player 2, the fact that u2′(p̂) > −λh allows us, for any δ > 0, to choose a
finite family of intervals ]pℓi, pri] such that the graph of u2 is below the diagonal D1
on ]p̄, p̂[ and the vertical distance u2 − u1 is at most δ at any belief in this interval (and
hence on [0, 1]). If we take δ sufficiently small, player 2 is now also seen to play a best
response. On [p̄, 1], the arguments are exactly the same as for player 1. On [j⁻¹(p̂), p̄[,
u2(j(p)) = ū(j(p)) is increasing and c(p) is decreasing, hence b(p, u2) < c(p). On
[0, j⁻¹(p̂)[, finally, u2(j(p)) ≤ u1(j(p)) + δ and b(p, u1) ≤ b(j⁻¹(p̂), u1) < c(j⁻¹(p̂)),
hence b(p, u2) ≤ b(p, u1) + λ(p)δ/r < c(j⁻¹(p̂)) < c(p) for δ sufficiently small.
Proof of Corollary 4. The proof parallels that of Corollary 1. Here we consider the
general case where K of the N players are playing R, for which the relevant ODE is

λp(1−p)u′(p) + [r/K + λ(p)]u(p) = k(r/K)λ(p)h + (1 − k)(r/K)s + λ(p)u^{(0)}(j(p)),

with k = 0 for a free-rider and k = 1 for an experimenter. As in the proof of Corollary 1,
we take u^{(0)}(p) = λ(p)h + C^{(0)}(1−p)Ω(p)^{μN}, with C^{(0)} being the constant that fixes
payoffs above the diagonal where everyone plays R. Having noted that under condition (7)
the recursion ends after just one iteration, i.e., with u^{(1)}, we obtain equations for un of
the form

un(p) = {[λ1/(rK⁻¹ + λ1)]λ1h + [rK⁻¹/(rK⁻¹ + λ1)][kn(p)λ1h + (1 − kn(p))s]} p
+ {[λ0/(rK⁻¹ + λ0)]λ0h + [rK⁻¹/(rK⁻¹ + λ0)][kn(p)λ0h + (1 − kn(p))s]} (1−p)
+ C^{(0)} [λ0(λ0/λ1)^{μN} / (rK⁻¹ + λ0 − μNλ)] (1−p)Ω(p)^{μN} + Cn^{(1)} (1−p)Ω(p)^{(rK⁻¹+λ0)/λ},

and setting K = 1, N = 2 leads to the representations stated in the corollary.
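As a sanity check on the last term of this representation, sympy confirms (with illustrative rational parameters, and writing Ω(p) = (1−p)/p and λ = λ1 − λ0 for the paper's notation) that (1−p)Ω(p)^{(rK⁻¹+λ0)/λ} lies in the kernel of the operator on the left-hand side of the ODE above:

```python
import sympy as sp

# Check that u(p) = (1−p) Ω(p)^{(r/K+λ0)/λ} solves λp(1−p)u' + (r/K + λ(p))u = 0.
# Parameter choices are illustrative rationals so sympy can simplify exactly.
p = sp.symbols('p', positive=True)
r, K = sp.Integer(1), sp.Integer(2)
lam1, lam0 = sp.Integer(3), sp.Integer(1)
lam = lam1 - lam0
lam_p = p * lam1 + (1 - p) * lam0
expo = (r / K + lam0) / lam
u = (1 - p) * ((1 - p) / p) ** expo

residual = lam * p * (1 - p) * sp.diff(u, p) + (r / K + lam_p) * u
assert sp.simplify(residual) == 0
print("homogeneous term verified")
```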