
The Hanabi Challenge: A New Frontier for AI Research

Nolan Bard2,∗, Jakob N. Foerster1,∗, Sarath Chandar3, Neil Burch2, Marc Lanctot2, H. Francis Song2, Emilio Parisotto4, Vincent Dumoulin3, Subhodeep Moitra3, Edward Hughes2, Iain Dunning2, Shibl Mourad2, Hugo Larochelle3, Marc G. Bellemare3, Michael Bowling2

arXiv:1902.00506v1 [cs.LG] 1 Feb 2019

Abstract
From the early days of computing, games have been important testbeds for
studying how well machines can do sophisticated decision making. In recent
years, machine learning has made dramatic advances with artificial agents reach-
ing superhuman performance in challenge domains like Go, Atari, and some
variants of poker. As with their predecessors of chess, checkers, and backgam-
mon, these game domains have driven research by providing sophisticated yet
well-defined challenges for artificial intelligence practitioners. We continue this
tradition by proposing the game of Hanabi as a new challenge domain with novel
problems that arise from its combination of purely cooperative gameplay and
imperfect information in a two to five player setting. In particular, we argue
that Hanabi elevates reasoning about the beliefs and intentions of other agents
to the foreground. We believe developing novel techniques capable of imbuing
artificial agents with such theory of mind will not only be crucial for their success
in Hanabi, but also in broader collaborative efforts, and especially those with
human partners. To facilitate future research, we introduce the open-source
Hanabi Learning Environment, propose an experimental framework for the re-
search community to evaluate algorithmic advances, and assess the performance
of current state-of-the-art techniques.
Keywords: multi-agent learning, challenge paper, reinforcement learning,
games, theory of mind, communication, imperfect information, cooperative

∗ Equal Contribution
Email addresses: nbard@google.com (Nolan Bard), jakob.foerster@cs.ox.ac.uk
(Jakob N. Foerster), chandar@google.com (Sarath Chandar), burchn@google.com (Neil
Burch), lanctot@google.com (Marc Lanctot), songf@google.com (H. Francis Song),
eparisot@cs.cmu.edu (Emilio Parisotto), vdumoulin@google.com (Vincent Dumoulin),
smoitra@google.com (Subhodeep Moitra), edwardhughes@google.com (Edward Hughes),
iaindunning@gmail.com (Iain Dunning), shibl@google.com (Shibl Mourad),
hugolarochelle@google.com (Hugo Larochelle), bellemare@google.com (Marc G.
Bellemare), bowlingm@google.com (Michael Bowling)
1 University of Oxford, work done at DeepMind
2 DeepMind
3 Google Brain
4 Carnegie Mellon University, work done at Google Brain

Preprint submitted to Artificial Intelligence February 4, 2019


2010 MSC: 00-01, 99-00

1. Introduction

Throughout human societies, people engage in a wide range of activities with a diversity of other people. These multi-agent interactions are integral
to everything from mundane daily tasks, like commuting to work, to operating
the organizations that underpin modern life, such as governments and economic
markets. With such complex multi-agent interactions playing a pivotal role in
human lives, it is essential for artificially intelligent agents to be capable of
cooperating effectively with other agents, particularly humans.
Multi-agent environments present unique challenges relative to those with a
single agent. In particular, the ideal behaviour for an agent typically depends
on how the other agents act. Thus, for an agent to maximize its utility in such a
setting, it must consider how the other agents will behave, and respond appro-
priately. Other agents typically are the most complex part of the environment:
their policies are commonly stochastic, dynamically changing, or dependent on
hidden information that is not observed by everyone. Furthermore, agents gen-
erally need to interact while only having a limited time to observe others.
While these issues have typically made inferring the behaviour of others a
daunting challenge for AI practitioners, humans routinely make such inferences
in their social interactions using theory of mind [1, 2]: reasoning about others as
agents with their own mental states – such as perspectives, beliefs, and intentions
– to explain and predict their behaviour. 5 Alternatively, one can think of theory
of mind as the human ability to imagine the world from another person’s point
of view. For example, a simple real-world use of theory of mind can be observed
when a pedestrian crosses a busy street. Once some traffic has stopped, a driver
approaching the stopped cars may not be able to directly observe the pedestrian.
However, they can reason about why the other drivers have stopped, and infer
that a pedestrian is crossing.
In this work, we examine the popular card game Hanabi, and argue for it as
a new research frontier that, at its very core, presents the kind of multi-agent
challenges that humans solve using theory of mind. Hanabi won the prestigious
Spiel des Jahres award in 2013 and enjoys an active community, including a
number of sites that allow for online gameplay [4, 5]. Hanabi is a cooperative
game of imperfect information for two to five players, best described as a type of
team solitaire. The game’s imperfect information arises from each player being
unable to see their own cards (i.e. the ones they hold and can act on), each of
which has a color and rank. To succeed, players must coordinate to efficiently
reveal information to their teammates; however, players can only communicate through grounded hint actions that point out all of a player’s cards of a chosen

5 Dennett uses the phrase intentional stance to refer to this “strategy” for explanation and prediction [3].

rank or colour. Importantly, performing a hint action consumes the limited re-
source of information tokens, making it impossible to fully resolve each player’s
uncertainty about the cards they hold based on this grounded information alone.
To overcome this limitation, successful play involves communicating extra infor-
mation implicitly through the choice of actions themselves, which are observable
by all players. In Section 2 we provide a more detailed description of the game
and what human gameplay looks like.
Hanabi is different from the adversarial two-player zero-sum games where
computers have reached super-human skill, e.g., chess [6], checkers [7], go [8],
backgammon [9] and two-player poker [10, 11]. In those games, agents typically
compute an equilibrium policy (or equivalently, a strategy) such that no sin-
gle player can improve their utility by deviating from the equilibrium. While
two-player zero-sum games can have multiple equilibria, different equilibria are
interchangeable: each player can play their part of different equilibrium profiles
without impacting their utility. As a result, agents can achieve a meaningful
worst-case performance guarantee in these domains by finding any equilibrium
policy. However, since Hanabi is neither (exclusively) two-player nor zero-sum,
the value of an agent’s policy depends critically on the policies used by its team-
mates. Even if all players manage to play according to the same equilibrium,
there can be multiple locally optimal equilibria that are relatively inferior. For
algorithms that iteratively train independent agents, such as those commonly
used in the multi-agent reinforcement learning literature, these inferior equilib-
ria can be particularly difficult to escape.
The presence of imperfect information in Hanabi creates another challenging
dimension of complexity for AI algorithms. As has been observed in domains like
poker, imperfect information entangles how an agent should behave across mul-
tiple observed states [12, 13]. In Hanabi, this manifests in the communication
protocol 6 between players, where the efficacy of any given protocol depends on
the entire scheme rather than how players communicate in a particular observed
situation. That is, how the other players will respond to a chosen signal will de-
pend upon what other situations use the same signal. Due to this entanglement,
the type of single-action exploration techniques common in reinforcement learn-
ing (e.g., ε-greedy, entropy regularization) can incorrectly evaluate the utility of
such exploration steps as they ignore their holistic impact.
Hanabi’s cooperative nature also complicates the question of what kind of
policy practitioners should seek. One challenge is to discover a policy for the
entire team that has high utility. Most of the prior research on Hanabi has

6 In pure signalling games where actions are purely communicative, policies are often referred to as communication protocols. Though Hanabi is not such a pure signalling game,
when we want to emphasize the communication properties of an agent’s policy we will still
refer to its communication protocol. We will use the word convention to refer to the parts
of a communication protocol or policy that interrelate, such as the situations when a signal
is used and how a teammate responds to that signal. Technically, these can be thought of as
constraints on the policy to enact the convention. Examples of human conventions in Hanabi
will be discussed in Section 2.

focused on this challenge, which we refer to as the self-play setting. However, it
is impractical for human Hanabi players to expect others to strictly abide by a
preordained policy. Not only would it be hard to convey and memorize nuanced
policies, but humans also routinely play with ad-hoc teams that may have players
of different skill levels. Even without agreeing on comprehensive conventions,
humans are still able to play successfully in this ad-hoc setting. Unfortunately,
unlike the aforementioned two-player zero-sum setting, good policies for self-
play have no meaningful worst-case performance guarantee in the ad-hoc team
setting.
Humans approach Hanabi differently than current learning algorithms. In-
stead of searching for conventions in advance to produce complete policies, hu-
mans seem to establish conventions to search for their next action, relying on
others to reason about their intent and view of the game to infer salient signals.
These basic conventions for search tend to be relatively intuitive: ensure the
next player has enough information to make a useful move or has an informa-
tion token available so they can provide a hint. Humans combine these basic
conventions for search with pragmatic reasoning, enabling more effective use of
grounded messages (e.g. implicitly signalling a card is playable). Pragmatic
reasoning thus represents an implicit convention for how the vocabulary of the
game’s grounded messages should be used and interpreted when referring to a
specific card. Humans also rely on theory of mind for Hanabi’s more advanced
“finesse” plays: these plays often appear irrational on the surface (which makes them vulnerable to misinterpretation) in order to cue players to infer additional
information by reasoning about what the acting player could have observed that
would make such a move rational. Relying on simple conventions for search,
pragmatic reasoning, and theory of mind makes it possible for humans to join
an ad-hoc team where complete policies are not agreed upon and memorized in
advance. This makes theory of mind central to Hanabi’s gameplay, particularly
in such an ad-hoc team setting.
The combination of cooperation, imperfect information, and limited com-
munication make Hanabi ideal for investigating the importance of incorporating
theory of mind into AI agents. Whether it is for efficiency of communication or
for artificial agents to better understand and collaborate with human partners,
machine learning practitioners will need new techniques for designing agents
capable of reasoning about others. In Section 2 we describe the details of the
game and how humans approach it. In Section 3 we elaborate on the challenge it
poses for artificial intelligence and why theory of mind is expected to be critical
for effective play. We facilitate agent implementation and experimentation by
providing an open source code framework (Section 4.1). Furthermore, to pro-
mote consistent and detailed empirical evaluation in future research, we propose
an evaluation benchmark for both the self-play (Section 4.2) and ad-hoc (Sec-
tion 4.3) team settings. We evaluate the performance of current state-of-the-art
reinforcement learning methods in Section 5. Our results show that although
these learning techniques can achieve reasonable performance in self-play, they
generally fall short of the best known hand-coded agents (Section 5.3). More-
over, we show that these techniques tend to learn extremely brittle policies that

are unreliable for ad-hoc teams (Section 5.4). These results suggest that there
is still substantial room for technical advancements in both the self-play and ad-
hoc settings, especially as the number of players increases. Finally, we highlight
connections to prior work in Section 6.

2. Hanabi: The Game

Hanabi is a game for two to five players, best described as a type of coop-
erative solitaire. Each player holds a hand of four cards (or five, when playing
with two or three players). Each card depicts a rank (1 to 5) and a colour (red,
green, blue, yellow, and white); the deck (set of all cards) is composed of a
total of 50 cards, 10 of each colour: three 1s, two 2s, 3s, and 4s, and finally a
single 5. The goal of the game is to play cards so as to form five consecutively
ordered stacks, one for each colour, beginning with a card of rank 1 and ending
with a card of rank 5. What makes Hanabi special is that, unlike most card
games, players can only see their partners’ hands, and not their own.
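
To make the card counts concrete, the following minimal Python sketch (purely illustrative; not part of the paper's code release) builds the 50-card deck described above and checks the totals.

from itertools import product

COLORS = ["R", "G", "B", "Y", "W"]       # red, green, blue, yellow, white
COUNTS = {1: 3, 2: 2, 3: 2, 4: 2, 5: 1}  # copies of each rank within one colour

# Build the full deck as (colour, rank) pairs.
deck = [(c, r) for c, r in product(COLORS, COUNTS) for _ in range(COUNTS[r])]

assert len(deck) == 50                   # 10 cards per colour, 5 colours
assert sum(COUNTS.values()) == 10        # three 1s, two each of 2s-4s, one 5
MAX_SCORE = len(COLORS) * 5              # a perfect game plays ranks 1-5 in every colour
assert MAX_SCORE == 25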

[Figure 1 graphic: the four hands (P0–P3), the colour stacks, the deck, and the discard pile, with counters showing 6/8 information tokens and 3/3 life tokens.]

Figure 1: Example of a four player Hanabi game from the point of view of player 0. Player 1
acts after player 0 and so on.

In Hanabi players take turns doing one of three actions: giving a hint, playing
a card from their hand, or discarding a card. We call the player whose turn it
is the active player.
Hints. On their turn, the active player can give a hint to any other player. A
hint consists of choosing a rank or colour, and indicating to another player all
of their cards which match the given rank or colour. Only ranks and colours that are present in that player’s hand can be hinted.
the active player may tell Player 2, “Your first and third cards are red.” or
“Your fourth card is a 3.” To make the game interesting, hints are in limited
supply. The game begins with the group owning eight information tokens, one of
which is consumed every time a hint is given. If no information tokens remain,
hints cannot be given and the player must instead play or discard.
Discard. Whenever fewer than eight information tokens remain, the active
player can discard a card from their hand. The discarded card is placed face up
(along with any unsuccessfully played cards), visible to all players. Discarding

has two effects: the player draws a new card from the deck and an information
token is recovered.
Play. Finally, the active player may pick a card (known or unknown) from their
hand and attempt to play it. Playing a card is successful if the card is the next
in the sequence of its colour to be played. For example, in Figure 1 Player 2’s
action would be successful if they play their yellow 3 or their blue 1; in the
latter case forming the beginning of the blue stack. If the play is successful, the
card is placed on top of the corresponding stack. When a stack is completed
(the 5 is played) the players also receive a new information token (if they have
fewer than eight). The player can play a card even if they know nothing about
it; but if the play is unsuccessful, the card is discarded (without yielding an
information token) and the group loses one life, possibly ending the game. In either case, a new card is drawn from the deck.
Game Over. The game ends in one of three ways: either because the group
has successfully played cards to complete all five stacks, when three lives have
been lost, or after a player draws the last card from the deck and every player
has taken one final turn. If the game ends before three lives are lost, the group
scores one point for each card in each stack, for a maximum of 25; otherwise,
the score is 0.
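
The play and scoring rules just described can be summarised in a short sketch; the function names below are illustrative and are not the Hanabi Learning Environment API.

def is_playable(card, stacks):
    # A card is playable iff it is the next rank needed on its colour's stack
    # (an empty stack needs a 1).
    colour, rank = card
    return stacks.get(colour, 0) + 1 == rank

def final_score(stacks, lives_lost):
    # One point per card successfully played, unless all three lives were lost.
    return 0 if lives_lost >= 3 else sum(stacks.values())

# Example: red stack built up to 3, blue up to 1, one life lost so far.
stacks = {"R": 3, "B": 1}
assert is_playable(("R", 4), stacks)
assert not is_playable(("B", 3), stacks)
assert final_score(stacks, lives_lost=1) == 4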

2.1. Basic Strategy


There are too few information tokens to provide complete information (i.e.,
the rank and colour) for each of the 25 cards that can be played through only
the grounded information revealed by hints.7 While the quantity of information
provided by a hint can be improved by revealing information about multiple
cards at once, the value of information in Hanabi is very context dependent.
To maximize the team’s score at the end of the game, hints need to be selected
based on more than just the quantity of information conveyed. For example in
Figure 1, telling Player 3 that they hold four blue cards reveals more information
than telling Player 2 that they hold a single rank-1 card, but lower-ranked cards
are more important early on, as they can be played immediately. A typical game
therefore begins by hinting to players which cards are 1s, after which those
players play those cards; this both “unlocks” the ability to play the same-colour
2s and makes the remaining 1s of that colour useful for recovering information
tokens as players can discard the redundant cards.
Players are incentivized to avoid unsuccessful plays in two ways: first, losing
all three lives results in the game immediately ending with zero points; second,
the card itself is discarded. Generally speaking, discarding all cards of a given
rank and colour is a bad outcome, as it reduces the maximum achievable score.

7 Since the deck contains 50 cards, 25 of which players aim to play, and when the deck runs out the players will always hold at least 10 cards that cannot recover usable information
tokens, this means at most 15 cards can be discarded to recover information tokens. Combined
with the eight initial information tokens, and the four recovered through completing colour
stacks, this means players can hint at most 27 times during the game (and fewer times as the
number of players increases).

For example, in Figure 1 both green 2s have been discarded, an effective loss
of four points as no higher rank green cards will ever be playable. As a result,
hinting to players that are at risk of discarding the only remaining card of a
given rank and colour is often prioritized. This is particularly common for rank-
5 cards since there is only one of each colour and they often need to be held for
a long time before the card can successfully be played.
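
The counting argument in footnote 7 can be checked directly; the quantities below are taken from that footnote, and the comments restate its assumptions.

deck_size = 50               # total cards in the deck
playable_cards = 25          # ranks 1-5 in each of five colours
held_when_deck_empty = 10    # minimum cards still in hands when the deck runs out

max_useful_discards = deck_size - playable_cards - held_when_deck_empty  # 15
initial_tokens = 8
stack_completions = 4        # the footnote counts four token refunds from completed stacks

max_hints = initial_tokens + max_useful_discards + stack_completions
assert max_hints == 27       # the bound stated in footnote 7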

2.2. Implicit Communication


While explicit communication in Hanabi is limited to the hint actions, every
action taken in Hanabi is observed by all players and can also implicitly com-
municate information. This implicit information is not conveyed through the
impact that an action has on the environment (i.e., what happens) but through
the very fact that another player decided to take this action (i.e., why it hap-
pened). This requires that players can reason over the actions that another
player would have taken in a number of different situations, essentially reason-
ing over the intent of the agent. Human players often exploit such reasoning
to convey more information through their actions. Consider the situation in
Figure 1 and assume the active player (Player 0) knows nothing about their
own cards, and so they choose to hint to another player. One option would
be to tell Player 1 about the 1s in their hand. However, that information is not
particularly actionable, as the yellow 1 is not currently playable. Instead, they
could tell Player 1 about the red card, which is a 1. Although Player 1 would
not explicitly know the card is a 1, and therefore playable, they could infer that
it is playable as there would be little reason to tell them about it otherwise,
especially when Player 2 has a blue 1 that would be beneficial to hint. They
may also infer that because Player 0 chose to hint with the colour rather than
the rank, that one of their other cards is a non-playable 1. This is an example
of the type of pragmatic reasoning that humans commonly use.
An even more effective, though also more sophisticated, tactic commonly
employed by humans is the so-called “finesse” move. To perform the finesse in
this situation, Player 0 would tell Player 2 that they have a 2. By the same
pragmatic reasoning as above, Player 2 could falsely infer that their red 2 is the
playable white 2 (since both green 2s were already discarded). Player 1 can
see Player 2’s red 2 and realize that Player 2 will make this incorrect inference
and mistakenly play the card, leading Player 1 to question why Player 0 would
have chosen this seemingly irrational hint. The only rational explanation for
the choice is that Player 1 themselves must hold the red 1 (in a predictable
position, such as the most recently drawn card) and is expected to rescue the
play. Using this tactic, Player 0 can reveal enough information to get two cards
played using only a single information token. There are many other moves that
rely on this kind of reasoning about intent to convey information (e.g., bluff,
reverse finesse) [14]. We will use finesse to broadly refer to this style of move.

3. Hanabi: A Challenge for Communication and Theory of Mind

Due to Hanabi’s limited communication, players cannot rely solely on the grounded hints to perform well: players, including artificial agents, will need to
convey extra information through their actions. We argue this makes Hanabi a
novel challenge for artificial intelligence practitioners that requires algorithmic
advancements capable of learning to communicate and exploiting the theory
of mind reasoning commonly employed by humans. This challenge connects
research from several communities, including reinforcement learning, game the-
ory, and emergent communication. Section 6 has additional discussion of related
work from these communities.
One avenue to approach this problem is to establish conventions prior to play.
Abiding by such conventions enables agents to communicate information about
their observations (i.e., their perspective) beyond the grounded information
contained in hints. To illustrate, one convention (possibly a poor one) could
be that players never hint about a card unless they observe that it is playable.
It then follows that if an agent receives a hint about one of their cards, they
know that card is playable. Note that conventions can communicate information
either with or without a direct signal: if players expect some behaviour in certain
situations, then the absence of that behaviour indicates they are not in one of
these situations. For example, if players expect to be warned when they are at
risk of discarding a valuable card, like a 5, a lack of warning would indicate they
are not at risk. Conventions can also be used to relay more detailed information
through hints by encoding it into a player’s choice of colour or rank hint and
the target of their hint. In fact, this idea forms the basis of Cox and colleagues’
Hanabi agents [15], which we discuss in Section 5.2.
Conventions are obviously useful in the self-play setting, where all players
know others will act exactly as they would. Although experienced human players
develop and exploit conventions [14], humans are clearly able to play effectively
outside of the self-play setting: human teams consist of distinct individuals, each
with their own capabilities, and are often formed on the fly at online websites [4]
or social events. Notably, humans collaborate successfully without relying on
prior coordination or knowing the policy of their teammates, commonly referred
to as the ad-hoc team setting [16, 17]. Consequently, for artificial agents to
achieve truly superhuman performance in Hanabi’s collaborative setting, they
will need to master both settings.

3.1. A Challenge for Self-Play Learning


One way to learn conventions is to optimise a joint policy to maximize reward
in self-play. This “search over conventions” will naturally develop conventions
to overcome Hanabi’s information scarcity. However, such an optimisation is
difficult in its own right,8 as Hanabi’s policy space is rife with locally optimal

8 Hanabi is an instance of a decentralized Markov decision process (DEC-MDP) since the players jointly observe the full state of the game. Bernstein and colleagues [18] showed that solving DEC-MDPs is in the nondeterministic exponential time (NEXP) complexity class, i.e., requiring exponential time even if P=NP. Furthermore, Baffier and colleagues show that even a single player version of Hanabi where the player has perfect information of all cards (including the order of cards in the deck) is NP-complete, despite removing any need to communicate [19].

equilibria. One such equilibrium is the babbling equilibrium where players do
not intentionally communicate information to other players, and ignore what
other players tell them. In that case, there is no incentive for a player to start
communicating because they will be ignored, and there is no incentive to pay
attention to other players because they are not communicating.
For learning algorithms to improve over such inferior equilibria, players will
need to jointly deviate, potentially changing established (suboptimal) conven-
tions. However, to change conventions a learning algorithm needs to change
behaviour across many decisions points simultaneously, since what can be in-
ferred from a signal depends on all of the situations in which that signal is given.
Additionally, the receiver of the signal has to simultaneously change how they
respond for the overall deviation to result in a performance improvement. Pre-
vious multi-agent research has explored “cheap talk” channels as a mechanism
to communicate directly to other agents [20, 21, 22], however Hanabi’s restricted
communication structure prevents such cheap talk. Every signal affects perfor-
mance not just through what is communicated but also by its effect on the game
state itself, which creates a further challenge for learning algorithms. Section 5.3
will present experimental results on how current state-of-the-art algorithms fare
on this challenge.

3.2. The Human Approach


Human players appear to approach Hanabi differently: using “conventions
over search” that guide how they choose their immediate action and how they
expect other players will interpret their action. Part of these “conventions over
search” come from how humans use theory of mind (ToM) to ascribe mental
states to agents, such as perspectives, beliefs, and intentions. Humans rely on
theory of mind not only to reason about the potential mental states that could
cause an agent’s behaviour, but also in forming expectations for how other
agents will interpret their behaviour using the same reasoning.
Importantly, this type of theory of mind reasoning is central to not just how
humans approach Hanabi, but more generally to how humans handle commu-
nication and multi-agent interactions. For example, when communicating with
natural language, humans exploit theory of mind through their use of pragmat-
ics: conveying meaning based on not only what was literally said, but also what
is implicated (i.e., suggested) by the speaker based on the context [23]. By re-
lying on a listener to disambiguate a speaker’s meaning through mechanisms
like pragmatic reasoning, communication can be more efficient [24]. Notably,
computational models of human pragmatic reasoning depend on the assumption
that speakers are being cooperative and have the intention of communicating
useful information [25].


Analogously, when humans play Hanabi, players rely on the same mutual
assumption of communicative intent and the acting player (the “speaker”) ex-
pects other players (the “listeners”) to infer the information implicated by their
specific choice of action and the context in which it was taken. Relying on such
inference enables players to efficiently signal about the hidden state of play-
ers’ cards (as in the example of humans hinting about playable cards) despite
typically not having explicit conventions about what an agent’s actions should
communicate, let alone having access to their policy.
Theory of mind and the assumption that other agents have communicative
intent provide useful mechanisms for making communication more efficient and
form the basis for the “conventions over search” employed by humans. While
this assumption imposes conventions in a manner analogous to those learned
by the “search over conventions” approach described for self-play, they are nat-
urally limited to conventions that can be inferred with communicative intent.
As a result, humans can interpret sophisticated plays without explicit agree-
ment because theory of mind makes them “self-explaining” instead of relying
on nuanced information theoretic encoding schemes. For instance, the finesse
described in Section 2 is possible without established agreement because players
assume the hint revealing the initial two was intended to communicate useful
information. These self-explaining conventions suggest a more efficient and pos-
sibly easier-to-learn approach to the self-play learning challenge.

3.3. A Challenge for Ad-hoc Teams


The human approach suggests a second challenge presented by Hanabi —
ad-hoc teams — where agents must collaborate with an arbitrary team, pos-
sibly an established team with existing conventions. This ad-hoc team setting
is important, especially for exploring collaboration with humans, since it is un-
reasonable for an agent to impose its policy on others or for human players to
memorise entire policies before playing. In the most extreme case for ad-hoc
play, an agent would have no knowledge of the other players’ conventions or pol-
icy to rely on and they must attempt to infer them on the fly. The less extreme
case allows an agent to observe a small number of examples of the team playing
together, before playing with the team, giving the agent an opportunity to first
infer intent from examples of collaboration.
Self-play training allows agents to expect their teammates share the exact
same policy and conventions. Indeed, as we demonstrate in Section 5.4, the
resulting policies of state of the art algorithms rely heavily on this fact. As a
result they often perform poorly in the ad-hoc team setting. Human Hanabi
players provide inspiration for the ad-hoc team setting as well. In particular,
the use of theory of mind and reasoning about other players as agents with com-
municative intent may lead to artificial agents that are capable of playing with
other intentional agents as well as collaborating with humans by communicating
with them on their terms.
Both the self-play and ad-hoc settings present difficult challenges for current
artificial intelligence techniques that will require long term algorithmic innova-
tion. In the next section, we introduce an open-source learning environment

to facilitate research and propose an experimental framework for the research
community to evaluate algorithmic advances in both settings.

4. Hanabi: The Benchmark

We propose using Hanabi as a challenging benchmark problem for AI, with a set of unique properties that distinguish it from other benchmarks. It is
a multi-agent learning problem, unlike, for example, the Arcade Learning En-
vironment [26]. It is also an imperfect information game, where players have
asymmetric knowledge about the environment state, which makes the game
more like poker than chess, backgammon, or Go. The cooperative goal of Han-
abi sets it further apart from all of these other challenge problems, which have
players competing against each other. As mentioned above, this combination of
partial observability and cooperative rewards creates unique challenges around
the learning of communication protocols and policies. Moreover, unlike sig-
nalling games [27] the communication in Hanabi does not use a separate chan-
nel, but rather mixes communication and environment actions. Finally, the
resulting coordination and communication problem in Hanabi was designed to
be challenging to human players.
The previous section suggests two different challenges presented by Hanabi.
The first, easier, problem is to learn a fixed policy for all players in self-play. In
this case, the learning process is in control of all players, and the objective is to
maximise the expected utility of the resulting joint policy. The second problem,
ad-hoc team play, is learning to play with a set of unknown partners, within a
human timescale of a few games rather than millions of offline interactions.
In the spirit of the proposed evaluation protocols [28] for the Arcade Learning
Environment, which discuss recommendations for learning in Atari games, we
will make some recommendations on how research should be carried out under
each challenge.

4.1. Open Source Environment


To help promote consistent, comparable, and reproducible research results,
we have released an open source Hanabi RL environment called the Hanabi
Learning Environment.9 Written in Python and C++, the code provides an
interface similar to OpenAI Gym [29]. It includes an environment state class
which can generate observations and rewards for an agent, and can be advanced
by one step given agent actions. An agent only needs to be able to generate an
integer action, given an observation bitstring.
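
To illustrate the kind of interface described above, here is a hypothetical Gym-style interaction loop with a random agent; the method names and observation fields (reset, step, legal_moves) are placeholders and may not match the released code exactly.

import random

def play_random_episode(env):
    # Hypothetical Gym-style loop: reset, act on legal moves, accumulate reward.
    observation = env.reset()
    total_reward, done = 0, False
    while not done:
        legal_moves = observation["legal_moves"]   # assumed observation field
        action = random.choice(legal_moves)        # the agent only emits an action id
        observation, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward                            # equals the final game score
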
The default agent observation in our Hanabi environment includes card
knowledge from previous hint actions, which includes both positive and neg-
ative information (e.g., “the first card is red” also says all other cards are not
red). This removes the memory task from the challenge, but humans tend to

9 The code release is being finalized. It will appear on the DeepMind github.

find remembering cards to be an uninteresting distraction, and the experimental
results in Section 5 show that the game remains challenging without requiring
agents to learn how to remember card knowledge. For researchers interested in
memory, we provide the option to request a minimal observation, which does
not include this card knowledge.
For debugging purposes, the code includes an environment for a small game
with only two colors, two cards per player, three information tokens, and a single
life token. There is also a very small game with a single color.

4.2. Challenge One: Self-Play Learning


The first challenge is focused on finding a joint policy that achieves a high
expected score entirely through self-play learning. A practical advantage of the
Hanabi benchmark is that the environment is extremely lightweight, both in
terms of memory and compute requirements, and fast (around 0.1ms per turn
on a CPU). It can therefore be used as a testbed for RL methods that require a
large number of samples without causing unreasonable compute requirements.
However, developing sample efficient algorithms is also an important goal for
RL algorithms in its own right. With this in mind we propose two different
regimes for the Hanabi benchmark:
Sample limited regime (SL). In the sample limited regime, we are inter-
ested in pushing the performance of sample-efficient algorithms for learning to
play Hanabi. To that extent, we propose to limit the number of environment
steps that the agent can experience to be at most 100 million. Here environ-
ment steps count the total number of turns taken during training. If the current
episode does not end at 100 million steps, then we let the agent finish the episode
before terminating training. This regime is similar to the evaluation scheme for
Atari 2600 games proposed by [28]. The 100 million step limit was chosen based
on the learning curves of the Rainbow agent presented in Section 5 to make sure
that the current state-of-the-art agents can achieve a decent score in the given
amount of time.
Unlimited regime (UL). In the unlimited regime there are no restrictions
on the amount of time or compute. The unlimited regime describes research
where the focus is on asymptotic performance, such as achieving high perfor-
mance using large-scale computation. However, we encourage all work on Han-
abi to report the compute requirements and run-time of their methods alongside the final results.
For every k-player game (where k ∈ {2, 3, 4, 5}), we recommend the following
details be reported. Here the best agent is the training run with the highest
average score under test conditions at the end of training, e.g., when disabling
exploration and picking the greedy action.
• The training curves for all random number generator seeds, highlighting
the best agent.
• The histogram of game scores for the best agent and the percentage of
perfect games.

• The mean and standard deviation of the performance of the best agent, computed as an average performance across at least 1000 trials (i.e., full games); a minimal sketch of this computation follows the list. In the future, as performance increments become smaller this number should be increased to allow for significant results.
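
A minimal sketch of the recommended reporting statistics, assuming scores are collected as a flat list of per-game totals (the function and its output format are illustrative, not a prescribed tool):

import numpy as np

def report_statistics(scores, max_score=25):
    # Summarise at least 1000 evaluation games: mean, standard error of the mean,
    # percentage of perfect games, and a score histogram.
    scores = np.asarray(scores, dtype=float)
    sem = scores.std(ddof=1) / np.sqrt(len(scores))
    histogram, _ = np.histogram(scores, bins=np.arange(0, max_score + 2))
    return {
        "mean": scores.mean(),
        "sem": sem,
        "percent_perfect": 100.0 * (scores == max_score).mean(),
        "histogram": histogram,
    }

# Example with dummy data; real evaluations should use at least 1000 full games.
stats = report_statistics(np.random.randint(15, 26, size=1000))
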
As we show in Section 5, Hanabi is difficult for current learning algorithms.
Even when using a large amount of data and computation time (UL regime),
learning agents have trouble approaching the performance of hand-crafted rules
in four player games, and fall far short of hand-crafted rules for three and five
players.

4.3. Challenge Two: Ad-hoc Teams


The second challenge Hanabi poses is that of ad-hoc team play. The ultimate
goal is agents that are capable of playing with other agents or even human
players. For this, a policy which achieves a high score in self-play is of little use
if it must be followed exactly by teammates. Good strategies are not unique,
and a robust player must learn to recognize intent in other agents’ actions and
adapt to a wide range of possible strategies.
We propose to evaluate ad-hoc team performance by measuring an agent’s
ability to play with a wide range of teammates it has never encountered before.
This is measured via the score achieved by the agent when it is paired with
teammates chosen from a held-out pool of agents.10 The composition of that
pool should be such that the players exhibit diverse strategies, which can be
hard-coded or learned by self-play.
We recommend the evaluated agents be given ten random self-play games
of its ad-hoc teammates prior to play. While other alternatives may be more
challenging (e.g., ten “warm-up” games, or average performance in the first ten
games with no prior information), this focuses the challenge on an agent’s ability
to recognize intent in other agents’ behaviour, as they can observe examples of
the intended properly coordinated behaviour.
We recommend that the mean and standard deviation of the performance
be reported across at least 1000 trials for each hold-out team. Specifically, each
trial for a particular hold-out team should be evaluated by giving the agent a
set of ten self-play games for the team, followed by the agent playing a single
game in place of a player from the hold-out team in a random position, and
finally resetting the agent so it does not retain memory across trials. The 1000
trials should also involve at least 100 different random sets of self-play games
provided to the agent. These results should be reported along with mean and

10 For two-player games forming a team is straightforward: we pair the evaluated agent with a random player from the pool and randomly permute their order in the team. For
three to five-player games, the presence of more than one player from the pool is a potential
confounding factor: it could be that the team fails because of the interaction between these
players and through no fault of the evaluated agent. We therefore recommend to limit teams
to two unique players — the evaluated agent, in one seat, and one agent from the pool —
which is replicated in the remaining seats.

standard deviations of the performance of the hold-out teams in self-play (as a
baseline comparison). We further recommend crosstables of the hold-out teams’
performance when paired with each other as a method of assessing the diversity
of the pool (e.g., see Figure 5).
In the future we expect to see canonical agent pools of pre-trained or hard-
coded self-play agents be made available for training and hold-out sets to allow
for consistent comparisons.
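
A schematic of the recommended ad-hoc evaluation procedure; every name here (agent_factory, holdout_team, observe_examples, play_one_game) is a placeholder for whatever infrastructure a researcher uses, not an API provided with the benchmark.

import random

def evaluate_ad_hoc(agent_factory, holdout_team, num_trials=1000, num_example_sets=100):
    # One trial: show the agent ten self-play games of the hold-out team, place it
    # in a random seat for a single game, record the score, then discard the agent
    # so no memory carries over between trials.
    example_sets = [holdout_team.self_play_games(10) for _ in range(num_example_sets)]
    scores = []
    for _ in range(num_trials):
        agent = agent_factory()
        agent.observe_examples(random.choice(example_sets))
        seat = random.randrange(holdout_team.num_players)
        scores.append(holdout_team.play_one_game(agent, seat))
    return scores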

5. Hanabi: State of the Art

Hanabi presents interesting multi-agent learning challenges for both discovering conventions and adapting to an ad-hoc team of players. In this section, we
provide empirical evidence that both tasks are challenging for state-of-the-art
learning algorithms, even with an abundance of computational resources. We
examine the performance of three multi-agent reinforcement learning algorithms
using deep learning for function approximation, and contrast them with a few
of the best known hand-coded Hanabi “bots”.

5.1. Learning Agents


Actor-Critic-Hanabi-Agent. The family of asynchronous advantage actor-
critic algorithms [30] demonstrates stability, scalability and good performance
on a range of single-agent tasks, including the suite of games from the Ar-
cade Learning Environment [26], the TORCS driving simulator [31], and 3D
first-person environments [32]. In the original implementation, the policy is
represented by a deep neural network, which also learns a value function to
act as a baseline for variance reduction. Experience is accrued in parallel by
several copies of the agent running in different instantiations of the environ-
ment. Learning gradients are passed back to a centralized server which holds
the parameters for a deep neural network.
Since the environment instantiations and the server interact asynchronously,
there is a potential for the learning gradients to become stale, which impacts neg-
atively on performance. ACHA uses the Importance Weighted Actor-Learner
variant to address the stale gradient problem [33] by adjusting the stale off-
policy updates using the V-trace algorithm. The variant has been successfully
applied to the multi-agent task of Capture-the-Flag, achieving human-level per-
formance [34]. ACHA also incorporates population-based training [35], provid-
ing automatic hyperparameter optimization.
For our experiments, ACHA was run with a population size of 30 to 50 per
run, 100 actors generating experience in parallel, and hyperparameter evolu-
tion over the learning rate and entropy regularisation weight. ACHA also used
parameter-sharing across the different players in combination with an agent-
specific ID that is part of the input. Parameter sharing is a standard method
which increases learning speed, while the agent-specific ID allows for some level
of specialisation between agents. Our neural architecture consisted of the fol-
lowing. All observations were first processed by an MLP with a single 256-unit

hidden layer and ReLU activations, then fed into a 2-layer LSTM with 256 units
in each layer. The policy π was a softmax readout of the LSTM output, and
the baseline was a learned linear readout of the LSTM output. We refer to this
method as the Actor-Critic-Hanabi-Agent (ACHA).
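
A rough PyTorch sketch of the ACHA network described above; the layer sizes follow the text, while the input dimension, tensor layout, and class name are illustrative assumptions.

import torch
import torch.nn as nn

class ACHANetwork(nn.Module):
    # 256-unit MLP encoder -> 2-layer LSTM (256 units) -> softmax policy
    # and a learned linear baseline, as described in the text.
    def __init__(self, obs_size, num_actions):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_size, 256), nn.ReLU())
        self.lstm = nn.LSTM(input_size=256, hidden_size=256, num_layers=2)
        self.policy_head = nn.Linear(256, num_actions)
        self.value_head = nn.Linear(256, 1)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: [time, batch, obs_size]
        x = self.encoder(obs_seq)
        x, hidden = self.lstm(x, hidden)
        policy = torch.softmax(self.policy_head(x), dim=-1)
        baseline = self.value_head(x).squeeze(-1)
        return policy, baseline, hidden
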
Rainbow-Agent. Rainbow [36] is a state-of-the-art agent architecture for
deep RL on the Arcade Learning Environment. It combines some of the key
innovations that have been made to Deep Q-Networks (DQN) [37] over the last
few years, resulting in a learning algorithm that is both sample efficient and
achieves high rewards at convergence. In our benchmark we use a multi-agent
version of Rainbow, based on the Dopamine framework [38]. In our code the
agents controlling the different players share parameters. Our Rainbow agent is
feedforward and does not use any observation stacking outside of the last action,
which is included in the current observation.
Our Rainbow agent uses a 2-layer MLP of 512 hidden units each to predict
value distributions using distributional reinforcement learning [39]. Our batch
size, i.e., the number of experience tuples sampled from the replay buffer per
update, is 32, and we anneal the ε of our ε-greedy policy to 0 over the course
of the first 1000 training steps. We use a discount factor, γ, of 0.99 and ap-
ply prioritized sampling [40] for sampling from the replay buffer. Finally, our
value distributions are approximated as a discrete distribution over 51 uniformly
spaced atoms.
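
For reference, the Rainbow settings listed above collected in one place; the dictionary keys are an illustrative grouping and do not correspond to any particular Dopamine configuration file.

RAINBOW_SETTINGS = {
    "network": "2-layer MLP, 512 hidden units per layer",
    "num_atoms": 51,              # support of the discrete value distribution
    "batch_size": 32,             # experience tuples sampled per update
    "epsilon_final": 0.0,         # epsilon annealed to 0 ...
    "epsilon_decay_steps": 1000,  # ... over the first 1000 training steps
    "gamma": 0.99,                # discount factor
    "replay_scheme": "prioritized",
    "history": "last action only (feedforward agent, no observation stacking)",
}
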
BAD-Agent [41]. For the two player self-play setting we also include the
results of the Bayesian Action Decoder since it constitutes state-of-the-art for
the two-player unlimited regime. Rather than relying on implicit belief repre-
sentations such as RNNs, the Bayesian Action Decoder (BAD) uses a Bayesian
belief update that directly conditions on the current policy of the acting agent.
In BAD all agents track a public belief, which includes everything that is com-
mon knowledge about the cards, including the posterior that is induced from
observing the actions different agents take. BAD also explores in the space of
deterministic policies, which ensures informative posteriors while also allowing
for randomness required to explore. Further details for the BAD agent are
provided in [41].

5.2. Rule-Based Approaches


For benchmarking we provide results of a number of independently imple-
mented rule-based strategies. These provide examples of the quality of play that
can be achieved in Hanabi.
SmartBot [42]. SmartBot is a rule-based agent that tracks the publicly known
information about each player’s cards. Tracking public knowledge allows Smart-
Bot to reason about what other players may do, and what additional knowledge
it gains from its specific view of the game. Among other things, this enables
SmartBot to play/discard cards that its partners don’t know it knows are safe
to play/discard, thereby preventing partners from wasting a hint to signal as
much. However, this tracking assumes all other players are using SmartBot’s
convention. When this assumption does not hold, as in the ad-hoc team setting,
SmartBot can fall into false or impossible beliefs. For example, SmartBot can

believe one of its cards has no valid value as all possible cards are inconsistent
with the observed play according to SmartBot’s convention. Finally, note that
SmartBot has a parameter specifying if it should attempt uncertain plays that
may cost a life. Risking lives increases the frequency of perfect games while
reducing average score, except in two player games where it is better on both
criteria. Our SmartBot results only risk lives in the two-player setting.
HatBot [15] and WTFWThat [43]. HatBot uses a technique often seen in
coding theory and “hat puzzles”. When giving hints, HatBot uses a predefined
protocol to determine a recommended action for all other players (i.e., play or
discard for one of a player’s cards). This joint recommendation is then encoded
by summing the indices for the individual recommendations and using modular
arithmetic. The encoded joint recommendation is mapped to different hints
that HatBot could make, specifically, whether it reveals the color or rank of a
card for each other player. Since each player can view everyone’s cards but their
own, they can reconstruct the action recommended to them by figuring out what
would have been recommended to the other players based on HatBot’s conven-
tion, which HatBot assumes they know and use. Although this convention is
not very intuitive for human players, it would still be possible for humans to
learn and follow. Cox and colleagues also introduce an “information strategy”
using a similar encoding mechanism to directly convey information about each
player’s cards (as opposed to a recommended action), however it requires addi-
tional bookkeeping that makes it impractical for humans to use. As originally
proposed, both the recommendation and information strategies were tailored for
playing 5-player Hanabi. However, a variant of the information strategy, called
WTFWThat [43], can play two through five players.
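
The modular-arithmetic idea behind HatBot's recommendation strategy can be illustrated with a toy sketch (a simplification for intuition, not Cox and colleagues' actual protocol): a deterministic public rule maps each visible hand to a recommendation index, the hinter announces the sum of the other players' indices modulo the number of distinguishable hints, and each listener recovers their own index by subtracting the indices they can see.

M = 8  # number of distinguishable hints available to the hinter (toy value)

def recommend(hand):
    # Stand-in for a deterministic public rule mapping a visible hand to a
    # recommendation index in {0, ..., M-1} (e.g., which card to play or discard).
    return hash(tuple(hand)) % M

def encode(hands, hinter):
    # The hinter announces the sum, mod M, of the recommendations for the other players.
    return sum(recommend(hands[p]) for p in hands if p != hinter) % M

def decode(hands, hinter, me, hint_value):
    # A listener subtracts the recommendations they can see (everyone except the
    # hinter and themselves) to recover their own recommendation.
    seen = sum(recommend(hands[p]) for p in hands if p not in (hinter, me))
    return (hint_value - seen) % M

# Toy check with four players holding arbitrary hands.
hands = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [7, 8]}
hint = encode(hands, hinter=0)
assert all(decode(hands, 0, p, hint) == recommend(hands[p]) for p in hands if p != 0)
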
FireFlower [44]. FireFlower implements a set of human-style conventions.
The bot keeps track of both private and common knowledge, including proper-
ties of cards that are implied by the common knowledge of what the conventions
entail. Using this, FireFlower performs a 2-ply search over all possible actions
with a modeled probability distribution over what its partner will do in response,
and chooses the action that maximizes the expected value of an evaluation func-
tion. The evaluation function takes into account the physical state of the game
as well as the belief state. For example, if there is a card in the partner’s hand
that is commonly known (due to inference from earlier actions) to be likely to
be playable, then the evaluation function’s output will be much higher if it is
indeed playable than if it is not. FireFlower implements a few additional con-
ventions for three and four players, but avoids the hat-like strategies in favor
of conventions that potentially allow it to partner more naturally with humans
assuming the human players exactly follow its strategy. Also, according to the
creator, FireFlower is designed with a focus on maximising the win probability,
rather than average score. Further details are in Appendix A.1.

5.3. Experimental Results: Self-Play


Table 1 shows the experimental results of the baseline agents and state-of-
the-art learning algorithms with each number of players. First, note that except
for BAD in the two-player setting, the best learning agent does not reach the

Regime  Agent       2P             3P             4P             5P
–       SmartBot    22.99 (0.00)   23.12 (0.00)   22.19 (0.00)   20.25 (0.00)
                    29.6%          13.8%          2.076%         0.0043%
–       FireFlower  22.56 (0.00)   21.05 (0.01)   21.78 (0.01)   –
                    52.6%          40.2%          26.5%          –
–       HatBot      –              –              –              22.56 (0.06)
                    –              –              –              14.7%
–       WTFWThat    19.45 (0.03)   24.20 (0.01)   24.83 (0.01)   24.89 (0.00)
                    0.28%          49.1%          87.2%          91.5%
SL      Rainbow     20.64 (0.22)   18.71 (0.20)   18.00 (0.17)   15.26 (0.18)
                    2.5%           0.2%           0%             0%
UL      ACHA        22.73 (0.12)   20.24 (0.15)   21.57 (0.12)   16.80 (0.13)
                    15.1%          1.1%           2.4%           0%
UL      BAD         23.92 (0.01)   –              –              –
                    58.6%          –              –              –

Table 1: Shown are the results for the three learning agents, Rainbow, ACHA and BAD,
compared to the rule-based agents, SmartBot, FireFlower and HatBot, for different numbers
of players in self-play. For each algorithm and number of players we show mean performance
of the best agent followed by (standard error of the mean) and percentage of perfect (i.e., 25
point) games. Error of the mean differs based on different number of evaluation games.

performance of the best hand-coded agents. The information-theoretic strategy based on Cox et al. achieves scores over 24 points when playing with three to
five players (including a remarkable 24.89 points and 91.5% perfect games in the
five-player game), showing a large performance gap in what is possible and what
state-of-the-art learning algorithms achieve. Even rule-based strategies aimed
at attempting to codify more human-like conventions achieve scores higher than
the learning algorithms, particularly in the three and five player setting.
The ACHA agent in the unlimited regime (using over 20 billion steps of
experience) achieved higher scores than Rainbow (using 100 million steps of
experience) across all numbers of players. This quite naturally may be due to
less training experience, but may also be due to Rainbow’s feed-forward network
architecture with no history of past actions possibly making it harder to learn
multi-step conventions. Both agents saw a decline in performance as the number of players increased, with Rainbow’s decline being more gradual, while ACHA saw a precipitous
drop in performance with five players.
Figure 2 shows training curves for one run of ACHA showing the performance
of the multiple “agents” within the population. Note that these curves are linked
together through parameter evolution and are not independent. In all but the
four player setting, it is clear ACHA has found a local minimum in policy space
it is unable to escape even with further training. However, parameter evolution
is hiding the true extent of the local minima problem. Figure 3 gives ACHA
training curves for two and four player games when evolution is disabled, so each
curve is an independent self-play trial, using the same fixed hyper-parameters
found as the best in the earlier experiment. Here we can see that independent
learning trials find a wide array of different local minima, most of which are
difficult to escape. For example, in the two player setting, roughly a third of the
independent agents are below 15 points and appear to no longer be improving.

[Figure 2 graphic: four panels (2–5 players). Left plots show training score vs. environment steps (up to 2×10^10); right plots show test score histograms over 1000 games. Panel statistics — 2 players: mean 22.73, median 23.0, s.d. 3.68; 3 players: mean 20.24, median 21.0, s.d. 4.73; 4 players: mean 21.57, median 22.0, s.d. 3.68; 5 players: mean 16.80, median 17.0, s.d. 4.06 (n = 1000 each).]

Figure 2: ACHA results for two to five players, from top to bottom respectively. Performance
curves (left) are training-time results from the current soft-max policy. Average scores and
distributions (right) are test-time results from 1000 episodes generated using the greedy policy
from the agent with the best training score.

[Figure 3 graphic: two panels of training score vs. environment steps (up to roughly 1×10^10) for the independent ACHA runs described in the caption.]

Figure 3: ACHA results without evolution, using 50 independent runs for two players (left),
and 30 independent runs for four players (right).

We further find that even ACHA runs with similar final performance can
learn very different conventions. For example, one agent uses color hints to
indicate that the 4th card of the teammate can likely be discarded, while another
agent uses the different color hints to indicate which card of the teammate can
be played. Different agents also use the rank-hints to indicate playability of
cards in different slots. Details examining specific examples of learned policies
are in Appendix A.2.
In contrast to ACHA, the Rainbow agents exhibit low variance across dif-
ferent independent training runs, as shown by the learning curves in Figure 4.
In this case each line represents an independent training trial. In addition,
Rainbow agents tend to converge to similar strategies, seemingly identifying the
same local minima. In particular, Rainbow agents are 3-4 times less likely to
hint for color than ACHA, and when they do there is no evidence of specific
conventions associated with the color hinted. Instead all Rainbow agents we
looked at primarily hint for rank, typically indicating that the most recent card
is playable. See Appendix A.2 for details. We speculate that two factors con-
tribute to this consistency across different runs. First, Rainbow has a one-step
memory for the past action and no memory of past observations. This drasti-
cally reduces the space of possible strategies. Second, Rainbow is a value-based
method and starts out with a high exploration rate. During this initial exploration,
agents fail to finish any game without running out of lives, so Q-values
will tend towards zero. This might cause agents to learn from effectively the
same starting Q-values, even across different independent runs.
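The second point can be made concrete with a toy tabular calculation (an illustration only, not the Rainbow implementation): if every early episode ends with zero return because the team runs out of lives, every TD target is zero and the Q-values are driven toward zero regardless of how they were initialized, so different runs end up learning from effectively identical value estimates.

import random

# Toy illustration: with zero-return episodes, every TD target is zero,
# so Q-values shrink toward zero no matter how they were initialized.
num_actions = 20
q = [random.uniform(-1.0, 1.0) for _ in range(num_actions)]   # arbitrary initial values
alpha, gamma = 0.1, 0.99

for _ in range(2000):                      # 2000 transitions from zero-scoring games
    a = random.randrange(num_actions)      # high-epsilon exploration: uniform random actions
    reward, bootstrap = 0.0, 0.0           # no points scored, terminal value is zero
    q[a] += alpha * (reward + gamma * bootstrap - q[a])

print(max(abs(v) for v in q))              # close to 0 for every run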

5.4. Experimental Results: Ad-Hoc Team Play


The policies learned by the aforementioned ACHA, BAD and Rainbow agents
are moderately effective in self-play. We now investigate these agents in the
ad-hoc team setting, where teammates are using different conventions. In par-
ticular, we examine the performance of different ad-hoc teams of ACHA and
Rainbow agents. Since we established in Section 5.3 that Rainbow agents learn
nearly identical strategies across different runs, for the rest of the section we

[Figure 4 plots: training curves of score vs. training steps (up to 10^8) on the left, and test-time score histograms (proportion of games vs. points) on the right. Test-time summary statistics over n = 1000 games: two players, mean 20.64, median 21.0, s.d. 6.93; three players, mean 18.71, median 19.0, s.d. 6.18; four players, mean 18.00, median 18.0, s.d. 5.32; five players, mean 15.26, median 16.0, s.d. 5.73.]
Figure 4: Rainbow results for two to five players, from top to bottom respectively. Performance
curves (left) are training-time results from the current policy. Average scores and distributions
(right) are test-time results from 1000 episodes generated using the agent with the best training
score. ε in ε-greedy is set to zero for all agents.

consider a population of agents made up of the best performing Rainbow agent
and a pool of independently trained ACHA agents taken from the top ten agents
according to final training-time performance from Figure 3.
As neither of these algorithms makes use of sample play from its ad-hoc teammates,
the ad-hoc team's performance will simply depend on the compatibility
of the different protocols. The purpose of this section, then, is primarily to
illustrate the difficulty of ad-hoc team play and to suggest a source for creating a
diverse pool of agents for future evaluation.
Two Players. Figure 5a shows a crosstable of the agents’ test performance
in the ad-hoc setting. The entry for the i-th row and j-th column shows the
mean performance of agent i playing in an ad-hoc team consisting of agent j,
evaluated for 1000 games with a random player starting each game. The scores
of 20 or greater along the diagonal show that the agents indeed perform
well in self-play. However, when paired with other agents, performance drops off
sharply, with some agents scoring essentially zero in any of these ad-hoc teams.
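The following sketch illustrates how such a crosstable can be assembled; the play_game helper (which runs one episode with the given pre-trained agents and returns its final score) and the agent objects are hypothetical stand-ins for our evaluation harness.

import random
import statistics

def crosstable(agents, play_game, num_players=2, games_per_pair=1000):
    """Mean ad-hoc score for every (agent, team) pairing.

    Entry [i][j] seats one copy of agents[i] with (num_players - 1) copies of
    agents[j]; shuffling the seating order randomizes which player starts.
    play_game(team) is a hypothetical helper that runs a single Hanabi episode
    with the given pre-trained agents and returns its final score."""
    n = len(agents)
    table = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            scores = []
            for _ in range(games_per_pair):
                team = [agents[i]] + [agents[j]] * (num_players - 1)
                random.shuffle(team)                 # random player starts each game
                scores.append(play_game(team))
            table[i][j] = statistics.mean(scores)
    return table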
[Figure 5 heatmaps: crosstables of mean ad-hoc scores, with Agent ID as rows, Team ID as columns, and a row-average column. Self-play entries on the diagonal are roughly 18–23 points, while most off-diagonal entries fall below 6 points.]
(a) Ad-hoc results for two players. (b) Ad-hoc results for four players.

Figure 5: Ad-hoc team results. Teams were constructed using the 10 best independently
trained ACHA agents (1–10; see Figure 3) and the best Rainbow agent (R; see Figure 4).
Mean scores over 1000 trials are shown. Standard error of the mean was less than 0.09 and
0.255 for two and four players, respectively.

Four Players. We observe analogous results when evaluating ad-hoc teams in
the four player setting, again using the top ten ACHA agents from Figure 3. As
in the two player setting, the entry for the i-th row and j-th column of
Figure 5b shows the mean performance of agent i when playing in an ad-hoc
team consisting of three other agent j players, evaluated for 1000 games with
a random player starting each game. Once again, the agents fare relatively well
in self-play, but performance decreases dramatically once we introduce a second
unique agent to the team.
Developing agents that can learn from, adapt to, and play well with unknown
teammates represents a formidable challenge.

6. Hanabi: Related Work

In this section, we discuss prior work on the game of Hanabi. We also briefly
discuss relevant work in multi-agent reinforcement learning.

6.1. Prior Work on Hanabi AI


To the best of our knowledge, the earliest published work on Hanabi was
in 2015. Osawa described some of the unique elements of Hanabi, and showed
that simulated strategies that try to recognize the intention of the other players
performed better than a fixed set of static strategies in two-player Hanabi [45].
Later in the same year, Cox et al. developed the hat strategy described in
Section 5.2 [15].
Several studies have focused on techniques to achieve strong Hanabi play. First,
van den Bergh et al. described fixed rules whose trigger thresholds were optimized
by manual tuning via simulation [46]; action selection is further improved
during play using Monte Carlo planning. Several rule-based agents and Monte
Carlo tree search players were evaluated in [47]; here, again, a predictor version
that modeled the other players was found to perform better than bots without
this ability. Finally, the best-performing heuristic players combined search with
the hat strategy to improve the frequency of the highest scores [48]. In the
five-player game, when ignoring the rule that hints cannot refer to an empty set
of cards, they reported achieving an average of 24.92 points and a perfect score
92% of the time. While some of these ideas are used in the agents that we
describe in Section 5.2, the agents we benchmark, evaluated under the complete
Hanabi rules, achieve better scores than the numbers reported in these papers.
There are two related works that are not directly aimed at increasing performance.
The first studied the complexity of the generalized game, finding
optimal gameplay in Hanabi to be NP-hard even for a centralised, cheating
player that plays all seats and knows every hand [19]. The second describes how
to model knowledge, and how it is revealed through actions, using dynamic epistemic
logic [49]. These epistemic formalizations within game theory attempt to
quantify what players know and assume others know, how players can reason
about this knowledge, and what rationality means in this context [50]. These
ideas are also reflected in the BAD agent [41], where they are combined with
scalable deep multi-agent RL, and have resulted in the strongest two player
Hanabi agent.
Recently, a Hanabi competition took place at the IEEE Computational
Intelligence in Games Conference in Maastricht [51]. There were a
total of five agents: three submissions and two sample agents. There were two tracks,
called "Mirror" and "Mixed", which broadly match the two categories we propose
in Section 4. Similarly to van den Bergh, the second-place player used a
genetic algorithm to evolve a sequence of rules from a fixed rule set. This agent
achieved an average score of 17.71 points in the Mirror competition, while the
first-place agent, "Monte Carlo NN", achieved a score of 20.56. According to
the website, these scores are averaged across the results for two to five players.
No further detail on this agent was available on the website.

6.2. Reinforcement Learning (RL) and Multi-Agent RL


While there has been much work on rule-based and search-based players using
various heuristics based on domain knowledge, we are unaware of any prior
approaches that learn to play Hanabi directly from experience given only the
rules of the game. The framework for this approach is reinforcement learning,
in which an agent chooses actions in its environment and receives observations
and rewards [52]. The goal is to learn a policy that achieves a high expected
sum of rewards, i.e., score.
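As a concrete illustration, the sketch below shows this basic agent-environment loop using the Python interface of the open-source Hanabi Learning Environment, with a uniformly random policy standing in for a learned one; the exact module, method, and key names follow the open-source release as we recall it and should be treated as assumptions.

import random
from hanabi_learning_environment import rl_env

# One episode with a uniformly random policy standing in for a learned agent.
env = rl_env.make("Hanabi-Full", num_players=2)
observations = env.reset()
done, episode_return = False, 0.0

while not done:
    current = observations["current_player"]
    legal_moves = observations["player_observations"][current]["legal_moves"]
    action = random.choice(legal_moves)              # a learned policy would act here
    observations, reward, done, _ = env.step(action)
    episode_return += reward                         # sum of rewards = final score

print("episode score:", episode_return)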
The setting with multiple reinforcement learning agents was first investi-
gated in competitive games [53]. This was the start of the foundational work on
multiagent reinforcement learning (MARL), which focused on algorithms and
convergence properties in Markov/stochastic (simultaneous move) games. In-
dependent learners in the cooperative case, even in these simpler game models,
already face several coordination problems [54]. Several years of work on MARL
gave rise to many algorithms, extensions, and analyses [55, 56, 57, 58, 59]. For
a more recent survey on multi-agent deep reinforcement learning, see [60].
Model-free methods, which learn a direct mapping from observations to ac-
tions, are appealing in traditional RL because the agent need not understand
the dynamics of the environment in order to act. In games, however, the per-
fect model (i.e., the rules) is given. This leads to methods that can combine
planning with RL, as first demonstrated by TD-Leaf [61] and TreeStrap [62].
Recently, Monte Carlo tree search [63] was combined with deep neural networks
in AlphaGo [64], a computer Go-playing agent that was stronger than any pre-
ceding program and defeated the best human Go players. Shortly thereafter
AlphaGo was surpassed by AlphaGo Zero, which used less knowledge [65], and
then AlphaZero which also learned to play chess and shogi at super-human
levels [66].
Many of the successes above were applications of RL to perfect information
games (e.g., Go, chess, shogi). Partially observable environments offer new
challenges, since the world cannot be perfectly simulated: agents must reason
about information that is not known to them. In the competitive case, such as
poker, strong play has required new algorithmic foundations [67, 68, 11, 10].
Finally, there has been recent interest in the problem of emergent commu-
nication. These problems include an arbitrary or structured communication
channel, and agents must learn to communicate to solve a cooperative prob-
lem. Algorithms have been developed to learn to solve riddles and referential
games [20, 69], gridworld games requiring coordination [70], object identifica-
tion via question-and-answer dialog [71], and negotiation [21, 22]. The main
difference in Hanabi is that there is no “cheap-talk” communication channel:
any signalling must be done through the actions played. It is therefore more
similar in spirit to learning how to relay information to partners in the bidding
round of bridge [72].

6.3. Beliefs, Bayesian Reasoning, and Theory of Mind
There has been much work that proposes to model beliefs about the in-
tentions or plans of other agents. Several formalisms have been proposed for
this, such as I-POMDPs [73] and Bayesian games [74]. However, algorithms to
compute exact solutions quickly become intractable for complex problems.
Classical models to predict human behavior in games include ways to deal
with imperfect rationality [75, 76]. In recent years, several mechanisms have
been proposed to learn such models using deep learning: from human data in
one-shot games [77], by learning players with expert features [78], and by end-to-end
prediction of fixed policies in gridworld games [2]. When predictions are used to
exploit or coordinate with others, this is generally called opponent/teammate modeling [79].
Several MARL algorithms have been recently proposed that learn from mod-
els of others, by using a training regime based on cognitive hierarchies [80], by
defining a learning rule that shapes the anticipated learning of other agents [81],
or by training architectures that incorporate agent identification as part of the
objective and conditioning the policy on these predictions [82]. Lastly, an approach
in poker (DeepStack) introduced function approximators that represent
belief-conditional values, explicitly representing and reasoning about the manipulation
of agents' beliefs [10], similar to the approach taken in BAD [41].

7. Conclusion

The combination of cooperative gameplay and imperfect information makes
Hanabi a compelling research challenge for machine learning techniques in multi-agent
settings. We evaluate state-of-the-art reinforcement learning algorithms
using deep neural networks and demonstrate that they are largely insufficient
to surpass current hand-coded bots when evaluated in a self-play setting. Furthermore,
we show that in the ad-hoc team setting, where agents must play
with unknown teammates, such techniques fail to collaborate at all. Theory of
mind appears to play an important role in Hanabi, not only for the efficiency
and interpretability of agent communication, but also for agents to be capable
of collaborating with human partners that rely on such reasoning. This sug-
gests that there is considerable room to improve in both settings and that novel
techniques will be required to close this gap, hopefully advancing our under-
standing of how our agents can develop something like a theory of mind. To
promote effective and consistent comparison between techniques, we provide a
new open-source code framework for Hanabi and propose an evaluation
methodology for practitioners.

Acknowledgements

We would like to thank many people: Matthieu d'Epenoux of Cocktail
Games and Antoine Bauza, who designed Hanabi, for their support on this
project; Alden Christianson for help with coordinating across three different
time zones, and for discussions with Cocktail Games; Angeliki Lazaridou for discus-
sion and feedback on pragmatics and theory of mind; Ed Lockhart for many dis-
cussions on small Hanabi-like games and belief-based reasoning; Daniel Toyama
for assistance with writing clear, readable code for the Hanabi research environ-
ment used in our experiments; Kevin Waugh for helpful comments and feedback;
David Wu for providing details on the FireFlower bot; and Danny Tarlow for
early feedback on the project.

References

[1] D. Premack, G. Woodruff, Does the chimpanzee have a theory of mind?,


Behavioral and Brain Sciences 1 (1978) 515–526.

[2] N. C. Rabinowitz, F. Perbet, H. F. Song, C. Zhang, S. M. A. Eslami,


M. Botvinick, Machine theory of mind, CoRR abs/1802.07740. arXiv:
1802.07740.
URL http://arxiv.org/abs/1802.07740
[3] D. C. Dennett, The Intentional Stance, The MIT Press, Cambridge, MA,
1987.
[4] Board Game Arena, Board game arena: Play board games online!,
https://en.boardgamearena.com/#!gamepanel?game=hanabi, [Online;
accessed 5-December-2018] (2018).
[5] Keldon, http://keldon.net/hanabi/ (2018).

[6] M. Campbell, A. J. Hoane Jr, F.-h. Hsu, Deep blue, Artificial intelligence
134 (1-2) (2002) 57–83.
[7] J. Schaeffer, R. Lake, P. Lu, M. Bryant, Chinook the world man-machine
checkers champion, AI Magazine 17 (1) (1996) 21.

[8] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den


Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot,
S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lilli-
crap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the
game of Go with deep neural networks and tree search, Nature 529 (7587)
(2016) 484–489.
[9] G. Tesauro, Temporal difference learning and TD-Gammon, Communica-
tions of the ACM 38 (3).
[10] M. Moravčı́k, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis,
K. Waugh, M. Johanson, M. Bowling, Deepstack: Expert-level artificial
intelligence in heads-up no-limit poker, Science 358 (6362).
[11] N. Brown, T. Sandholm, Superhuman AI for heads-up no-limit poker: Li-
bratus beats top professionals, Science 360 (6385).

[12] N. Burch, M. Johanson, M. Bowling, Solving imperfect information games
using decomposition, in: Proceedings of the Twenty-Eighth AAAI Confer-
ence on Artificial Intelligence, 2014, pp. 602–608.
URL http://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/view/
8407

[13] N. Burch, Time and space: Why imperfect information games are hard,
Ph.D. thesis, University of Alberta, Alberta, Canada (2017).
[14] J. Zamiell, Github - zamiell/hanabi-conventions: Hanabi conven-
tions for the hyphen-ated group, https://github.com/Zamiell/
hanabi-conventions, [Online; accessed 5-December-2018] (2018).

[15] C. Cox, J. D. Silva, P. Deorsey, F. H. J. Kenter, T. Retter, J. Tobin, How to


make the perfect fireworks display: Two strategies for Hanabi, Mathematics
Magazine 88 (5) (2015) 323–336.
[16] M. Bowling, P. McCracken, Coordination and adaptation in impromptu
teams, in: Proceedings of the Twentieth National Conference on Artificial
Intelligence (AAAI), 2005, pp. 53–58.
[17] P. Stone, G. A. Kaminka, S. Kraus, J. S. Rosenschein, Ad-hoc autonomous
agent teams: Collaboration without pre-coordination, in: Proceedings of
the Twenty-Fourth Conference on Artificial Intelligence, 2010.

[18] D. S. Bernstein, R. Givan, N. Immerman, S. Zilberstein, The complexity of


decentralized control of markov decision processes, Math. Oper. Res. 27 (4)
(2002) 819–840. doi:10.1287/moor.27.4.819.297.
URL https://doi.org/10.1287/moor.27.4.819.297
[19] J.-F. Baffier, M.-K. Chiu, Y. Diez, M. Korman, V. Mitsou, A. Renssen,
M. Roeloffzen, Y. Uno, Hanabi is NP-hard, even for cheaters who look at
their cards, Theoretical Computer Science 675 (2017) 43–55.
[20] J. N. Foerster, Y. M. Assael, N. de Freitas, S. Whiteson, Learning
to communicate with deep multi-agent reinforcement learning, CoRR
abs/1605.06676. arXiv:1605.06676.
URL http://arxiv.org/abs/1605.06676
[21] M. Lewis, D. Yarats, Y. Dauphin, D. Parikh, D. Batra, Deal or no deal?
end-to-end learning of negotiation dialogues, in: In Proceedings of the 2017
Conference on Empirical Methods in Natural Language Processing, 2017,
pp. 2433–2443, association for Computational Linguistics.

[22] K. Cao, A. Lazaridou, M. Lanctot, J. Z. Leibo, K. Tuyls, S. Clark, Emer-


gent communication through negotiation, CoRR abs/1804.03980. arXiv:
1804.03980.
URL http://arxiv.org/abs/1804.03980

[23] H. P. Grice, Logic and conversation, in: P. Cole, J. L. Morgan (Eds.),
Syntax and semantics, Vol. 3, New York: Academic Press, 1975.
[24] S. T. Piantadosi, H. Tily, E. Gibson, The communicative func-
tion of ambiguity in language, Cognition 122 (3) (2012) 280 – 291.
doi:https://doi.org/10.1016/j.cognition.2011.10.004.
URL http://www.sciencedirect.com/science/article/pii/
S0010027711002496
[25] M. C. Frank, N. D. Goodman, Predicting pragmatic reasoning in lan-
guage games, Science 336 (6084) (2012) 998–998. arXiv:http://
science.sciencemag.org/content/336/6084/998.full.pdf, doi:10.
1126/science.1218633.
URL http://science.sciencemag.org/content/336/6084/998
[26] M. G. Bellemare, Y. Naddaf, J. Veness, M. Bowling, The Arcade Learn-
ing Environment: An evaluation platform for general agents, Journal of
Artificial Intelligence Research 47 (2013) 253–279.
[27] D. Lewis, Convention: A philosophical study, John Wiley & Sons, 2008.
[28] M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht,
M. Bowling, Revisiting the Arcade Learning Environment: Evaluation pro-
tocols and open problems for general agents, Journal of Artificial Intelli-
gence Research 61 (2018) 523–562.
[29] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman,
J. Tang, W. Zaremba, Openai gym, CoRR abs/1606.01540. arXiv:
1606.01540.
URL http://arxiv.org/abs/1606.01540
[30] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley,
D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement
learning, in: Proceedings of the 33rd International Conference on Machine
Learning (ICML), 2016, pp. 1928–1937.
[31] B. Wymann, C. Dimitrakakis, A. Sumner, E. Espié, C. Guionneau, Torcs :
The open racing car simulator (2015).
[32] C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler,
A. Lefrancq, S. Green, V. Valdés, A. Sadik, J. Schrittwieser, K. Anderson,
S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis,
S. Legg, S. Petersen, Deepmind lab, CoRR abs/1612.03801. arXiv:1612.
03801.
URL http://arxiv.org/abs/1612.03801
[33] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward,
Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, K. Kavukcuoglu, IM-
PALA: scalable distributed deep-rl with importance weighted actor-learner
architectures, CoRR abs/1802.01561.

[34] M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G.
Castaneda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman,
N. Sonnerat, T. Green, L. Deason, J. Z. Leibo, D. Silver, D. Hassabis,
K. Kavukcuoglu, T. Graepel, Human-level performance in first-person mul-
tiplayer games with population-based deep reinforcement learning, CoRR
abs/1807.01281.
[35] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue,
A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando,
K. Kavukcuoglu, Population based training of neural networks, CoRR
abs/1711.09846.
[36] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney,
D. Horgan, B. Piot, M. G. Azar, D. Silver, Rainbow: Combining improve-
ments in deep reinforcement learning, CoRR abs/1710.02298.
[37] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.
Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski,
et al., Human-level control through deep reinforcement learning, Nature
518 (7540) (2015) 529–533.
[38] P. S. Castro, S. Moitra, C. Gelada, S. Kumar, M. G. Bellemare, Dopamine:
A research framework for deep reinforcement learning, arXiv.
[39] M. G. Bellemare, W. Dabney, R. Munos, A distributional perspective on
reinforcement learning, in: Proceedings of the International Conference on
Machine Learning, 2017.
[40] T. Schaul, J. Quan, I. Antonoglou, D. Silver, Prioritized experience replay,
in: International Conference on Learning Representations, 2016.
[41] J. N. Foerster, F. Song, E. Hughes, N. Burch, I. Dunning, S. Whiteson,
M. Botvinick, M. Bowling, Bayesian action decoder for deep multi-agent
reinforcement learning, CoRR abs/1811.01458.
[42] A. O’Dwyer, Github - quuxplusone/hanabi: Framework for writing bots
that play hanabi, https://github.com/Quuxplusone/Hanabi, [Online;
accessed 19-September-2018] (2018).
[43] J. Wu, Github - wuthefwasthat/hanabi.rs: Hanabi simulation in rust,
https://github.com/WuTheFWasThat/hanabi.rs, [Online; accessed 29-
November-2018] (2018).
[44] D. Wu, Github - lightvector/fireflower: A rewrite of hanabi-bot in
scala, https://github.com/lightvector/fireflower, [Online; accessed
3-December-2018] (2018).
[45] H. Osawa, Solving Hanabi: Estimating hands by opponent’s actions in
cooperative game with incomplete information, in: Proceedings on the 2015
AAAI Workshop on Computer Poker and Imperfect Information, 2015, pp.
37–43.

[46] M. J. H. van den Bergh, A. Hommelberg, W. A. Kosters, F. M. Spieksma,
Aspects of the cooperative card game Hanabi, in: BNAIC 2016: Artificial
Intelligence. 28th Benelux Conference on Artificial Intelligence, Vol. 765 of
CCIS, Springer, Cham, 2016, pp. 93–105.
[47] J. Walton-Rivers, P. R. Williams, R. Bartle, D. Perez-Liebana, S. M. Lucas,
Evaluating and modelling Hanabi-playing agents, in: Proceedings of the
IEEE Conference on Evolutionary Computation (2017), 2017.
[48] B. Bouzy, Playing hanabi near-optimally, in: Advances in Computer
Games: ACG 2017, Vol. 10664 of LNCS, Springer, Cham, 2017, pp. 51–62.
[49] M. Eger, C. Martens, Practical specification of belief manipulation in
games, in: Proceedings, The Thirteenth AAAI Conference on Artificial
Intelligence and Interactive Digital Entertainment (AIIDE-17), 2017.
[50] E. Pacuit, O. Roy, Epistemic foundations of game theory, in: E. N. Zalta
(Ed.), The Stanford Encyclopedia of Philosophy, summer 2017 Edition,
Metaphysics Research Lab, Stanford University, 2017.
[51] J. Walton-Rivers, P. Williams, CIG 2018 Hanabi fireworks com-
petition, http://hanabi.aiclash.com/, https://project.dke.
maastrichtuniversity.nl/cig2018/competitions/, retrieved Sept.
27th, 2018 (2018).
[52] R. Sutton, A. Barto, Reinforcement Learning: An Introduction, 2nd Edi-
tion, MIT Press, 2018.
[53] M. L. Littman, Markov games as a framework for multi-agent reinforcement
learning, in: In Proceedings of the Eleventh International Conference on
Machine Learning, Morgan Kaufmann, 1994, pp. 157–163.
[54] L. Matignon, G. J. Laurent, N. L. Fort-Piat, Independent reinforcement
learners in cooperative Markov games: a survey regarding coordination
problems, The Knowledge Engineering Review 27 (01) (2012) 1–31.
[55] M. Bowling, M. Veloso, Simultaneous adversarial multi-robot learning, in:
IJCAI, Vol. 3, 2003, pp. 699–704.
[56] L. Panait, S. Luke, Cooperative multi-agent learning: The state of the
art, Autonomous Agents and Multi-Agent Systems 11 (3) (2005) 387–434.
doi:10.1007/s10458-005-2631-2.
URL http://dx.doi.org/10.1007/s10458-005-2631-2
[57] L. Busoniu, R. Babuska, B. D. Schutter, A comprehensive survey of mul-
tiagent reinforcement learning, IEEE Transaction on Systems, Man, and
Cybernetics, Part C: Applications and Reviews 38 (2) (2008) 156–172.
[58] A. Nowé, P. Vrancx, Y.-M. D. Hauwere, Game theory and multi-agent
reinforcement learning, in: Reinforcement Learning: State-of-the-Art,
Springer, 2012, Ch. 14, pp. 441–470.

[59] D. Bloembergen, K. Tuyls, D. Hennes, M. Kaisers, Evolutionary dynamics
of multi-agent learning: A survey, J. Artif. Intell. Res. (JAIR) 53 (2015)
659–697.
[60] P. Hernandez-Leal, B. Kartal, M. E. Taylor, Is multiagent deep rein-
forcement learning the answer or the question? a brief survey, CoRR
abs/1810.05587. arXiv:1810.05587.
URL http://arxiv.org/abs/1810.05587
[61] J. Baxter, A. Tridgell, L. Weaver, Knightcap: a chess program that learns
by combining TD(λ) with game-tree search, in: Proceedings of the
15th International Conference on Machine Learning, Morgan Kaufmann,
1998, pp. 28–36.
[62] J. Veness, D. Silver, A. Blair, W. Uther, Bootstrapping from game tree
search, in: Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams,
A. Culotta (Eds.), Advances in Neural Information Processing Systems 22,
Curran Associates, Inc., 2009, pp. 1937–1945.

[63] L. Kocsis, C. Szepesvári, Bandit based Monte-Carlo planning, 2006, pp.


282–293.
[64] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den
Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot,
S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lilli-
crap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the
game of Go with deep neural networks and tree search., Nature 529 (2016)
484–489.
[65] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez,
T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui,
L. Sifre, G. van den Driessche, T. Graepel, D. Hassabis, Mastering the
game of Go without human knowledge, Nature 550 (2017) 354–359.
[66] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez,
M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan,
D. Hassabis, Mastering chess and shogi by self-play with a general reinforce-
ment learning algorithm, CoRR abs/1712.01815. arXiv:1712.01815.
URL http://arxiv.org/abs/1712.01815
[67] M. Zinkevich, M. Johanson, M. Bowling, C. Piccione, Regret minimization
in games with incomplete information, in: Advances in Neural Information
Processing Systems 20 (NIPS 2007), 2008.

[68] M. Bowling, N. Burch, M. Johanson, O. Tammelin, Heads-up Limit


Hold’em Poker is solved, Science 347 (6218) (2015) 145–149.
[69] A. Lazaridou, A. Peysakhovich, M. Baroni, Multi-agent cooperation and
the emergence of (natural) language, CoRR abs/1612.07182. arXiv:1612.07182.
URL http://arxiv.org/abs/1612.07182
[70] S. Sukhbaatar, A. Szlam, R. Fergus, Learning multiagent communication
with backpropagation, CoRR abs/1605.07736. arXiv:1605.07736.
URL http://arxiv.org/abs/1605.07736
[71] H. de Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle, A. Courville,
Guesswhat?! visual object discovery through multi-modal dialogue, in:
IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[72] C. Yeh, H. Lin, Automatic bridge bidding using deep reinforcement learn-
ing, CoRR abs/1607.03290. arXiv:1607.03290.
URL http://arxiv.org/abs/1607.03290
[73] P. J. Gmytrasiewicz, P. Doshi, A framework for sequential planning in
multi-agent settings, JAIR 24.
[74] Y. Shoham, K. Leyton-Brown, Multiagent Systems: Algorithmic, Game-
Theoretic, and Logical Foundations, Cambridge University Press, 2009.
[75] W. Yoshida, R. J. Dolan, K. J. Friston, Game theory of mind, PLOS 4 (12).
[76] J. R. Wright, K. Leyton-Brown, Predicting human behavior in unrepeated,
simultaneous-move games, Games and Economic Behavior 106 (2017) 16–
37.
[77] J. Hartford, J. R. Wright, K. Leyton-Brown, Deep learning for predict-
ing human strategic behavior, in: Thirtieth Annual Conference on Neural
Information Processing Systems (NIPS 2016), 2016.
[78] H. He, J. Boyd-Graber, K. Kwok, H. D. III, Opponent modeling in deep
reinforcement learning, in: Proceedings of the International Conference on
Machine Learning (ICML), 2016.
[79] S. V. Albrecht, P. Stone, Autonomous agents modelling other agents: A
comprehensive survey and open problems, Artificial Intelligence 258 (2018)
66–95.
[80] M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Pero-
lat, D. Silver, T. Graepel, A unified game-theoretic approach to multiagent
reinforcement learning, in: Advances in Neural Information Processing Sys-
tems, 2017.
[81] J. N. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, I. Mor-
datch, Learning with opponent-learning awareness, CoRR abs/1709.04326.
arXiv:1709.04326.
URL http://arxiv.org/abs/1709.04326
[82] A. Grover, M. Al-Shedivat, J. K. Gupta, Y. Burda, H. Edwards, Learn-
ing policy representations in multiagent systems, in: Proceedings of the
International Conference on Machine Learning (ICML), 2018.

Appendix A.

Appendix A.1. FireFlower Details


Implemented conventions include the following:

• Hints generally indicate cards to play, generally newer cards first.
• Hints "chain" onto other hints. For example, if A hints to B that a playable
red 2 is red, then B might infer that it is indeed the red 2 and hint back
to A that a red 3 in A's hand is red, expecting A to believe it is a red 3
and play it after B plays its red 2.
• When discarding, discard provably useless cards, otherwise the oldest "unprotected"
card (see the sketch after this list).
• Hints about the oldest unprotected card "protect" that card, with many
exceptions where the hint instead means play.
• Hints about garbage cards indicate that the cards older than the hinted
card should be "protected".
• Deliberately discarding a known playable card signals to the partner that
they have a copy of that card (with enough convention-based specificity
about its location that they may play it with absolutely no actual hints).
• In many cases, hints about cards that have already been hinted change
the beliefs about those cards in various ways such that the partner will
likely change what they do (in accordance with the broader heuristic that
one should usually give a hint if and only if they would like to change the
partner's future behavior).
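As a rough illustration of how one such convention can be encoded, the sketch below implements the discard rule above; the card-knowledge representation (per-slot sets of remaining possibilities) and the notion of "protected" slots are simplified, hypothetical stand-ins for FireFlower's actual state.

def choose_discard(hand_knowledge, fireworks, protected):
    """Prefer a provably useless card, otherwise the oldest unprotected card.

    hand_knowledge[i] is the set of (color, rank) pairs slot i could still hold,
    fireworks[color] is the highest rank already played in that color, and
    protected is the set of slots the partner has protected via hints."""
    def provably_useless(candidates):
        # Useless if every remaining possibility is already on the fireworks.
        return all(rank <= fireworks.get(color, 0) for color, rank in candidates)

    for slot, candidates in enumerate(hand_knowledge):
        if provably_useless(candidates):
            return slot
    unprotected = [s for s in range(len(hand_knowledge)) if s not in protected]
    return unprotected[0] if unprotected else 0      # oldest = lowest slot index here


# Example: slot 0 can only be a red 1 or red 2, both already played, so discard it.
fireworks = {"R": 2, "Y": 0, "G": 1, "W": 0, "B": 0}
hand = [{("R", 1), ("R", 2)}, {("Y", 3), ("B", 3)}, {("G", 2)}]
print(choose_discard(hand, fireworks, protected={1}))   # -> 0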

Appendix A.2. Conditional Probability Tables


The tables below show conditional probability summaries of the learned
policies from different runs of ACHA and Rainbow in the two player game. See
Section 5 for a discussion of these policy summaries.
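These summaries are obtained by counting consecutive action pairs over many self-play games and normalizing each row. A minimal sketch of such a computation is given below; the per-game action logs and label strings are hypothetical.

from collections import Counter, defaultdict

def conditional_action_table(episodes):
    """Estimate P%(a_{t+1} | a_t) from logged self-play games.

    episodes is a hypothetical list of per-game action sequences, where each
    action is a label such as "Discard 1", "Play 3", or "Hint R"."""
    pair_counts = defaultdict(Counter)
    for actions in episodes:
        for a_t, a_next in zip(actions, actions[1:]):
            pair_counts[a_t][a_next] += 1
    table = {}
    for a_t, counts in pair_counts.items():
        total = sum(counts.values())
        table[a_t] = {a: 100.0 * c / total for a, c in counts.items()}
    return table


# Two short, made-up games:
games = [["Hint 1", "Play 1", "Discard 5"], ["Hint 1", "Play 1", "Play 2"]]
print(conditional_action_table(games)["Hint 1"])   # {'Play 1': 100.0}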

P%(a_{t+1} | a_t)   Discard          Play             Hint Colour      Hint Rank
a_t \ a_{t+1}       1  2  3  4  5    1  2  3  4  5    R  Y  G  W  B    1  2  3  4  5
Discard 1 4 4 3 1 10 2 1 1 0 1 1 6 2 2 1 12 14 11 12 11
2 2 6 5 2 12 1 0 0 0 0 1 4 2 2 1 14 18 13 11 5
3 2 6 11 2 13 1 1 1 1 2 1 3 2 2 1 18 17 11 4 1
4 1 2 5 4 19 0 1 0 0 0 1 6 1 6 2 18 14 10 6 2
5 1 1 1 0 30 1 0 0 0 0 0 3 2 2 2 10 12 16 13 6
Play 1 4 5 3 1 18 4 2 1 1 2 0 2 1 1 1 13 10 10 11 9
2 4 6 5 1 22 3 2 1 0 1 0 2 1 1 0 11 12 11 11 6
3 4 5 2 1 26 4 2 2 1 2 0 3 1 1 1 7 6 8 14 10
4 3 3 4 0 0 3 5 1 1 1 0 3 0 0 0 49 10 1 12 3
5 2 4 5 1 26 2 1 0 0 0 0 3 1 1 0 12 14 12 10 5
Hint R 6 31 25 9 3 0 0 0 0 1 9 4 2 5 2 1 2 0 1 1

Colour Y 19 31 12 5 8 1 0 1 0 0 1 7 2 3 1 2 3 2 2 2
G 20 30 18 1 9 2 0 1 0 0 0 4 3 1 1 0 2 2 2 2
W 18 18 16 10 15 0 0 0 0 0 1 4 1 8 0 2 2 1 2 1
B 12 8 11 6 37 2 0 0 0 0 0 2 2 4 3 1 1 3 3 4
Hint 1 3 3 6 8 1 8 9 0 8 50 0 0 0 0 0 3 0 1 0 0
Rank 2 3 6 1 0 0 8 23 0 0 52 0 0 0 0 0 3 1 0 0 0
3 1 5 2 0 1 11 20 10 0 47 0 0 0 0 0 0 1 1 0 0
4 1 4 3 1 1 13 22 6 1 37 0 0 0 0 0 6 2 1 1 0
5 2 4 0 0 2 29 19 7 2 28 0 1 0 0 0 0 1 1 1 2
P% (at ) 3 5 4 2 14 5 7 2 1 17 0 2 1 1 1 9 8 7 7 4

Table A.2: Conditional action probabilities for a first two player Rainbow agent.
P%(a_{t+1} | a_t)   Discard          Play             Hint Colour      Hint Rank
a_t \ a_{t+1}       1  2  3  4  5    1  2  3  4  5    R  Y  G  W  B    1  2  3  4  5
Discard 1 4 1 1 1 22 1 1 1 0 1 1 2 3 2 0 10 21 9 10 7
2 2 3 1 0 19 1 1 1 1 1 2 2 3 3 0 9 14 19 11 9
3 4 1 6 1 19 2 0 0 0 2 3 3 4 3 3 21 11 9 2 5
4 3 1 1 2 29 1 0 1 0 0 2 3 3 6 2 12 15 11 5 4
5 1 1 0 0 32 1 0 0 0 0 0 3 2 4 0 10 13 14 11 5
Play 1 4 2 1 1 24 3 2 2 0 3 1 1 1 1 0 11 12 10 11 10
2 4 2 1 1 28 3 2 1 0 2 0 1 1 1 0 12 14 12 8 7
3 4 2 1 0 23 4 2 1 0 4 1 2 1 2 0 11 13 7 12 9
4 3 2 0 0 8 4 2 1 0 6 0 0 1 1 0 35 23 5 5 4
5 3 1 1 1 33 2 2 1 0 1 0 2 2 1 0 11 15 12 8 5
Hint R 15 23 8 9 5 1 0 0 0 0 11 0 7 4 4 3 1 1 1 5

Colour Y 28 15 11 9 13 1 1 0 0 1 1 4 3 1 0 2 2 3 2 3
G 22 19 5 4 17 1 1 1 0 1 3 2 8 2 0 2 1 2 2 6
W 9 7 2 4 60 1 1 1 0 1 2 2 3 4 0 1 1 1 1 2
B 17 7 21 17 4 1 0 0 1 5 7 0 0 1 10 1 4 1 2 0
Hint 1 2 3 2 6 1 9 11 6 5 54 0 0 0 0 0 0 0 0 0 0
Rank 2 8 2 1 0 1 7 21 0 0 48 0 0 0 0 0 8 3 0 0 0
3 2 2 1 0 1 12 18 11 0 48 0 0 0 0 0 2 1 1 0 0
4 1 1 0 0 1 24 14 9 1 43 0 0 0 0 0 0 0 1 1 1
5 8 4 4 2 2 15 19 7 2 23 0 1 2 0 0 2 1 1 3 3
P% (at ) 4 2 1 1 19 5 7 3 1 17 0 1 2 2 0 8 9 8 6 4

Table A.3: Conditional action probabilities for second two player Rainbow agent.
P%(a_{t+1} | a_t)   Discard          Play             Hint Colour      Hint Rank
a_t \ a_{t+1}       1  2  3  4  5    1  2  3  4  5    R  Y  G  W  B    1  2  3  4  5
Discard 1 9 1 2 1 15 1 1 0 0 0 3 2 2 0 0 19 19 11 9 5
2 8 4 2 1 16 1 0 0 0 0 6 2 2 1 1 16 20 9 8 2
3 6 1 2 0 24 1 1 0 0 1 3 3 2 0 1 12 16 14 7 4
4 7 1 2 2 27 1 1 0 1 1 3 3 2 1 1 14 13 10 7 3
5 2 0 1 0 32 1 0 0 0 0 3 3 3 0 0 11 12 14 13 4
Play 1 9 1 2 1 21 4 3 1 1 2 1 1 0 0 1 10 12 11 12 8
2 4 1 2 1 25 5 5 1 1 3 1 3 1 0 1 6 7 9 13 12
3 8 2 2 1 22 2 2 1 1 3 1 1 0 0 0 17 13 10 10 4
4 5 1 1 2 2 6 3 2 1 8 0 1 0 0 0 33 13 1 15 3
5 7 1 1 1 29 2 1 1 0 1 2 2 1 0 0 12 15 11 10 4
Hint R 49 15 20 3 4 0 0 0 0 0 0 1 0 0 1 0 3 1 0 1

Colour Y 35 5 8 3 12 3 4 1 1 0 1 4 2 0 3 2 2 2 5 7
G 12 4 7 3 55 1 1 0 0 0 1 3 3 1 2 2 1 1 2 1
W 14 13 8 9 27 0 0 0 1 0 2 0 2 12 1 0 2 4 2 1
B 31 2 13 3 12 0 2 1 1 1 2 3 3 1 14 2 1 2 2 3
Hint 1 4 2 5 8 2 10 0 9 6 47 0 0 0 0 0 4 0 1 0 0
Rank 2 9 0 1 0 0 17 0 15 1 48 0 0 0 0 0 4 2 0 0 0
3 2 1 1 0 0 14 9 19 0 50 0 0 0 0 0 1 1 1 0 0
4 3 3 2 1 1 19 14 8 1 38 0 0 0 0 0 5 3 1 2 1
5 2 1 1 0 2 31 14 8 1 30 1 1 0 0 1 1 0 1 2 3
P% (at ) 7 1 2 1 16 7 3 5 1 17 1 2 1 0 0 9 9 7 7 3

Table A.4: Conditional action probabilities for third two player Rainbow agent.
P%(a_{t+1} | a_t)   Discard          Play             Hint Colour      Hint Rank
a_t \ a_{t+1}       1  2  3  4  5    1  2  3  4  5    R  Y  G  W  B    1  2  3  4  5
Discard 1 3 4 7 3 6 4 3 1 0 1 6 6 5 6 6 11 7 7 5 9
2 4 5 7 2 6 4 2 2 0 0 6 6 5 6 5 12 6 8 5 9
3 4 4 9 2 7 5 3 1 0 0 5 6 5 6 5 10 6 9 5 8
4 4 4 7 3 12 4 3 1 0 0 4 5 3 4 6 11 8 6 5 9
5 1 2 4 1 9 4 3 1 42 0 2 2 2 2 2 3 6 7 5 2
Play 1 3 3 5 3 6 9 5 2 0 2 4 3 4 5 4 8 8 7 6 11
2 4 4 6 4 7 8 5 1 0 1 4 4 4 5 3 9 7 8 5 11
3 3 3 4 2 7 5 3 1 0 1 5 4 4 4 4 12 15 8 4 9
4 4 4 5 4 9 7 4 1 0 1 4 5 4 5 5 9 9 9 5 5
5 5 5 8 5 10 13 7 1 0 0 3 3 3 3 3 6 7 6 4 5
Hint R 4 5 3 38 1 16 14 1 0 0 2 1 1 1 2 3 3 2 2 3

Colour Y 4 3 3 45 1 15 12 1 0 0 1 2 1 1 1 3 3 2 1 2
G 5 4 2 37 1 19 14 1 0 0 1 1 2 1 2 3 3 2 1 2
W 3 3 3 44 1 15 13 1 0 0 2 1 1 1 1 3 2 2 1 2
B 3 5 1 44 1 15 12 1 0 0 1 1 1 1 2 3 2 2 2 2
Hint 1 0 0 0 0 0 9 6 9 12 64 0 0 0 0 0 0 0 0 0 0
Rank 2 3 4 2 1 0 2 4 4 22 42 0 0 0 1 0 8 2 1 0 1
3 2 3 1 16 1 5 6 3 26 24 1 1 1 0 1 6 3 1 0 1
4 1 1 0 3 0 2 2 2 55 31 0 0 0 0 0 0 0 0 0 1
5 2 2 2 18 0 6 3 1 55 4 1 0 1 1 1 0 0 1 1 1
P% (at ) 3 3 4 10 5 8 5 2 10 9 3 3 2 3 3 7 6 5 3 5

Table A.5: Conditional action probabilities for first two player ACHA agent.
P%(a_{t+1} | a_t)   Discard          Play             Hint Colour      Hint Rank
a_t \ a_{t+1}       1  2  3  4  5    1  2  3  4  5    R  Y  G  W  B    1  2  3  4  5
Discard 1 2 2 0 1 31 6 0 0 0 0 9 3 2 3 4 4 2 6 8 16
2 2 2 1 1 33 2 1 1 1 1 8 2 3 4 4 2 2 3 8 17
3 0 2 5 2 0 53 3 4 1 2 5 1 5 2 1 2 2 1 1 10
4 0 0 1 1 0 0 71 0 0 0 4 2 1 1 2 2 2 1 2 11
5 2 1 0 1 31 1 0 0 0 0 11 3 2 6 4 5 3 6 8 16
Play 1 3 3 1 1 27 6 2 1 0 2 7 3 3 3 3 7 2 5 8 12
2 4 3 1 1 27 5 2 1 0 2 6 4 2 4 3 4 3 5 9 13
3 3 2 1 1 25 4 1 1 0 1 8 4 5 4 3 8 3 5 9 12
4 4 2 2 2 31 3 1 0 0 1 7 3 3 7 4 5 3 4 8 11
5 2 1 1 1 33 3 2 0 0 1 11 4 3 5 4 5 3 5 8 9
Hint R 0 0 0 0 0 0 0 0 0 97 0 0 0 0 0 0 0 0 0 1

Colour Y 0 0 0 1 0 0 94 0 0 0 1 0 0 0 0 1 0 0 0 1
G 0 0 0 0 0 0 0 0 99 0 0 0 0 0 0 0 0 0 0 0
W 0 0 0 0 0 0 0 0 98 0 0 0 0 0 0 0 0 0 0 0
B 0 0 0 1 0 0 95 0 0 0 0 0 0 0 0 1 0 0 0 1
Hint 1 0 0 0 1 0 23 0 73 0 0 1 0 0 0 0 0 0 0 0 1
Rank 2 0 0 0 0 0 47 0 0 0 44 1 1 0 1 1 2 0 1 0 1
3 28 4 1 1 5 43 0 1 0 1 1 1 1 1 1 2 1 2 3 1
4 11 32 1 2 9 17 0 3 0 0 1 2 2 3 2 5 1 3 4 2
5 0 0 0 0 0 0 0 0 0 97 0 0 0 0 0 0 0 0 0 1
P% (at ) 3 3 1 1 19 6 6 3 5 15 6 2 2 3 3 4 2 4 6 8

Table A.6: Conditional action probabilities for second two player ACHA agent.
