The Hanabi Challenge: A New Frontier for AI Research
Abstract
From the early days of computing, games have been important testbeds for
studying how well machines can do sophisticated decision making. In recent
years, machine learning has made dramatic advances with artificial agents reach-
ing superhuman performance in challenge domains like Go, Atari, and some
variants of poker. As with their predecessors of chess, checkers, and backgam-
mon, these game domains have driven research by providing sophisticated yet
well-defined challenges for artificial intelligence practitioners. We continue this
tradition by proposing the game of Hanabi as a new challenge domain with novel
problems that arise from its combination of purely cooperative gameplay and
imperfect information in a two to five player setting. In particular, we argue
that Hanabi elevates reasoning about the beliefs and intentions of other agents
to the foreground. We believe developing novel techniques capable of imbuing
artificial agents with such theory of mind will not only be crucial for their success
in Hanabi, but also in broader collaborative efforts, and especially those with
human partners. To facilitate future research, we introduce the open-source
Hanabi Learning Environment, propose an experimental framework for the re-
search community to evaluate algorithmic advances, and assess the performance
of current state-of-the-art techniques.
Keywords: multi-agent learning, challenge paper, reinforcement learning,
games, theory of mind, communication, imperfect information, cooperative
∗ Equal Contribution
Email addresses: nbard@google.com (Nolan Bard), jakob.foerster@cs.ox.ac.uk
(Jakob N. Foerster), chandar@google.com (Sarath Chandar), burchn@google.com (Neil
Burch), lanctot@google.com (Marc Lanctot), songf@google.com (H. Francis Song),
eparisot@cs.cmu.edu (Emilio Parisotto), vdumoulin@google.com (Vincent Dumoulin),
smoitra@google.com (Subhodeep Moitra), edwardhughes@google.com (Edward Hughes),
iaindunning@gmail.com (Iain Dunning), shibl@google.com (Shibl Mourad),
hugolarochelle@google.com (Hugo Larochelle), bellemare@google.com (Marc G.
Bellemare), bowlingm@google.com (Michael Bowling)
1 University of Oxford, work done at DeepMind
2 DeepMind
3 Google Brain
4 Carnegie Mellon University, work done at Google Brain
1. Introduction
5 Dennett uses the phrase intentional stance to refer to this “strategy” for explanation and
prediction [3].
rank or colour. Importantly, performing a hint action consumes the limited re-
source of information tokens, making it impossible to fully resolve each player’s
uncertainty about the cards they hold based on this grounded information alone.
To overcome this limitation, successful play involves communicating extra infor-
mation implicitly through the choice of actions themselves, which are observable
by all players. In Section 2 we provide a more detailed description of the game
and what human gameplay looks like.
Hanabi is different from the adversarial two-player zero-sum games where
computers have reached super-human skill, e.g., chess [6], checkers [7], go [8],
backgammon [9] and two-player poker [10, 11]. In those games, agents typically
compute an equilibrium policy (or equivalently, a strategy) such that no sin-
gle player can improve their utility by deviating from the equilibrium. While
two-player zero-sum games can have multiple equilibria, different equilibria are
interchangeable: each player can play their part of different equilibrium profiles
without impacting their utility. As a result, agents can achieve a meaningful
worst-case performance guarantee in these domains by finding any equilibrium
policy. However, since Hanabi is neither (exclusively) two-player nor zero-sum,
the value of an agent’s policy depends critically on the policies used by its team-
mates. Even if all players manage to play according to the same equilibrium,
there can be multiple locally optimal equilibria that are relatively inferior. For
algorithms that iteratively train independent agents, such as those commonly
used in the multi-agent reinforcement learning literature, these inferior equilib-
ria can be particularly difficult to escape.
The presence of imperfect information in Hanabi creates another challenging
dimension of complexity for AI algorithms. As has been observed in domains like
poker, imperfect information entangles how an agent should behave across mul-
tiple observed states [12, 13]. In Hanabi, this manifests in the communication
protocol 6 between players, where the efficacy of any given protocol depends on
the entire scheme rather than how players communicate in a particular observed
situation. That is, how the other players will respond to a chosen signal will de-
pend upon what other situations use the same signal. Due to this entanglement,
the type of single-action exploration techniques common in reinforcement learn-
ing (e.g., ε-greedy, entropy regularization) can incorrectly evaluate the utility of
such exploration steps as they ignore their holistic impact.
Hanabi’s cooperative nature also complicates the question of what kind of
policy practitioners should seek. One challenge is to discover a policy for the
entire team that has high utility. Most of the prior research on Hanabi has
6 In pure signalling games where actions are purely communicative, policies are often re-
ferred to as communication protocols. Though Hanabi is not such a pure signalling game,
when we want to emphasize the communication properties of an agent’s policy we will still
refer to its communication protocol. We will use the word convention to refer to the parts
of a communication protocol or policy that interrelate, such as the situations when a signal
is used and how a teammate responds to that signal. Technically, these can be thought of as
constraints on the policy to enact the convention. Examples of human conventions in Hanabi
will be discussed in Section 2.
focused on this challenge, which we refer to as the self-play setting. However, it
is impractical for human Hanabi players to expect others to strictly abide by a
preordained policy. Not only would it be hard to convey and memorize nuanced
policies, but humans also routinely play with ad-hoc teams that may have players
of different skill levels. Even without agreeing on comprehensive conventions,
humans are still able to play successfully in this ad-hoc setting. Unfortunately,
unlike the aforementioned two-player zero-sum setting, good policies for self-
play have no meaningful worst-case performance guarantee in the ad-hoc team
setting.
Humans approach Hanabi differently than current learning algorithms. In-
stead of searching for conventions in advance to produce complete policies, hu-
mans seem to establish conventions to search for their next action, relying on
others to reason about their intent and view of the game to infer salient signals.
These basic conventions for search tend to be relatively intuitive: ensure the
next player has enough information to make a useful move or has an informa-
tion token available so they can provide a hint. Humans combine these basic
conventions for search with pragmatic reasoning, enabling more effective use of
grounded messages (e.g. implicitly signalling a card is playable). Pragmatic
reasoning thus represents an implicit convention for how the vocabulary of the
game’s grounded messages should be used and interpreted when referring to a
specific card. Humans also rely on theory of mind for Hanabi's more advanced
"finesse" plays. These plays often appear irrational on the surface, which makes
them vulnerable to misinterpretation, but they cue other players to infer additional
information by reasoning about what the acting player must have observed for
such a move to be rational. Relying on simple conventions for search,
pragmatic reasoning, and theory of mind makes it possible for humans to join
an ad-hoc team where complete policies are not agreed upon and memorized in
advance. This makes theory of mind central to Hanabi’s gameplay, particularly
in such an ad-hoc team setting.
The combination of cooperation, imperfect information, and limited com-
munication makes Hanabi ideal for investigating the importance of incorporating
theory of mind into AI agents. Whether it is for efficiency of communication or
for artificial agents to better understand and collaborate with human partners,
machine learning practitioners will need new techniques for designing agents
capable of reasoning about others. In Section 2 we describe the details of the
game and how humans approach it. In Section 3 we elaborate on the challenge it
poses for artificial intelligence and why theory of mind is expected to be critical
for effective play. We facilitate agent implementation and experimentation by
providing an open source code framework (Section 4.1). Furthermore, to pro-
mote consistent and detailed empirical evaluation in future research, we propose
an evaluation benchmark for both the self-play (Section 4.2) and ad-hoc (Sec-
tion 4.3) team settings. We evaluate the performance of current state-of-the-art
reinforcement learning methods in Section 5. Our results show that although
these learning techniques can achieve reasonable performance in self-play, they
generally fall short of the best known hand-coded agents (Section 5.3). More-
over, we show that these techniques tend to learn extremely brittle policies that
are unreliable for ad-hoc teams (Section 5.4). These results suggest that there
is still substantial room for technical advancements in both the self-play and ad-
hoc settings, especially as the number of players increases. Finally, we highlight
connections to prior work in Section 6.
Hanabi is a game for two to five players, best described as a type of coop-
erative solitaire. Each player holds a hand of four cards (or five, when playing
with two or three players). Each card depicts a rank (1 to 5) and a colour (red,
green, blue, yellow, and white); the deck (set of all cards) is composed of a
total of 50 cards, 10 of each colour: three 1s, two each of 2s, 3s, and 4s, and a
single 5. The goal of the game is to play cards so as to form five consecutively
ordered stacks, one for each colour, beginning with a card of rank 1 and ending
with a card of rank 5. What makes Hanabi special is that, unlike most card
games, players can only see their partners’ hands, and not their own.
Figure 1: Example of a four player Hanabi game from the point of view of player 0. Player 1
acts after player 0 and so on.
In Hanabi players take turns doing one of three actions: giving a hint, playing
a card from their hand, or discarding a card. We call the player whose turn it
is the active player.
Hints. On their turn, the active player can give a hint to any other player. A
hint consists of choosing a rank or colour, and indicating to another player all
of their cards which match the given rank or colour. Only ranks and colours
that are actually present in that player's hand can be hinted. For example, in Figure 1,
the active player may tell Player 2, “Your first and third cards are red.” or
“Your fourth card is a 3.” To make the game interesting, hints are in limited
supply. The game begins with the group owning eight information tokens, one of
which is consumed every time a hint is given. If no information tokens remain,
hints cannot be given and the player must instead play or discard.
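To make the hint rule concrete, the following minimal Python sketch (the helper and its data layout are illustrative, not part of any released implementation) computes which card positions a colour or rank hint touches and rejects hints that touch no card:

def hint_targets(hand, attribute, value):
    # Return the indices of cards in `hand` touched by a hint.
    # `hand` is a list of (colour, rank) pairs as seen by the hinting player;
    # `attribute` is either "colour" or "rank". A hint must touch at least one
    # card, mirroring the rule that only ranks and colours present in the hand
    # can be hinted.
    index = 0 if attribute == "colour" else 1
    matches = [i for i, card in enumerate(hand) if card[index] == value]
    if not matches:
        raise ValueError("illegal hint: no card in the hand matches")
    return matches

# Hinting "red" to a hand consistent with Player 2's cards in Figure 1
# touches the first and third cards.
print(hint_targets([("R", 5), ("B", 1), ("R", 2), ("Y", 3)], "colour", "R"))  # [0, 2]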
Discard. Whenever fewer than eight information tokens remain, the active
player can discard a card from their hand. The discarded card is placed face up
(along with any unsuccessfully played cards), visible to all players. Discarding
has two effects: the player draws a new card from the deck and an information
token is recovered.
Play. Finally, the active player may pick a card (known or unknown) from their
hand and attempt to play it. Playing a card is successful if the card is the next
in the sequence of its colour to be played. For example, in Figure 1 Player 2’s
action would be successful if they play their yellow 3 or their blue 1; in the
latter case forming the beginning of the blue stack. If the play is successful, the
card is placed on top of the corresponding stack. When a stack is completed
(the 5 is played) the players also receive a new information token (if they have
fewer than eight). The player can play a card even if they know nothing about
it; but if the play is unsuccessful, the card is discarded (without yielding an
information token) and the group loses one life, possibly ending the game. In
either case, a new card is drawn from the deck.
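The mechanics of resolving a play can be summarized in a few lines of Python. The sketch below is purely illustrative (the state representation is an assumption): a play succeeds when the card's rank is exactly one above its colour's stack, completing a stack refunds an information token, and a failed play costs a life.

def resolve_play(card, stacks, info_tokens, lives, max_info_tokens=8):
    # `card` is a (colour, rank) pair; `stacks` maps each colour to the rank of
    # its top card (0 if the stack is empty).
    colour, rank = card
    if rank == stacks[colour] + 1:      # next card in the colour's sequence
        stacks[colour] = rank
        if rank == 5 and info_tokens < max_info_tokens:
            info_tokens += 1            # completing a stack refunds a token
    else:
        lives -= 1                      # failed play: the card is discarded and a life is lost
    return stacks, info_tokens, lives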
Game Over. The game ends in one of three ways: when the group has successfully
played cards to complete all five stacks, when three lives have
been lost, or after a player draws the last card from the deck and every player
has taken one final turn. If the game ends before three lives are lost, the group
scores one point for each card in each stack, for a maximum of 25; otherwise,
the score is 0.
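The scoring rule itself reduces to a one-line computation; the sketch below (with an illustrative state representation) mirrors the description above.

def final_score(stacks, lives_lost):
    # One point per successfully played card, i.e. the height of each colour's
    # stack, for a maximum of 25; losing all three lives scores 0.
    return 0 if lives_lost >= 3 else sum(stacks.values())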
7 Since the deck contains 50 cards, 25 of which the players aim to play, and since when the deck
runs out the players will always hold at least 10 cards that cannot recover usable information
tokens, at most 15 cards can be discarded to recover information tokens. Combined
with the eight initial information tokens, and the four recovered through completing colour
stacks, this means players can hint at most 27 times during the game (and fewer times as the
number of players increases).
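The counting argument in the footnote can be reproduced directly:

deck_size = 50                    # total cards in the deck
cards_to_play = 25                # cards the players aim to play
stranded_in_hands = 10            # minimum cards still held when the deck runs out

max_token_discards = deck_size - cards_to_play - stranded_in_hands   # 15
initial_tokens = 8
tokens_from_completed_stacks = 4  # as noted in the footnote

max_hints = initial_tokens + tokens_from_completed_stacks + max_token_discards
print(max_hints)                  # 27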
For example, in Figure 1 both green 2s have been discarded, an effective loss
of four points as no higher rank green cards will ever be playable. As a result,
hinting to players that are at risk of discarding the only remaining card of a
given rank and colour is often prioritized. This is particularly common for rank-
5 cards since there is only one of each colour and they often need to be held for
a long time before the card can successfully be played.
3. Hanabi: A Challenge for Communication and Theory of Mind
players jointly observe the full state of the game. Bernstein and colleagues [18] showed that
equilibria. One such equilibrium is the babbling equilibrium where players do
not intentionally communicate information to other players, and ignore what
other players tell them. In that case, there is no incentive for a player to start
communicating because they will be ignored, and there is no incentive to pay
attention to other players because they are not communicating.
For learning algorithms to improve over such inferior equilibria, players will
need to jointly deviate, potentially changing established (suboptimal) conven-
tions. However, to change conventions a learning algorithm needs to change
behaviour across many decision points simultaneously, since what can be in-
ferred from a signal depends on all of the situations in which that signal is given.
Additionally, the receiver of the signal has to simultaneously change how they
respond for the overall deviation to result in a performance improvement. Pre-
vious multi-agent research has explored “cheap talk” channels as a mechanism
to communicate directly to other agents [20, 21, 22], however Hanabi’s restricted
communication structure prevents such cheap talk. Every signal affects perfor-
mance not just through what is communicated but also by its effect on the game
state itself, which creates a further challenge for learning algorithms. Section 5.3
will present experimental results on how current state-of-the-art algorithms fare
on this challenge.
Analogously, when humans play Hanabi, players rely on the same mutual
assumption of communicative intent: the acting player (the “speaker”) ex-
pects other players (the “listeners”) to infer the information implicated by their
specific choice of action and the context in which it was taken. Relying on such
inference enables players to efficiently signal about the hidden state of play-
ers’ cards (as in the example of humans hinting about playable cards) despite
typically not having explicit conventions about what an agent’s actions should
communicate, let alone having access to their policy.
Theory of mind and the assumption that other agents have communicative
intent provide useful mechanisms for making communication more efficient and
form the basis for the “conventions over search” employed by humans. While
this assumption imposes conventions in a manner analogous to those learned
by the “search over conventions” approach described for self-play, they are nat-
urally limited to conventions that can be inferred with communicative intent.
As a result, humans can interpret sophisticated plays without explicit agree-
ment because theory of mind makes them “self-explaining” instead of relying
on nuanced information theoretic encoding schemes. For instance, the finesse
described in Section 2 is possible without established agreement because players
assume the hint revealing the initial two was intended to communicate useful
information. These self-explaining conventions suggest a more efficient and pos-
sibly easier-to-learn approach to the self-play learning challenge.
to facilitate research and propose an experimental framework for the research
community to evaluate algorithmic advances in both settings.
9 The code release is being finalized. It will appear on the DeepMind github.
find remembering cards to be an uninteresting distraction, and the experimental
results in Section 5 show that the game remains challenging without requiring
agents to learn how to remember card knowledge. For researchers interested in
memory, we provide the option to request a minimal observation, which does
not include this card knowledge.
For debugging purposes, the code includes an environment for a small game
with only two colors, two cards per player, three information tokens, and a single
life token. There is also a very small game with a single color.
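For illustration, the reduced debugging variants described above could be specified with a configuration such as the following. This is only a sketch: the dictionary keys are assumptions for exposition and may not match the released environment's parameter names.

# Small debugging game: two colours, two cards per player, three information
# tokens, and a single life token. Key names are illustrative only.
SMALL_GAME = {
    "players": 2,
    "colors": 2,
    "hand_size": 2,
    "information_tokens": 3,
    "life_tokens": 1,
}

# The very small variant additionally restricts the game to a single colour.
VERY_SMALL_GAME = dict(SMALL_GAME, colors=1)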
• The mean and standard deviation of the performance of the best agent,
computed across at least 1000 trials (i.e., full games); see the sketch below.
In the future, as performance increments become smaller, this number
should be increased to allow for statistically significant results.
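A minimal sketch of the reported statistics, where `scores` holds one final score per evaluation game:

import statistics

def summarize(scores):
    # Mean, standard deviation, and standard error of the mean over the trials.
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return mean, sd, sd / n ** 0.5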
As we show in Section 5, Hanabi is difficult for current learning algorithms.
Even when using a large amount of data and computation time (UL regime),
learning agents have trouble approaching the performance of hand-crafted rules
in four player games, and fall far short of hand-crafted rules for three and five
players.
10 For two-player games forming a team is straightforward: we pair the evaluated agent
with a random player from the pool and randomly permute their order in the team. For
three to five-player games, the presence of more than one player from the pool is a potential
confounding factor: it could be that the team fails because of the interaction between these
players and through no fault of the evaluated agent. We therefore recommend limiting teams
to two unique players — the evaluated agent, in one seat, and one agent from the pool —
which is replicated in the remaining seats.
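A sketch of this recommended team-construction procedure (agent objects and seating are abstract placeholders):

import random

def build_adhoc_team(evaluated_agent, pool, num_players):
    # Seat the evaluated agent once, in a random position, and fill the
    # remaining seats with copies of a single agent drawn from the pool.
    teammate = random.choice(pool)
    team = [teammate] * num_players
    team[random.randrange(num_players)] = evaluated_agent
    return team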
standard deviations of the performance of the hold-out teams in self-play (as a
baseline comparison). We further recommend crosstables of the hold-out teams’
performance when paired with each other as a method of assessing the diversity
of the pool (e.g., see Figure 5).
In the future we expect to see canonical agent pools of pre-trained or hard-
coded self-play agents be made available for training and hold-out sets to allow
for consistent comparisons.
hidden layer and ReLU activations, then fed into a 2-layer LSTM with 256 units
in each layer. The policy π was a softmax readout of the LSTM output, and
the baseline was a learned linear readout of the LSTM output. We refer to this
method as the Actor-Critic-Hanabi-Agent (ACHA).
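A minimal PyTorch sketch of such an architecture is given below. The LSTM sizes and the two readouts follow the description above; the width of the input MLP is an illustrative assumption, as it is not specified in this excerpt.

import torch
import torch.nn as nn

class ActorCriticNet(nn.Module):
    def __init__(self, obs_size, num_actions, mlp_hidden=256):  # mlp_hidden is illustrative
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_size, mlp_hidden), nn.ReLU())
        self.lstm = nn.LSTM(mlp_hidden, 256, num_layers=2)   # 2-layer LSTM, 256 units per layer
        self.policy_head = nn.Linear(256, num_actions)       # softmax policy readout
        self.value_head = nn.Linear(256, 1)                  # learned linear baseline

    def forward(self, obs, hidden=None):
        # obs: [sequence_length, batch, obs_size]
        x = self.encoder(obs)
        x, hidden = self.lstm(x, hidden)
        policy = torch.softmax(self.policy_head(x), dim=-1)
        baseline = self.value_head(x).squeeze(-1)
        return policy, baseline, hidden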
Rainbow-Agent. Rainbow [36] is a state-of-the-art agent architecture for
deep RL on the Arcade Learning Environment. It combines some of the key
innovations that have been made to Deep Q-Networks (DQN) [37] over the last
few years, resulting in a learning algorithm that is both sample efficient and
achieves high rewards at convergence. In our benchmark we use a multi-agent
version of Rainbow, based on the Dopamine framework [38]. In our code the
agents controlling the different players share parameters. Our Rainbow agent is
feedforward and does not use any observation stacking outside of the last action,
which is included in the current observation.
Our Rainbow agent uses a 2-layer MLP of 512 hidden units each to predict
value distributions using distributional reinforcement learning [39]. Our batch
size, i.e., the number of experience tuples sampled from the replay buffer per
update, is 32, and we anneal the ε of our ε-greedy policy to 0 over the course
of the first 1000 training steps. We use a discount factor, γ, of 0.99 and ap-
ply prioritized sampling [40] for sampling from the replay buffer. Finally, our
value distributions are approximated as a discrete distribution over 51 uniformly
spaced atoms.
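For reference, the stated hyperparameters collected in one place (a plain summary; this is not the configuration format used by the Dopamine framework):

RAINBOW_HANABI_CONFIG = {
    "network": "2-layer MLP, 512 hidden units per layer",
    "value_distribution_atoms": 51,    # discrete distribution over 51 uniformly spaced atoms
    "batch_size": 32,                  # experience tuples sampled from replay per update
    "epsilon_final": 0.0,              # epsilon of the epsilon-greedy policy annealed to 0 ...
    "epsilon_decay_steps": 1000,       # ... over the first 1000 training steps
    "discount_gamma": 0.99,
    "prioritized_replay": True,
    "observation_stacking": "last action only",
}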
BAD-Agent [41]. For the two player self-play setting we also include the
results of the Bayesian Action Decoder since it constitutes the state of the art for
the two-player unlimited regime. Rather than relying on implicit belief repre-
sentations such as RNNs, the Bayesian Action Decoder (BAD) uses a Bayesian
belief update that directly conditions on the current policy of the acting agent.
In BAD all agents track a public belief, which includes everything that is com-
mon knowledge about the cards, including the posterior that is induced from
observing the actions different agents take. BAD also explores in the space of
deterministic policies, which ensures informative posteriors while also allowing
for the randomness required for exploration. Further details for the BAD agent are
provided in [41].
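A heavily simplified sketch of the kind of update BAD performs is shown below: the public belief over a player's possible hands is reweighted by the probability that the acting player's commonly known policy would have selected the observed action for each candidate hand. Enumerating hands explicitly is intractable in the full game, and BAD relies on factorized approximations, so this is illustrative only.

def update_public_belief(belief, policy, public_state, observed_action):
    # belief: dict mapping candidate private hands to probabilities.
    # policy(public_state, hand): dict mapping actions to probabilities.
    posterior = {}
    for hand, prior in belief.items():
        likelihood = policy(public_state, hand).get(observed_action, 0.0)
        posterior[hand] = prior * likelihood
    total = sum(posterior.values())
    if total == 0.0:
        return dict(belief)            # degenerate case: fall back to the prior
    return {hand: p / total for hand, p in posterior.items()}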
believe one of its cards has no valid value as all possible cards are inconsistent
with the observed play according to SmartBot’s convention. Finally, note that
SmartBot has a parameter specifying if it should attempt uncertain plays that
may cost a life. Risking lives increases the frequency of perfect games while
reducing average score, except in two player games where it is better on both
criteria. Our SmartBot results only risk lives in the two-player setting.
HatBot [15] and WTFWThat [43]. HatBot uses a technique often seen in
coding theory and “hat puzzles”. When giving hints, HatBot uses a predefined
protocol to determine a recommended action for all other players (i.e., play or
discard for one of a player’s cards). This joint recommendation is then encoded
by summing the indices for the individual recommendations and using modular
arithmetic. The encoded joint recommendation is mapped to different hints
that HatBot could make, specifically, whether it reveals the color or rank of a
card for each other player. Since each player can view everyone’s cards but their
own, they can reconstruct the action recommended to them by figuring out what
would have been recommended to the other players based on HatBot’s conven-
tion, which HatBot assumes they know and use. Although this convention is
not very intuitive for human players, it would still be possible for humans to
learn and follow. Cox and colleagues also introduce an “information strategy”
using a similar encoding mechanism to directly convey information about each
player’s cards (as opposed to a recommended action); however, it requires addi-
tional bookkeeping that makes it impractical for humans to use. As originally
proposed, both the recommendation and information strategies were tailored for
playing 5-player Hanabi. However, a variant of the information strategy, called
WTFWThat [43], can play two through five players.
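The core of the hat-puzzle encoding can be sketched in a few lines; the mapping from the encoded value to concrete colour or rank hints, and the recommendation indices themselves, are abstracted away here.

def encode_joint_recommendation(recommendations, num_signals):
    # The hinter sums one recommendation index per teammate and reduces the sum
    # modulo the number of distinguishable hints it can give.
    return sum(recommendations) % num_signals

def decode_own_recommendation(signal, visible_recommendations, num_signals):
    # Each teammate sees every hand but their own, recomputes the other players'
    # recommendation indices, and recovers their own by subtraction.
    return (signal - sum(visible_recommendations)) % num_signals

# Four teammates and eight distinguishable hints:
recs = [3, 1, 6, 2]
signal = encode_joint_recommendation(recs, 8)                 # (3 + 1 + 6 + 2) % 8 == 4
assert decode_own_recommendation(signal, [3, 6, 2], 8) == 1   # the second teammate recovers 1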
FireFlower [44]. FireFlower implements a set of human-style conventions.
The bot keeps track of both private and common knowledge, including proper-
ties of cards that are implied by the common knowledge of what the conventions
entail. Using this, FireFlower performs a 2-ply search over all possible actions
with a modeled probability distribution over what its partner will do in response,
and chooses the action that maximizes the expected value of an evaluation func-
tion. The evaluation function takes into account the physical state of the game
as well as the belief state. For example, if there is a card in the partner’s hand
that is commonly known (due to inference from earlier actions) to be likely to
be playable, then the evaluation function’s output will be much higher if it is
indeed playable than if it is not. FireFlower implements a few additional con-
ventions for three and four players, but avoids the hat-like strategies in favor
of conventions that potentially allow it to partner more naturally with humans,
assuming the human players exactly follow its strategy. Also, according to the
creator, FireFlower is designed with a focus on maximising the win probability,
rather than average score. Further details are in Appendix A.1.
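Schematically, such a search can be written as a 2-ply expectimax; everything below, including the state-transition interface, is a placeholder for illustration rather than FireFlower's actual implementation.

def choose_action(state, legal_actions, partner_model, evaluate):
    # Pick the action maximizing the expected evaluation after the partner's
    # modeled response, where partner_model returns an action distribution.
    def expected_value(action):
        after_me = state.apply(action)
        response_dist = partner_model(after_me)   # {partner_action: probability}
        return sum(prob * evaluate(after_me.apply(reply))
                   for reply, prob in response_dist.items())
    return max(legal_actions, key=expected_value)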
Regime Agent 2P 3P 4P 5P
– SmartBot 22.99 (0.00) 23.12 (0.00) 22.19 (0.00) 20.25 (0.00)
29.6% 13.8% 2.076% 0.0043%
– FireFlower 22.56 (0.00) 21.05 (0.01) 21.78 (0.01) -
52.6% 40.2% 26.5%
– HatBot – – – 22.56 (0.06)
14.7 %
– WTFWThat 19.45 (0.03) 24.20 (0.01) 24.83 (0.01) 24.89 (0.00)
0.28% 49.1% 87.2% 91.5%
SL Rainbow 20.64 (0.22) 18.71 (0.20) 18.00 (0.17) 15.26 (0.18)
2.5 % 0.2% 0% 0%
UL ACHA 22.73 (0.12) 20.24 (0.15) 21.57 (0.12) 16.80 (0.13)
15.1% 1.1% 2.4% 0%
UL BAD 23.92 (0.01) – – –
58.6%
Table 1: Shown are the results for the three learning agents, Rainbow, ACHA and BAD,
compared to the rule-based agents, SmartBot, FireFlower and HatBot, for different numbers
of players in self-play. For each algorithm and number of players we show mean performance
of the best agent followed by (standard error of the mean) and percentage of perfect (i.e., 25
point) games. The standard error of the mean differs because the number of evaluation games differs across agents.
[Figure 2 panel summaries — 2 players: mean 22.73, median 23.0, s.d. 3.68, n = 1000; 3 players: mean 20.24, median 21.0, s.d. 4.73, n = 1000; 4 players: mean 21.57, median 22.0, s.d. 3.68, n = 1000; 5 players: mean 16.80, median 17.0, s.d. 4.06, n = 1000. Left panels plot score against training steps (×10^10); right panels plot the proportion of games achieving each point total.]
Figure 2: ACHA results for two to five players, from top to bottom respectively. Performance
curves (left) are training-time results from the current soft-max policy. Average scores and
distributions (right) are test-time results from 1000 episodes generated using the greedy policy
from the agent with the best training score.
[Figure 3 panels plot score against training steps (×10^10) for each independent run.]
Figure 3: ACHA results without evolution, using 50 independent runs for two players (left),
and 30 independent runs for four players (right).
We further find that even ACHA runs with similar final performance can
learn very different conventions. For example, one agent uses color hints to
indicate that the 4th card of the teammate can likely be discarded, while another
agent uses the different color hints to indicate which card of the teammate can
be played. Different agents also use the rank-hints to indicate playability of
cards in different slots. Details examining specific examples of learned policies
are in Appendix A.2.
In contrast to ACHA, the Rainbow agents exhibit low variance across dif-
ferent independent training runs, as shown by the learning curves in Figure 4.
In this case each line represents an independent training trial. In addition,
Rainbow agents tend to converge to similar strategies, seemingly identifying the
same local minima. In particular, Rainbow agents are 3-4 times less likely to
hint for color than ACHA, and when they do there is no evidence of specific
conventions associated with the color hinted. Instead all Rainbow agents we
looked at primarily hint for rank, typically indicating that the most recent card
is playable. See Appendix A.2 for details. We speculate that two factors con-
tribute to this consistency across different runs. First, Rainbow has a one-step
memory for the past action and no memory of past observations. This drasti-
cally reduces the space of possible strategies. Second, Rainbow is a value-based
method and starts out with a high exploration rate. During the initial explo-
ration, since agents fail to successfully finish any games without running out of
lives, Q-values will tend towards zero. This might cause agents to learn from
effectively the same starting Q-values, even across different independent runs.
[Figure 4 panel summaries — 2 players: mean 20.64, median 21.0, s.d. 6.93, n = 1000; 3 players: mean 18.71, median 19.0, s.d. 6.18, n = 1000; 4 players: mean 18.00, median 18.0, s.d. 5.32, n = 1000; 5 players: mean 15.26, median 16.0, s.d. 5.73, n = 1000. Left panels plot score against training steps (×10^8); right panels plot the proportion of games achieving each point total.]
Figure 4: Rainbow results for two to five players, from top to bottom respectively. Performance
curves (left) are training-time results from the current policy. Average scores and distributions
(right) are test-time results from 1000 episodes generated using the agent with the best training
score. The ε of the ε-greedy policy is set to zero for all agents.
consider a population of agents made up of the best performing Rainbow agent
and a pool of independently trained ACHA agents taken from the top ten agents
according to final training-time performance from Figure 3.
As neither of these algorithms makes use of sample play of its ad-hoc team-
mates, the ad-hoc team’s performance will simply depend on the compatibility
of the different protocols. The purpose of this section, then, is primarily to
illustrate the difficulty in ad-hoc team play, and suggest a source for creating a
diverse pool of agents for future evaluation.
Two Players. Figure 5a shows a crosstable of the agents’ test performance
in the ad-hoc setting. The entry for the i-th row and j-th column shows the
mean performance of agent i playing in an ad-hoc team consisting of agent j,
evaluated for 1000 games with a random player starting each game. The scores
of 20 or greater along the diagonal entries show that the agents indeed perform
well in self-play. However, when paired with other agents, performance drops off
sharply, with some agents scoring essentially zero in any of these ad-hoc teams.
[Figure 5 crosstables: each cell gives the mean ad-hoc score for one pairing of agents. Diagonal entries (self-play) range roughly from 18 to 23, while off-diagonal pairings score drastically lower, many near zero.]
(a) Ad-hoc results for two players. (b) Ad-hoc results for four players.
Figure 5: Ad-hoc team results. Teams were constructed using the 10 best independently
trained ACHA agents (1–10; see Figure 3) and the best Rainbow agent (R; see Figure 4).
Mean scores over 1000 trials are shown. Standard error of the mean was less than 0.09 and
0.255 for two and four players, respectively.
6. Hanabi: Related Work
In this section, we discuss prior work on the game of Hanabi. We also briefly
discuss relevant work in multi-agent reinforcement learning.
the website these scores are averaged across the results for two to five players.
No further detail on this agent was available on the website.
6.3. Beliefs, Bayesian Reasoning, and Theory of Mind
There has been much work on modelling beliefs about the intentions or plans
of other agents. Several formalisms have been proposed for
this, such as I-POMDPs [73] and Bayesian games [74]. However, algorithms to
compute exact solutions quickly become intractable for complex problems.
Classical models to predict human behavior in games include ways to deal
with imperfect rationality [75, 76]. In recent years, several mechanisms have
been proposed to learn these models using deep learning: from human data in
one-shot games [77], from expert features describing players [78], and via end-to-end
prediction of fixed policies in gridworld games [2]. When predictions are used to
exploit or coordinate, this is generally called opponent/teammate modeling [79].
Several MARL algorithms have been recently proposed that learn from mod-
els of others, by using a training regime based on cognitive hierarchies [80], by
defining a learning rule that shapes the anticipated learning of other agents [81],
or by training architectures that incorporate agent identification as part of the
objective and conditioning the policy on these predictions [82]. Lastly, an ap-
proach in poker (DeepStack) introduced function approximators that represent
belief-conditional values, and explicitly represents and reasons about manip-
ulating agents' beliefs [10], similar to the approach taken in BAD [41].
7. Conclusion
Acknowledgements
time zones, and discussions with Cocktail Games; Angeliki Lazaridou for discus-
sion and feedback on pragmatics and theory of mind; Ed Lockhart for many dis-
cussions on small Hanabi-like games and belief-based reasoning; Daniel Toyama
for assistance with writing clear, readable code for the Hanabi research environ-
ment used in our experiments; Kevin Waugh for helpful comments and feedback;
David Wu for providing details on the FireFlower bot; and Danny Tarlow for
early feedback on the project.
References
[6] M. Campbell, A. J. Hoane Jr, F.-h. Hsu, Deep Blue, Artificial Intelligence
134 (1-2) (2002) 57–83.
[7] J. Schaeffer, R. Lake, P. Lu, M. Bryant, CHINOOK: The world man-machine
checkers champion, AI Magazine 17 (1) (1996) 21.
[12] N. Burch, M. Johanson, M. Bowling, Solving imperfect information games
using decomposition, in: Proceedings of the Twenty-Eighth AAAI Confer-
ence on Artificial Intelligence, 2014, pp. 602–608.
URL http://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/view/
8407
[13] N. Burch, Time and space: Why imperfect information games are hard,
Ph.D. thesis, University of Alberta, Alberta, Canada (2017).
[14] J. Zamiell, Github - zamiell/hanabi-conventions: Hanabi conven-
tions for the hyphen-ated group, https://github.com/Zamiell/
hanabi-conventions, [Online; accessed 5-December-2018] (2018).
[23] H. P. Grice, Logic and conversation, in: P. Cole, J. L. Morgan (Eds.),
Syntax and semantics, Vol. 3, New York: Academic Press, 1975.
[24] S. T. Piantadosi, H. Tily, E. Gibson, The communicative func-
tion of ambiguity in language, Cognition 122 (3) (2012) 280 – 291.
doi:https://doi.org/10.1016/j.cognition.2011.10.004.
URL http://www.sciencedirect.com/science/article/pii/
S0010027711002496
[25] M. C. Frank, N. D. Goodman, Predicting pragmatic reasoning in lan-
guage games, Science 336 (6084) (2012) 998–998. arXiv:http://
science.sciencemag.org/content/336/6084/998.full.pdf, doi:10.
1126/science.1218633.
URL http://science.sciencemag.org/content/336/6084/998
[26] M. G. Bellemare, Y. Naddaf, J. Veness, M. Bowling, The Arcade Learn-
ing Environment: An evaluation platform for general agents, Journal of
Artificial Intelligence Research 47 (2013) 253–279.
[27] D. Lewis, Convention: A philosophical study, John Wiley & Sons, 2008.
[28] M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht,
M. Bowling, Revisiting the Arcade Learning Environment: Evaluation pro-
tocols and open problems for general agents, Journal of Artificial Intelli-
gence Research 61 (2018) 523–562.
[29] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman,
J. Tang, W. Zaremba, OpenAI Gym, CoRR abs/1606.01540. arXiv:
1606.01540.
URL http://arxiv.org/abs/1606.01540
[30] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley,
D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement
learning, in: Proceedings of the 33rd International Conference on Machine
Learning (ICML), 2016, pp. 1928–1937.
[31] B. Wymann, C. Dimitrakakis, A. Sumner, E. Espié, C. Guionneau, TORCS:
The Open Racing Car Simulator (2015).
[32] C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler,
A. Lefrancq, S. Green, V. Valdés, A. Sadik, J. Schrittwieser, K. Anderson,
S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis,
S. Legg, S. Petersen, DeepMind Lab, CoRR abs/1612.03801. arXiv:1612.
03801.
URL http://arxiv.org/abs/1612.03801
[33] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward,
Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, K. Kavukcuoglu, IMPALA:
Scalable distributed deep-RL with importance weighted actor-learner
architectures, CoRR abs/1802.01561.
[34] M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G.
Castaneda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman,
N. Sonnerat, T. Green, L. Deason, J. Z. Leibo, D. Silver, D. Hassabis,
K. Kavukcuoglu, T. Graepel, Human-level performance in first-person mul-
tiplayer games with population-based deep reinforcement learning, CoRR
abs/1807.01281.
[35] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue,
A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando,
K. Kavukcuoglu, Population based training of neural networks, CoRR
abs/1711.09846.
[36] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney,
D. Horgan, B. Piot, M. G. Azar, D. Silver, Rainbow: Combining improve-
ments in deep reinforcement learning, CoRR abs/1710.02298.
[37] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.
Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski,
et al., Human-level control through deep reinforcement learning, Nature
518 (7540) (2015) 529–533.
[38] P. S. Castro, S. Moitra, C. Gelada, S. Kumar, M. G. Bellemare, Dopamine:
A research framework for deep reinforcement learning, arXiv.
[39] M. G. Bellemare, W. Dabney, R. Munos, A distributional perspective on
reinforcement learning, in: Proceedings of the International Conference on
Machine Learning, 2017.
[40] T. Schaul, J. Quan, I. Antonoglou, D. Silver, Prioritized experience replay,
in: International Conference on Learning Representations, 2016.
[41] J. N. Foerster, F. Song, E. Hughes, N. Burch, I. Dunning, S. Whiteson,
M. Botvinick, M. Bowling, Bayesian action decoder for deep multi-agent
reinforcement learning, CoRR abs/1811.01458.
[42] A. O’Dwyer, Github - quuxplusone/hanabi: Framework for writing bots
that play hanabi, https://github.com/Quuxplusone/Hanabi, [Online;
accessed 19-September-2018] (2018).
[43] J. Wu, Github - wuthefwasthat/hanabi.rs: Hanabi simulation in rust,
https://github.com/WuTheFWasThat/hanabi.rs, [Online; accessed 29-
November-2018] (2018).
[44] D. Wu, Github - lightvector/fireflower: A rewrite of hanabi-bot in
scala, https://github.com/lightvector/fireflower, [Online; accessed
3-December-2018] (2018).
[45] H. Osawa, Solving Hanabi: Estimating hands by opponent’s actions in
cooperative game with incomplete information, in: Proceedings on the 2015
AAAI Workshop on Computer Poker and Imperfect Information, 2015, pp.
37–43.
[46] M. J. H. van den Bergh, A. Hommelberg, W. A. Kosters, F. M. Spieksma,
Aspects of the cooperative card game Hanabi, in: BNAIC 2016: Artificial
Intelligence. 28th Benelux Conference on Artificial Intelligence, Vol. 765 of
CCIS, Springer, Cham, 2016, pp. 93–105.
[47] J. Walton-Rivers, P. R. Williams, R. Bartle, D. Perez-Liebana, S. M. Lucas,
Evaluating and modelling Hanabi-playing agents, in: Proceedings of the
IEEE Conference on Evolutionary Computation (2017), 2017.
[48] B. Bouzy, Playing Hanabi near-optimally, in: Advances in Computer
Games: ACG 2017, Vol. 10664 of LNCS, Springer, Cham, 2017, pp. 51–62.
[49] M. Eger, C. Martens, Practical specification of belief manipulation in
games, in: Proceedings, The Thirteenth AAAI Conference on Artificial
Intelligence and Interactive Digital Entertainment (AIIDE-17), 2017.
[50] E. Pacuit, O. Roy, Epistemic foundations of game theory, in: E. N. Zalta
(Ed.), The Stanford Encyclopedia of Philosophy, summer 2017 Edition,
Metaphysics Research Lab, Stanford University, 2017.
[51] J. Walton-Rivers, P. Williams, CIG 2018 Hanabi fireworks com-
petition, http://hanabi.aiclash.com/, https://project.dke.
maastrichtuniversity.nl/cig2018/competitions/, retrieved Sept.
27th, 2018 (2018).
[52] R. Sutton, A. Barto, Reinforcement Learning: An Introduction, 2nd Edi-
tion, MIT Press, 2018.
[53] M. L. Littman, Markov games as a framework for multi-agent reinforcement
learning, in: In Proceedings of the Eleventh International Conference on
Machine Learning, Morgan Kaufmann, 1994, pp. 157–163.
[54] L. Matignon, G. J. Laurent, N. L. Fort-Piat, Independent reinforcement
learners in cooperative Markov games: a survey regarding coordination
problems, The Knowledge Engineering Review 27 (01) (2012) 1–31.
[55] M. Bowling, M. Veloso, Simultaneous adversarial multi-robot learning, in:
IJCAI, Vol. 3, 2003, pp. 699–704.
[56] L. Panait, S. Luke, Cooperative multi-agent learning: The state of the
art, Autonomous Agents and Multi-Agent Systems 11 (3) (2005) 387–434.
doi:10.1007/s10458-005-2631-2.
URL http://dx.doi.org/10.1007/s10458-005-2631-2
[57] L. Busoniu, R. Babuska, B. D. Schutter, A comprehensive survey of mul-
tiagent reinforcement learning, IEEE Transaction on Systems, Man, and
Cybernetics, Part C: Applications and Reviews 38 (2) (2008) 156–172.
[58] A. Nowé, P. Vrancx, Y.-M. D. Hauwere, Game theory and multi-agent
reinforcement learning, in: Reinforcement Learning: State-of-the-Art,
Springer, 2012, Ch. 14, pp. 441–470.
[59] D. Bloembergen, K. Tuyls, D. Hennes, M. Kaisers, Evolutionary dynamics
of multi-agent learning: A survey, J. Artif. Intell. Res. (JAIR) 53 (2015)
659–697.
[60] P. Hernandez-Leal, B. Kartal, M. E. Taylor, Is multiagent deep rein-
forcement learning the answer or the question? a brief survey, CoRR
abs/1810.05587. arXiv:1810.05587.
URL http://arxiv.org/abs/1810.05587
[61] J. Baxter, A. Tridgell, L. Weaver, KnightCap: A chess program that learns
by combining TD(λ) with game-tree search, in: Proceedings of the
15th International Conference on Machine Learning, Morgan Kaufmann,
1998, pp. 28–36.
[62] J. Veness, D. Silver, A. Blair, W. Uther, Bootstrapping from game tree
search, in: Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams,
A. Culotta (Eds.), Advances in Neural Information Processing Systems 22,
Curran Associates, Inc., 2009, pp. 1937–1945.
07182.
URL http://arxiv.org/abs/1612.07182
[70] S. Sukhbaatar, A. Szlam, R. Fergus, Learning multiagent communication
with backpropagation, CoRR abs/1605.07736. arXiv:1605.07736.
URL http://arxiv.org/abs/1605.07736
[71] H. de Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle, A. Courville,
GuessWhat?! Visual object discovery through multi-modal dialogue, in:
IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[72] C. Yeh, H. Lin, Automatic bridge bidding using deep reinforcement learn-
ing, CoRR abs/1607.03290. arXiv:1607.03290.
URL http://arxiv.org/abs/1607.03290
[73] P. J. Gmytrasiewicz, P. Doshi, A framework for sequential planning in
multi-agent settings, JAIR 24.
[74] Y. Shoham, K. Leyton-Brown, Multiagent Systems: Algorithmic, Game-
Theoretic, and Logical Foundations, Cambridge University Press, 2009.
[75] W. Yoshida, R. J. Dolan, K. J. Friston, Game theory of mind, PLOS 4 (12).
[76] J. R. Wright, K. Leyton-Brown, Predicting human behavior in unrepeated,
simultaneous-move games, Games and Economic Behavior 106 (2017) 16–
37.
[77] J. Hartford, J. R. Wright, K. Leyton-Brown, Deep learning for predict-
ing human strategic behavior, in: Thirtieth Annual Conference on Neural
Information Processing Systems (NIPS 2016), 2016.
[78] H. He, J. Boyd-Graber, K. Kwok, H. D. III, Opponent modeling in deep
reinforcement learning, in: Proceedings of the International Conference on
Machine Learning (ICML), 2016.
[79] S. V. Albrecht, P. Stone, Autonomous agents modelling other agents: A
comprehensive survey and open problems, Artificial Intelligence 258 (2018)
66–95.
[80] M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Pero-
lat, D. Silver, T. Graepel, A unified game-theoretic approach to multiagent
reinforcement learning, in: Advances in Neural Information Processing Sys-
tems, 2017.
[81] J. N. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, I. Mor-
datch, Learning with opponent-learning awareness, CoRR abs/1709.04326.
arXiv:1709.04326.
URL http://arxiv.org/abs/1709.04326
[82] A. Grover, M. Al-Shedivat, J. K. Gupta, Y. Burda, H. Edwards, Learn-
ing policy representations in multiagent systems, in: Proceedings of the
International Conference on Machine Learning (ICML), 2018.
Appendix A.
• Hints generally indicate play cards, and generally newer cards first.
• Hints “chain” on to other hints, e.g., if A hints to B a playable red 2 as
red, then B might infer that it is indeed the red 2, and then hint back
to A a red 3 in A's hand as red, expecting A to believe it is a red 3 and
play it after B plays its red 2.
• When discarding, discard provably useless cards, otherwise the oldest “un-
protected” card.
• Hints about the oldest unprotected card “protect” that card, with many
exceptions where it instead means play.
• Hints about garbage cards indicate “protect” cards older than it.
• Deliberately discarding known playable cards signals to the partner that
they have a copy of that card (with enough convention-based specificity
on location that they may play it with absolutely no actual hints).
• In many cases, hints about cards that have already been hinted change
the belief about those cards in various ways such that the partner will
likely change what they do (in accordance with the broader heuristic that
one usually should give a hint if and only if they would like to change the
partner’s future behavior).
P%(a_t+1 | a_t)   Discard            Play               Hint Colour     Hint Rank
a_t \ a_t+1       1 2 3 4 5          1 2 3 4 5          R Y G W B       1 2 3 4 5
Discard 1 4 4 3 1 10 2 1 1 0 1 1 6 2 2 1 12 14 11 12 11
2 2 6 5 2 12 1 0 0 0 0 1 4 2 2 1 14 18 13 11 5
3 2 6 11 2 13 1 1 1 1 2 1 3 2 2 1 18 17 11 4 1
4 1 2 5 4 19 0 1 0 0 0 1 6 1 6 2 18 14 10 6 2
5 1 1 1 0 30 1 0 0 0 0 0 3 2 2 2 10 12 16 13 6
Play 1 4 5 3 1 18 4 2 1 1 2 0 2 1 1 1 13 10 10 11 9
2 4 6 5 1 22 3 2 1 0 1 0 2 1 1 0 11 12 11 11 6
3 4 5 2 1 26 4 2 2 1 2 0 3 1 1 1 7 6 8 14 10
4 3 3 4 0 0 3 5 1 1 1 0 3 0 0 0 49 10 1 12 3
5 2 4 5 1 26 2 1 0 0 0 0 3 1 1 0 12 14 12 10 5
Hint R 6 31 25 9 3 0 0 0 0 1 9 4 2 5 2 1 2 0 1 1
Colour Y 19 31 12 5 8 1 0 1 0 0 1 7 2 3 1 2 3 2 2 2
G 20 30 18 1 9 2 0 1 0 0 0 4 3 1 1 0 2 2 2 2
W 18 18 16 10 15 0 0 0 0 0 1 4 1 8 0 2 2 1 2 1
B 12 8 11 6 37 2 0 0 0 0 0 2 2 4 3 1 1 3 3 4
Hint 1 3 3 6 8 1 8 9 0 8 50 0 0 0 0 0 3 0 1 0 0
Rank 2 3 6 1 0 0 8 23 0 0 52 0 0 0 0 0 3 1 0 0 0
3 1 5 2 0 1 11 20 10 0 47 0 0 0 0 0 0 1 1 0 0
4 1 4 3 1 1 13 22 6 1 37 0 0 0 0 0 6 2 1 1 0
5 2 4 0 0 2 29 19 7 2 28 0 1 0 0 0 0 1 1 1 2
P%(a_t) 3 5 4 2 14 5 7 2 1 17 0 2 1 1 1 9 8 7 7 4
Table A.2: Conditional action probabilities for the first two-player Rainbow agent.
P%(a_t+1 | a_t)   Discard            Play               Hint Colour     Hint Rank
a_t \ a_t+1       1 2 3 4 5          1 2 3 4 5          R Y G W B       1 2 3 4 5
Discard 1 4 1 1 1 22 1 1 1 0 1 1 2 3 2 0 10 21 9 10 7
2 2 3 1 0 19 1 1 1 1 1 2 2 3 3 0 9 14 19 11 9
3 4 1 6 1 19 2 0 0 0 2 3 3 4 3 3 21 11 9 2 5
4 3 1 1 2 29 1 0 1 0 0 2 3 3 6 2 12 15 11 5 4
5 1 1 0 0 32 1 0 0 0 0 0 3 2 4 0 10 13 14 11 5
Play 1 4 2 1 1 24 3 2 2 0 3 1 1 1 1 0 11 12 10 11 10
2 4 2 1 1 28 3 2 1 0 2 0 1 1 1 0 12 14 12 8 7
3 4 2 1 0 23 4 2 1 0 4 1 2 1 2 0 11 13 7 12 9
4 3 2 0 0 8 4 2 1 0 6 0 0 1 1 0 35 23 5 5 4
5 3 1 1 1 33 2 2 1 0 1 0 2 2 1 0 11 15 12 8 5
Hint R 15 23 8 9 5 1 0 0 0 0 11 0 7 4 4 3 1 1 1 5
Colour Y 28 15 11 9 13 1 1 0 0 1 1 4 3 1 0 2 2 3 2 3
G 22 19 5 4 17 1 1 1 0 1 3 2 8 2 0 2 1 2 2 6
W 9 7 2 4 60 1 1 1 0 1 2 2 3 4 0 1 1 1 1 2
B 17 7 21 17 4 1 0 0 1 5 7 0 0 1 10 1 4 1 2 0
Hint 1 2 3 2 6 1 9 11 6 5 54 0 0 0 0 0 0 0 0 0 0
Rank 2 8 2 1 0 1 7 21 0 0 48 0 0 0 0 0 8 3 0 0 0
3 2 2 1 0 1 12 18 11 0 48 0 0 0 0 0 2 1 1 0 0
4 1 1 0 0 1 24 14 9 1 43 0 0 0 0 0 0 0 1 1 1
5 8 4 4 2 2 15 19 7 2 23 0 1 2 0 0 2 1 1 3 3
P%(a_t) 4 2 1 1 19 5 7 3 1 17 0 1 2 2 0 8 9 8 6 4
Table A.3: Conditional action probabilities for the second two-player Rainbow agent.
P%(a_t+1 | a_t)   Discard            Play               Hint Colour     Hint Rank
a_t \ a_t+1       1 2 3 4 5          1 2 3 4 5          R Y G W B       1 2 3 4 5
Discard 1 9 1 2 1 15 1 1 0 0 0 3 2 2 0 0 19 19 11 9 5
2 8 4 2 1 16 1 0 0 0 0 6 2 2 1 1 16 20 9 8 2
3 6 1 2 0 24 1 1 0 0 1 3 3 2 0 1 12 16 14 7 4
4 7 1 2 2 27 1 1 0 1 1 3 3 2 1 1 14 13 10 7 3
5 2 0 1 0 32 1 0 0 0 0 3 3 3 0 0 11 12 14 13 4
Play 1 9 1 2 1 21 4 3 1 1 2 1 1 0 0 1 10 12 11 12 8
2 4 1 2 1 25 5 5 1 1 3 1 3 1 0 1 6 7 9 13 12
3 8 2 2 1 22 2 2 1 1 3 1 1 0 0 0 17 13 10 10 4
4 5 1 1 2 2 6 3 2 1 8 0 1 0 0 0 33 13 1 15 3
5 7 1 1 1 29 2 1 1 0 1 2 2 1 0 0 12 15 11 10 4
Hint R 49 15 20 3 4 0 0 0 0 0 0 1 0 0 1 0 3 1 0 1
Colour Y 35 5 8 3 12 3 4 1 1 0 1 4 2 0 3 2 2 2 5 7
G 12 4 7 3 55 1 1 0 0 0 1 3 3 1 2 2 1 1 2 1
W 14 13 8 9 27 0 0 0 1 0 2 0 2 12 1 0 2 4 2 1
B 31 2 13 3 12 0 2 1 1 1 2 3 3 1 14 2 1 2 2 3
Hint 1 4 2 5 8 2 10 0 9 6 47 0 0 0 0 0 4 0 1 0 0
Rank 2 9 0 1 0 0 17 0 15 1 48 0 0 0 0 0 4 2 0 0 0
3 2 1 1 0 0 14 9 19 0 50 0 0 0 0 0 1 1 1 0 0
4 3 3 2 1 1 19 14 8 1 38 0 0 0 0 0 5 3 1 2 1
5 2 1 1 0 2 31 14 8 1 30 1 1 0 0 1 1 0 1 2 3
P%(a_t) 7 1 2 1 16 7 3 5 1 17 1 2 1 0 0 9 9 7 7 3
Table A.4: Conditional action probabilities for the third two-player Rainbow agent.
P%(a_t+1 | a_t)   Discard            Play               Hint Colour     Hint Rank
a_t \ a_t+1       1 2 3 4 5          1 2 3 4 5          R Y G W B       1 2 3 4 5
Discard 1 3 4 7 3 6 4 3 1 0 1 6 6 5 6 6 11 7 7 5 9
2 4 5 7 2 6 4 2 2 0 0 6 6 5 6 5 12 6 8 5 9
3 4 4 9 2 7 5 3 1 0 0 5 6 5 6 5 10 6 9 5 8
4 4 4 7 3 12 4 3 1 0 0 4 5 3 4 6 11 8 6 5 9
5 1 2 4 1 9 4 3 1 42 0 2 2 2 2 2 3 6 7 5 2
Play 1 3 3 5 3 6 9 5 2 0 2 4 3 4 5 4 8 8 7 6 11
2 4 4 6 4 7 8 5 1 0 1 4 4 4 5 3 9 7 8 5 11
3 3 3 4 2 7 5 3 1 0 1 5 4 4 4 4 12 15 8 4 9
4 4 4 5 4 9 7 4 1 0 1 4 5 4 5 5 9 9 9 5 5
5 5 5 8 5 10 13 7 1 0 0 3 3 3 3 3 6 7 6 4 5
Hint R 4 5 3 38 1 16 14 1 0 0 2 1 1 1 2 3 3 2 2 3
Colour Y 4 3 3 45 1 15 12 1 0 0 1 2 1 1 1 3 3 2 1 2
G 5 4 2 37 1 19 14 1 0 0 1 1 2 1 2 3 3 2 1 2
W 3 3 3 44 1 15 13 1 0 0 2 1 1 1 1 3 2 2 1 2
B 3 5 1 44 1 15 12 1 0 0 1 1 1 1 2 3 2 2 2 2
Hint 1 0 0 0 0 0 9 6 9 12 64 0 0 0 0 0 0 0 0 0 0
Rank 2 3 4 2 1 0 2 4 4 22 42 0 0 0 1 0 8 2 1 0 1
3 2 3 1 16 1 5 6 3 26 24 1 1 1 0 1 6 3 1 0 1
4 1 1 0 3 0 2 2 2 55 31 0 0 0 0 0 0 0 0 0 1
5 2 2 2 18 0 6 3 1 55 4 1 0 1 1 1 0 0 1 1 1
P%(a_t) 3 3 4 10 5 8 5 2 10 9 3 3 2 3 3 7 6 5 3 5
Table A.5: Conditional action probabilities for the first two-player ACHA agent.
P%(a_t+1 | a_t)   Discard            Play               Hint Colour     Hint Rank
a_t \ a_t+1       1 2 3 4 5          1 2 3 4 5          R Y G W B       1 2 3 4 5
Discard 1 2 2 0 1 31 6 0 0 0 0 9 3 2 3 4 4 2 6 8 16
2 2 2 1 1 33 2 1 1 1 1 8 2 3 4 4 2 2 3 8 17
3 0 2 5 2 0 53 3 4 1 2 5 1 5 2 1 2 2 1 1 10
4 0 0 1 1 0 0 71 0 0 0 4 2 1 1 2 2 2 1 2 11
5 2 1 0 1 31 1 0 0 0 0 11 3 2 6 4 5 3 6 8 16
Play 1 3 3 1 1 27 6 2 1 0 2 7 3 3 3 3 7 2 5 8 12
2 4 3 1 1 27 5 2 1 0 2 6 4 2 4 3 4 3 5 9 13
3 3 2 1 1 25 4 1 1 0 1 8 4 5 4 3 8 3 5 9 12
4 4 2 2 2 31 3 1 0 0 1 7 3 3 7 4 5 3 4 8 11
5 2 1 1 1 33 3 2 0 0 1 11 4 3 5 4 5 3 5 8 9
Hint R 0 0 0 0 0 0 0 0 0 97 0 0 0 0 0 0 0 0 0 1
Colour Y 0 0 0 1 0 0 94 0 0 0 1 0 0 0 0 1 0 0 0 1
G 0 0 0 0 0 0 0 0 99 0 0 0 0 0 0 0 0 0 0 0
W 0 0 0 0 0 0 0 0 98 0 0 0 0 0 0 0 0 0 0 0
B 0 0 0 1 0 0 95 0 0 0 0 0 0 0 0 1 0 0 0 1
Hint 1 0 0 0 1 0 23 0 73 0 0 1 0 0 0 0 0 0 0 0 1
Rank 2 0 0 0 0 0 47 0 0 0 44 1 1 0 1 1 2 0 1 0 1
3 28 4 1 1 5 43 0 1 0 1 1 1 1 1 1 2 1 2 3 1
4 11 32 1 2 9 17 0 3 0 0 1 2 2 3 2 5 1 3 4 2
5 0 0 0 0 0 0 0 0 0 97 0 0 0 0 0 0 0 0 0 1
P%(a_t) 3 3 1 1 19 6 6 3 5 15 6 2 2 3 3 4 2 4 6 8
Table A.6: Conditional action probabilities for the second two-player ACHA agent.