Experiments on the mechanization of game-learning
Part I. Characterization of the model and its parameters
By Donald Michie
This paper describes a trial-and-error device which learns to play the game of Noughts and Crosses.
It was initially constructed from matchboxes and coloured beads and subsequently simulated in
essentials by a program for a Pegasus 2 computer. The parameters governing the adaptive
behaviour of this automaton are described and preliminary observations on its performance are
briefly reported.
A reason for being interested in games is that they provide a microcosm of intellectual activity in general. Those thought processes which we regard as being specifically human accomplishments—learning from experience, inductive reasoning, argument by analogy, the formation and testing of new hypotheses, and so on—are brought into play even by simple games of mental skill. The problem of artificial intelligence consists in the reduction of these processes to the elementary operations of arithmetic and logic.

The present work is concerned with one particular mental activity, that of trial-and-error learning, and the mental task used for studying it is the game of Noughts and Crosses, sometimes known as Tic-tac-toe.

From the point of view of one of the players, any game, such as Tic-tac-toe, represents a sequential decision process. Sooner or later the sequence of choices terminates in an outcome, to which a value is attached, according to whether the game has been won, drawn or lost. If the player is able to learn from experience, the choices which have led up to a given outcome receive reinforcements in the light of the outcome value. In general, positive outcomes are fed back in the form of positive reinforcement, that is to say, the choices belonging to the successful sequence become more probable on later recurrence of the same situations. Similarly, negative outcomes are fed back as negative reinforcements. The process is illustrated in Fig. 1.

Fig. 1.—Schematic picture of the reinforcement process during trial-and-error learning of a game. The numbered boxes represent the players' successive choice-points, and the black boxes those of the opponent. Arrows drawn with broken lines indicate possible alternative choices open at the given stage

This picture of trial-and-error learning uses the concepts and terminology of the experimental psychologist. Observations on animals agree with common sense in suggesting that the strength of reinforcement becomes less as we proceed backwards along the loop from the terminus towards the origin. The more recent the choice in the sequence, the greater its probable share of responsibility for the outcome. This provides an adequate conceptual basis for a trial-and-error learning device, provided that the total number of choice-points which can be encountered is small enough for them to be individually listed.

The matchbox machine

Fig. 2 shows such a device, known as MENACE, standing for Matchbox Educable Noughts And Crosses Engine. The machine shown is equipped to function as the opening player. The principles by which it operates are extremely simple and have been described elsewhere (Michie, 1961). However, a brief recapitulation will here be given.

Fig. 2.—The matchbox machine—MENACE
Every one of the 287 essentially distinct positions which the opening player can encounter in the course of play is represented by a separate box, the face of which bears a drawing of the position and a code-number for indexing. The words "essentially distinct" are emphasized because such variants as those listed in Fig. 3 are treated as one and the same position. Each box contains an assortment of variously coloured beads. The different colours correspond to the different unoccupied squares to which moves could be made, according to the code shown in Table 1. Consider the box corresponding to the position of Fig. 3. A simple convention determines which of the four orientations is to be regarded as standard—in this case the first one listed. At first sight there are seven possible moves available. Considerations of symmetry, however, reduce these to four, namely moves to squares 1, 8, 7 and 6. Hence the box is equipped with white, black, amber and red beads.

Table 1
The colour code used in the matchbox machine. The system of numbering the squares is that adopted for the subsequent computer simulation program

    1 WHITE     2 LILAC     3 SILVER
    8 BLACK     0 GOLD      4 GREEN
    7 AMBER     6 RED       5 PINK
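The reduction of a position to a standard form is easy to express in a modern programming language. The following Python fragment is an illustrative sketch of ours, not part of MENACE or of the Pegasus program: it represents a position as a mapping from square number (as in Table 1) to mark, and takes as canonical the least of the eight images of the position under rotation and reflection.

    def rotate(sq):
        # A quarter-turn adds 2 to each ring square (1-8); the centre (0) is fixed.
        return 0 if sq == 0 else (sq + 1) % 8 + 1

    def reflect(sq):
        # Reflection about the vertical axis sends square s to (4 - s) mod 8, kept in the range 1-8.
        return 0 if sq == 0 else (4 - sq) % 8 or 8

    def transform(position, f):
        # Apply a square-renumbering f to a position (a dict: square -> 'O', 'X' or ' ').
        return {f(sq): mark for sq, mark in position.items()}

    def canonical(position):
        # Serialize the least of the eight symmetry variants of the position.
        variants, p = [], position
        for _ in range(4):
            p = transform(p, rotate)
            variants += [p, transform(p, reflect)]
        return min("".join(v[sq] for sq in range(9)) for v in variants)

Applied to the four boards of Fig. 3, a routine of this kind returns the same canonical string for each, so all four are served by a single box.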
Imagine that we wish to play against the machine. In order to ascertain its first move, we remove the box corresponding to the opening position, shake it and tilt it forwards. The beads—in this case white, lilac and gold—run to the front, where a V-shaped partition selects the first to arrive. The colour of this bead defines the machine's opening move. The human opponent replies, thus generating a fresh position, which might, for the sake of illustration, be the one shown in Fig. 3. The box corresponding to this position is located, shaken and tilted, thus selecting the machine's next move—and so on to the end of the play.

At this stage reinforcements are applied. If the machine has done badly, it is "punished" by confiscation of the selected bead from each of the three or four boxes which have been used during the play, so that it becomes less probable, when any of these positions recur in future play, that the unsuccessful move will be repeated. If the machine has done well, it is "rewarded" by adding to each of the open boxes an extra bead of the same colour as the selected one. The moves in the successful sequence thus become more likely to be repeated if and when any of these positions recur in future.
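The whole procedure is simple enough to express in a few lines of code. The sketch below is again ours rather than a transcription of MENACE: a box is a tally of beads for each available move, a move is drawn with probability proportional to the bead counts, and the bonus or forfeit is applied to every box opened during the play.

    import random

    boxes = {}       # canonical position -> {move: number of beads of that colour}
    history = []     # (position, move) pairs for the boxes opened in the current play

    def choose(position, legal_moves, replicates=1):
        # Shake and tilt the box: draw a move in proportion to its bead count.
        # In MENACE the initial number of replicates depends on the stage of play (see below).
        box = boxes.setdefault(position, {m: replicates for m in legal_moves})
        beads = [m for m, n in box.items() for _ in range(n)]
        move = random.choice(beads)
        history.append((position, move))
        return move

    def reinforce(bonus):
        # bonus = +1 or +3 adds beads to each open box; bonus = -1 confiscates the selected bead.
        for position, move in history:
            box = boxes[position]
            box[move] = max(box[move] + bonus, 0)
        history.clear()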
Fig. 3.—Four positions which are in reality variants of a single position

As stated earlier, it is desirable that the strength of reinforcement should be related to the stage of the game, being maximal for terminal moves and decreasing towards the beginning. This general pattern was ensured by making the number of times each colour in a box was replicated a decreasing function of the stage of play, as shown in Table 2.

Table 2
Variation of the number of colour-replicates of a move according to the stage of play (see text)

    STAGE OF PLAY    NUMBER OF TIMES EACH COLOUR IS REPLICATED
          1                            4
          3                            3
          5                            2
          7                            1

It can be seen that the system of unit bonuses and forfeits will cause more rapid change of probabilities in late boxes than in early boxes.

For MENACE's maiden tournament against a human opponent a draw was reckoned a good result, and received a unit bonus. A win was regarded as an exceptionally good result and was rewarded by three extra beads to each open box. A defeat was punished by a unit forfeit. Fig. 4 shows the progress of the tournament. The slope of the line of dots measures the prowess of the machine at any stage.
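The differing force of a unit bonus or forfeit at different stages is easily illustrated. Suppose, for the sake of argument, that a box offers three colours. A stage-1 box then starts with four beads of each colour, twelve in all, and a unit forfeit moves the selected colour's share only from 4/12 to 3/11; a stage-7 box starts with a single bead of each colour, so the same forfeit removes the move from consideration altogether, while a unit bonus doubles its weight.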
Fig. 4.—The progress of MENACE's maiden tournament against a human opponent. (Reproduced from Penguin Science Survey, 2 (1961), p. 139.) The line of dots drops one level for a defeat, rises one level for a draw and rises three levels for a victory. The variants listed along the top indicate the different replies to the machine's opening move which its opponent resorted to
Fig. 5.—Random play at Noughts and Crosses as simulated by the computer program. The numerical coding of moves is as shown in Table 1
Computer simulation program
With the aid of Mr. D. J. M. Martin, of Ferranti Ltd., a Pegasus 2 computer has been programmed to simulate the matchbox machine. The computer program steps into the shoes of both players, Nought and Cross, and plays them against each other at the rate of about a game a second. Either side can be made to operate as a learner, or as a non-learner, at any desired standard of play from random up to expert. Fig. 5 shows part of a print-out when both sides were playing at random. There is evidently an inherent bias in the structure of the game in favour of the opening player, Nought, to the extent of about 2 : 1. Random games have an extremely idiotic character, as can readily be verified by playing through one or two examples.
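The bias is easy to check by simulation. The short Python sketch below (ours, not the Pegasus program) plays both sides at random under the square numbering of Table 1 and prints the ratio of Nought wins to Cross wins, which comes out at roughly 2 : 1.

    import random

    LINES = [(1, 2, 3), (8, 0, 4), (7, 6, 5),   # rows
             (1, 8, 7), (2, 0, 6), (3, 4, 5),   # columns
             (1, 0, 5), (3, 0, 7)]              # diagonals

    def random_game():
        occupied = {}                            # square -> 'O' or 'X'
        for mark in 'OXOXOXOXO':                 # Nought ('O') plays first
            square = random.choice([s for s in range(9) if s not in occupied])
            occupied[square] = mark
            if any(all(occupied.get(s) == mark for s in line) for line in LINES):
                return mark
        return 'draw'

    results = [random_game() for _ in range(10000)]
    print(results.count('O') / results.count('X'))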
The reinforcement system differs from that of the matchbox machine in two main ways. First, the stage of play to which a move belongs is reckoned backwards from the end of the play. Thus, the opening move of a long play might stand as much as eight moves from the end, and hence receive relatively mild reinforcement, since the strength of reinforcement decays for moves successively further from the end. In a short play, Nought's opening move might be the fifth from the end, and be relatively strongly reinforced. This is not unreasonable, since the weight of evidence provided against an opening move by a defeat in five moves is obviously greater than that provided by a defeat in eight moves, and likewise for victories. Similar considerations apply to moves other than the opening move.

The second difference from the MENACE reinforcement system concerns the manner in which the move-probabilities are modified. The computer program handles these in the form of odds, where odds = p/(1 - p), p being the probability with which a given move is selected. The reinforcements are stored as multipliers. Consider a position from which two alternative moves can be made, and suppose that at some stage in the proceedings the probabilities attached to them are 2/5 and 3/5 respectively. Suppose that the first of these happens to be selected, and leads to a win after n moves. If the multiplier M_n were, say, 2, the odds on selection of this move in future would be converted from 2 : 3 to 4 : 3, and the corresponding probabilities of the two moves adjusted to 4/7 and 3/7. The multipliers for losing outcomes are the reciprocals of those for winning outcomes. Fig. 6 shows the values for the trial runs, and the function which was used to generate these values.
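In program terms the bookkeeping amounts to converting the selected move's probability into odds, scaling the odds by the appropriate multiplier, and renormalizing the remaining moves. The fragment below is a sketch under those assumptions; the names are ours and the routine is not a transcription of the original.

    def reinforce_odds(probabilities, selected, multiplier):
        # Scale the odds p/(1 - p) of the selected move by the multiplier M_n.
        p = probabilities[selected]
        odds = p / (1 - p) * multiplier
        new_p = odds / (1 + odds)
        # Rescale the other moves so that the probabilities still sum to one.
        scale = (1 - new_p) / (1 - p)
        return [new_p if i == selected else q * scale
                for i, q in enumerate(probabilities)]

    print(reinforce_odds([2/5, 3/5], 0, 2))      # gives 4/7 and 3/7, as in the example above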
Fig. 6.—The multipliers used for reinforcement in the trial runs of Figs. 7-9. A special case is shown of the general form

Fig. 7 shows the progress of Nought, learning against a random Cross. It is an enormously more difficult and time-consuming task to learn against random play than against play which is expert, or stereotyped in some other fashion. In the latter case only a restricted subtree of the whole game-tree has to be explored. For this reason the speed of learning shown cannot be compared with the performance of MENACE in the matchbox machine's maiden tournament. It will be seen that neither of the duplicate runs converged to an infallible standard of play. This is almost certainly because the reinforcements were too strong, so that the machine jumped to premature conclusions from which it never entirely rid itself. Fig. 8 shows Cross learning against a random Nought, and presents essentially similar features. Fig. 9 shows what happened when both players were allowed to learn. After a few hundred games the two sides were both producing near-expert play.

Fig. 7.—Trial runs with the computer program. Nought (the opening player) is learning, while Cross is playing at random throughout. The left-hand block shows the average results when both sides play at random

Fig. 8.—Cross is learning, Nought playing at random

Fig. 9.—Both sides are learning
Improvements to the program

These results are only preliminary. The program has now been modified so that the value of an outcome is assessed against the average outcome of past plays, instead of remaining fixed. It seems obvious that a draw, for example, should be rewarded when the usual outcome has been defeat, and punished when the usual outcome has been victory. Similar considerations apply to the values of winning and losing outcomes. The method which has been adopted is the following.

The value of a win is rated at +1, that of a draw at 0 and that of a defeat at -1, and a weighted average, μ, of past outcome values is formed using as weight a decay factor D (0 < D < 1). Thus the weight of the last outcome is D, that of the penultimate outcome is D^2, that of the antepenultimate outcome is D^3, and so on. The smaller the value chosen for D, the more weight is given to recent experience; as D approaches unity, increasing weight is given to the experience of the more remote past. In theory, a running calculation is made to evaluate μ after each play, and this is used to adjust the multipliers as shown in Table 3. The implementation in the current version of the program does not actually attain this ideal, but makes an approximation. The decay factor is only applied to the average of each set of one hundred plays.

Table 3
Adjustment of multipliers to a sliding origin
After the jth play μ is calculated as

    μ = Σ (i = 0, ..., j) D^(j - i + 1) V_i

where V_i is the outcome value of the ith play and V_0 is set equal to 0 (the value of a win is +1, of a draw is 0, and of a defeat is -1). D is the decay factor and M_n is the unadjusted multiplier for the nth stage of the game (see text).

    OUTCOME    REINFORCEMENT
    Won        R_n = M_n^(-μ+1)
    Drawn      R_n = M_n^(-μ)
    Lost       R_n = M_n^(-μ-1)

Our model of trial-and-error learning is thus based on three adjustable parameters, A, B and D (see Fig. 6 and Table 3). The next paper of this series will describe the effects upon learning performance which result from the systematic variation of these parameters.
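On the assumption that μ is the exponentially weighted sum just described, the running adjustment of Table 3 might be written as follows; the code is a sketch of ours rather than the program's own.

    def sliding_origin(outcomes, D):
        # Weight D for the last outcome, D^2 for the penultimate, and so on; V_0 = 0 adds nothing.
        j = len(outcomes)
        return sum(D ** (j - i + 1) * v for i, v in enumerate(outcomes, start=1))

    def reinforcement(M_n, outcome_value, mu):
        # R_n = M_n ** (V - mu), i.e. the Won/Drawn/Lost rows of Table 3 with V = +1, 0, -1.
        return M_n ** (outcome_value - mu)

    mu = sliding_origin([+1, 0, -1, 0], D=0.5)   # outcome values of past plays, most recent last
    print(reinforcement(2.0, +1, mu))            # adjusted multiplier for a win, taking M_n = 2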
Acknowledgements

The work described was supported by a Grant-in-Aid from the Royal Society. My thanks are also due to Messrs. Bruce Peebles and Co., Edinburgh, and to Mr. W. A. Sharpley personally, for making computing facilities available to me.
Reference
MICHIE, D. (1961). "Trial and Error," Science Survey, 1961, Harmondsworth: Penguin, Part 2, pp. 129-145.
Correspondence
To the Editor,
The Computer Journal.

Dear Sir,

"Direct coding of English language names", The Computer Journal, Vol. 6, No. 2 (July), p. 113

Surely the duplication in book titles tends to occur at the beginning. Could a solution be found for a short unambiguous code in referring to the last word, say the first and third, or better still the ultimate and antepenultimate?

    e.g.  Selections from Borrow                      SLWR
          Selections from Byron                       SLNR
          Short History . . . etc. . . . Augurelius   SOSI
          Short History . . . etc. . . . Augustus     SOST

Yours faithfully,
E. R. KERMODE