Nash Equilibria For Stochastic Games With Asymmetric Information-Part 1: Finite Games
Abstract
A model of stochastic games where multiple controllers jointly control the evolution of the state of a dynamic system but have access to different information about the state and action processes is considered. The asymmetry of information among the controllers makes it difficult to compute or characterize Nash equilibria. Using common information among the controllers, the game with asymmetric information is shown to be equivalent to another game with symmetric information. Further, under certain conditions, a Markov state is identified for the equivalent symmetric information game and its Markov perfect equilibria are characterized. This characterization provides a backward induction algorithm to find Nash equilibria of the original game with asymmetric information in pure or behavioral strategies. Each step of this algorithm involves finding Bayesian Nash equilibria of a one-stage Bayesian game. The class of Nash equilibria of the original game that can be characterized in this backward manner are named common information based Markov perfect equilibria.
I. INTRODUCTION
Stochastic games model situations where multiple players jointly control the evolution of the state of a stochastic dynamic system, with each player trying to minimize its own costs. Stochastic games where all players have perfect state observation are well-studied [1]-[5]. In such games, the symmetry of information among players implies that they all share the same uncertainty about the future states and future payoffs. However, a number of games arising in
asymmetric information games that are amenable to such a decomposition. The basic conceptual observation underlying our results is the following: the essential impediment to applying backward induction in asymmetric information games is the fact that a player's posterior beliefs about the system state and about other players' information may depend on the strategies used by the players in the past. If the nature of the system dynamics and the information structure of the game ensures that the players' posterior beliefs are strategy independent, then a backward induction argument is feasible. We formalize this conceptual argument in this paper.
   We first use the common information among the controllers to show that the game with
asymmetric information is equivalent to another game with symmetric information. Further, under
the assumption of strategy independence of posterior beliefs, we identify a Markov state for the
equivalent symmetric information game and characterize its Markov perfect equilibria using
backward induction arguments. This characterization provides a backward induction algorithm
to find Nash equilibria of the original game with asymmetric information. Each step of this
algorithm involves finding Bayesian Nash equilibria of a one-stage Bayesian game. The class
of Nash equilibria of the original game that can be characterized in this backward manner are
named common information based Markov perfect equilibria. For notational convenience, we
consider games with only two controllers. Our results extend to games with n > 2 controllers
in a straightforward manner.
   Our work is conceptually similar to the work in [16]. The authors in [16] considered a model of finite stochastic games with a discounted infinite-horizon cost function where each player has a privately observed state. Under the assumption that player i's belief about other players' states depends only on the current state of player i and does not depend on player i's strategy, [16] presented a recursive algorithm to compute Nash equilibria. Both our model and our main assumptions differ from those in [16].
A. Notation
   Random variables are denoted by upper case letters and their realizations by the corresponding lower case letters. Random vectors are denoted by upper case bold letters and their realizations by lower case bold letters. Unless otherwise stated, the state, action and observation processes are assumed to be vector valued. Subscripts are used as time indices. $X_{a:b}$ is shorthand for the vector $(X_a, X_{a+1}, \ldots, X_b)$; if $a > b$, then $X_{a:b}$ is empty. $\mathbb{P}(\cdot)$ is the probability of an event and $\mathbb{E}(\cdot)$ is the expectation of a random variable.
B. Organization
   The rest of this paper is organized as follows. We present our model of a stochastic game
with asymmetric information in Section II. We present several special cases of our model in
Section III. We prove our main results in Section IV. We extend our arguments to consider
behavioral strategies in Section V. We examine the importance of our assumptions in Section VI.
Finally, we conclude in Section VII.
There are two observation processes, $Y_t^1 \in \mathcal{Y}_t^1$ and $Y_t^2 \in \mathcal{Y}_t^2$, where
$$Y_t^i = h_t^i(X_t, W_t^i), \qquad i = 1, 2. \qquad (2)$$
   At any time t, the vector $I_t^i$ denotes the total data available to controller i at time t. The vector $I_t^i$ is a subset of the collection of potential observables of the system at time t, that is, $I_t^i \subset \{Y_{1:t}^1, Y_{1:t}^2, U_{1:t-1}^1, U_{1:t-1}^2\}$. We divide the total data into two components: private information $P_t^i$ and common information $C_t$. Thus, $I_t^i = (P_t^i, C_t)$. As their names suggest, the common information is available to both controllers whereas private information is available only to one controller. Clearly, this separation of information into private and common parts can always be done. In some cases, common or private information may even be empty. For example, if $I_t^1 = I_t^2 = \{Y_{1:t}^1, Y_{1:t}^2, U_{1:t-1}^1, U_{1:t-1}^2\}$, that is, if both controllers have access to all observations and actions, then $C_t = I_t^1 = I_t^2$ and $P_t^1 = P_t^2 = \emptyset$. On the other hand, if $I_t^i = Y_{1:t}^i$ for $i = 1, 2$, then $C_t = \emptyset$ and $P_t^i = I_t^i$. Games where all information is common to both controllers are referred to as symmetric information games.
   We denote the set of possible realizations of $P_t^i$ by $\mathcal{P}_t^i$ and the set of possible realizations of $C_t$ by $\mathcal{C}_t$. Controller i chooses action $U_t^i$ as a function of the total data $(P_t^i, C_t)$ available to it. Specifically, for each controller i,
$$U_t^i = g_t^i(P_t^i, C_t), \qquad (3)$$
where $g_t^i$, referred to as the control law at time t, can be any function of private and common information. The collection $g^i = (g_1^i, \ldots, g_T^i)$ is called the control strategy of controller i and the pair of control strategies for the two controllers $(g^1, g^2)$ is called a strategy profile. For a
given strategy profile, the overall cost of controller i is given as
$$J^i(g^1, g^2) := \mathbb{E}\left[\sum_{t=1}^T c^i(X_t, U_t^1, U_t^2)\right], \qquad (4)$$
where the expectation on the right hand side of (4) is with respect to the probability measure on
the state and action processes induced by the choice of strategies g1 , g2 on the left hand side of
(4). A strategy profile (g1 , g2 ) is called a Nash equilibrium if no controller can lower its total
expected cost by unilaterally changing its strategy, that is,
Remark 1 The system dynamics and the observation model (that is, the functions $f_t, h_t^1, h_t^2$ in (1) and (2)), the statistics of the primitive random variables, the information structure of the game and the cost functions are assumed to be common knowledge among the controllers.
Assumption 1 We assume that the common and private information evolve over time as follows:
   1) The common information $C_t$ is increasing with time, that is, $C_t \subset C_{t+1}$ for all t. Let $Z_{t+1} = C_{t+1} \setminus C_t$ be the increment in common information from time t to t+1. Thus, $C_{t+1} = \{C_t, Z_{t+1}\}$. Further,
$$Z_{t+1} = \zeta_{t+1}(P_t^1, P_t^2, U_t^1, U_t^2, Y_{t+1}^1, Y_{t+1}^2), \qquad (6)$$
       where $\zeta_{t+1}$ is a fixed transformation.
   2) The private information evolves as
$$P_{t+1}^i = \xi_{t+1}^i(P_t^i, U_t^i, Y_{t+1}^i), \qquad (7)$$
       where $\xi_{t+1}^i$, $i = 1, 2$, are fixed transformations.
   Equation (6) states that the increment in common information is a function of the new
variables generated between t and t + 1, that is, the actions taken at t and the observations made
at t + 1, and the old variables that were part of private information at time t. Equation (7)
implies that the evolution of private information at the two controllers is influenced by different
observations and actions.
   A key concept in our analysis is the belief about the state and the private informations
conditioned on the common information of both controllers. Formally, at any time t, given
the control laws from time 1 to t  1, we define the common information based conditional
belief as follows:
$$\pi_t(x_t, p_t^1, p_t^2) := \mathbb{P}^{g_{1:t-1}^1, g_{1:t-1}^2}(X_t = x_t, P_t^1 = p_t^1, P_t^2 = p_t^2 \mid C_t) \quad \text{for all } x_t, p_t^1, p_t^2, \qquad (8)$$
where we use the superscript $g_{1:t-1}^1, g_{1:t-1}^2$ in the RHS of (8) to emphasize that the conditional belief depends on the past control laws. Note that $\pi_t(\cdot, \cdot, \cdot)$ is a $|\mathcal{X}_t \times \mathcal{P}_t^1 \times \mathcal{P}_t^2|$-dimensional vector. We also define the partial functions of the control laws at time t as
$$\Gamma_t^1 := g_t^1(\cdot, C_t), \qquad \Gamma_t^2 := g_t^2(\cdot, C_t).$$
These partial functions are functions from the private information of a controller to its control
action. These are random functions whose realizations depend on the realization of the random
vector Ct . The following lemma describes the evolution of the common information based
conditional belief using these partial functions.
Lemma 1 Consider any choice of control laws $g_{1:t}^1, g_{1:t}^2$. Let $\pi_t$ be the realization of the common information based conditional belief at time t, let $c_t$ be the realization of the common information at time t, let $\gamma_t^i = g_t^i(\cdot, c_t)$, $i = 1, 2$, be the corresponding realizations of the partial functions at time t, and let $z_{t+1}$ be the realization of the increment in common information (see Assumption 1). Then, the realization of the conditional belief at time t+1 is given as
$$\pi_{t+1} = F_t(\pi_t, \gamma_t^1, \gamma_t^2, z_{t+1}), \qquad (9)$$
where $F_t$ is a fixed transformation that does not depend on the control strategies.
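To make the transformation $F_t$ in Lemma 1 concrete, the following is a minimal Python sketch of the Bayes-rule computation behind the belief update (it mirrors equation (40) in Appendix A). The `model` interface, with members `states`, `obs1`, `obs2`, `trans`, `obs`, `zeta`, `xi1`, `xi2`, is our own illustrative assumption and not notation from the paper.

```python
# Sketch of the belief update pi_{t+1} = F_t(pi_t, gamma1, gamma2, z_{t+1}).
from itertools import product
from collections import defaultdict

def belief_update(pi_t, gamma1, gamma2, z_next, model):
    """pi_t: dict mapping (x, p1, p2) -> probability.
    gamma1, gamma2: prescriptions, i.e. maps from private information to actions.
    z_next: realized increment in common information.
    model: transition/observation probabilities and the fixed maps
           zeta, xi1, xi2 of Assumption 1 (equations (6)-(7))."""
    joint = defaultdict(float)
    for (x, p1, p2), prob in pi_t.items():
        if prob == 0.0:
            continue
        u1, u2 = gamma1(p1), gamma2(p2)           # actions implied by the prescriptions
        for x_next in model.states:
            px = model.trans(x_next, x, u1, u2)    # P(x_{t+1} | x_t, u^1_t, u^2_t)
            for y1, y2 in product(model.obs1, model.obs2):
                py = model.obs(y1, y2, x_next)     # P(y^1_{t+1}, y^2_{t+1} | x_{t+1})
                if px * py == 0.0:
                    continue
                p1_next = model.xi1(p1, u1, y1)    # next private information, eq. (7)
                p2_next = model.xi2(p2, u2, y2)
                z = model.zeta(p1, p2, u1, u2, y1, y2)   # common increment, eq. (6)
                joint[(x_next, p1_next, p2_next, z)] += prob * px * py
    # condition on the realized increment z_next (Bayes' rule)
    norm = sum(v for (xn, q1, q2, z), v in joint.items() if z == z_next)
    return {(xn, q1, q2): v / norm
            for (xn, q1, q2, z), v in joint.items() if z == z_next}
```

Under the strategy-independence condition introduced next, the dependence of this update on `gamma1` and `gamma2` disappears for every increment that can actually occur.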
Assumption 2 (Strategy Independence of Beliefs) Consider any time t, any choice of control laws $g_{1:t-1}^1, g_{1:t-1}^2$, and any realization of common information $c_t$ that has a non-zero probability under $g_{1:t-1}^1, g_{1:t-1}^2$. Consider any other choice of control laws $\hat{g}_{1:t-1}^1, \hat{g}_{1:t-1}^2$ which also gives a non-zero probability to $c_t$. Then, we assume that
$$\mathbb{P}^{g_{1:t-1}^1, g_{1:t-1}^2}(X_t = x_t, P_t^1 = p_t^1, P_t^2 = p_t^2 \mid c_t) = \mathbb{P}^{\hat{g}_{1:t-1}^1, \hat{g}_{1:t-1}^2}(X_t = x_t, P_t^1 = p_t^1, P_t^2 = p_t^2 \mid c_t),$$
for all $x_t, p_t^1, p_t^2$.
   Equivalently, the evolution of the common information based conditional belief described in Lemma 1 depends only on the increment in common information, that is, (9) can be written as
$$\pi_{t+1} = F_t(\pi_t, z_{t+1}), \qquad (10)$$
where $F_t$ is a fixed transformation that does not depend on the control strategies.
      Before proceeding with further analysis, we first describe some instances of G1 where the
nature of the dynamic system and the private and common information implies that Assumptions
1 and 2 hold.
   Consider the instance of G1 where the common information at any time t is given as $C_t = \{Y_{1:t-1}^1, Y_{1:t-1}^2, U_{1:t-1}^1, U_{1:t-1}^2\}$ and the private information is given as $P_t^i = Y_t^i$. Thus, $Z_{t+1} := C_{t+1} \setminus C_t = \{Y_t^1, Y_t^2, U_t^1, U_t^2\}$. This information structure can be interpreted as the case where
all observations and actions are shared among controllers with one step delay.
Lemma 2 The game with one-step delayed sharing information pattern described above satisfies
Assumptions 1 and 2.                                                                                    
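For this one-step delayed sharing pattern, the fixed maps of Assumption 1 take a particularly simple form. The following small Python sketch (the function names are ours) spells them out.

```python
# One-step delayed sharing: P^i_t = Y^i_t and
# Z_{t+1} = {Y^1_t, Y^2_t, U^1_t, U^2_t}.
def zeta(p1_t, p2_t, u1_t, u2_t, y1_next, y2_next):
    # The variables that leave the private information and become common
    # at time t+1 are exactly last step's observations and actions.
    return (p1_t, p2_t, u1_t, u2_t)

def xi(p_t, u_t, y_next):
    # The new private information of controller i is just its
    # latest observation Y^i_{t+1}.
    return y_next
```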
   Similar to the one-step delay case, we consider the situation where all observations of controller
1 are available to controller 2 with no delay while the observations of controller 2 are available
to controller 1 with one-step delay. All past control actions are available to both controllers.
That is, in this case, $C_t = \{Y_{1:t}^1, Y_{1:t-1}^2, U_{1:t-1}^1, U_{1:t-1}^2\}$, $Z_{t+1} = \{Y_{t+1}^1, Y_t^2, U_t^1, U_t^2\}$, controller 1 has no private information and the private information of controller 2 is $P_t^2 = Y_t^2$.
Lemma 3 The game with one-directional-one-step delayed sharing information pattern de-
scribed above satisfies Assumptions 1 and 2.                                                                
   Case A: Consider the special case of G1 where the state dynamics are controlled only by controller 1, that is,
$$X_{t+1} = f_t(X_t, U_t^1, W_t^0).$$
The information structure is:
$$C_t = \{Y_{1:t}^1, Y_{1:t-d}^2, U_{1:t-1}^1\}, \qquad P_t^1 = \emptyset, \qquad P_t^2 = Y_{t-d+1:t}^2.$$
That is, controller 1's observations are available to controller 2 instantly while controller 2's observations are available to controller 1 with a delay of $d \geq 1$ time steps.
   Case B: Similar to the above case, consider the situation where the state dynamics are still controlled only by controller 1 but the information structure is:
$$C_t = \{Y_{1:t-1}^1, Y_{1:t-d}^2, U_{1:t-1}^1\}, \qquad P_t^1 = Y_t^1, \qquad P_t^2 = Y_{t-d+1:t}^2.$$
   Noiseless Observations: We now consider the information structure described in [18]. In this example, the state $X_t$ has three components: a global state $X_t^0$ and a local state $X_t^i$ for each controller. The state evolution is such that the dynamics depend on the current global state $X_t^0$ but not on the current local states. Each controller has access to the global state process $X_{1:t}^0$ and its current local state $X_t^i$. In addition, each controller knows the past actions of all controllers. Thus, the common and private information in this case are:
$$C_t = \{X_{1:t}^0, U_{1:t-1}^1, U_{1:t-1}^2\}, \qquad P_t^i = \{X_t^i\}.$$
It is easy to verify that the common information based belief depends only on the statistics of $W_t^1$ and is therefore independent of control laws. Thus, Assumption 2 also holds for this case.
   Noisy Observations: We can also consider a modification of the above scenario where both controllers have a common, noisy observation $Y_t^0 = h_t(X_t^0, W_t^1)$ of the global state. That is,
$$C_t = \{Y_{1:t}^0, U_{1:t-1}^1, U_{1:t-1}^2\}, \qquad P_t^i = \{X_t^i\}, \qquad Z_{t+1} = \{Y_{t+1}^0, U_t^1, U_t^2\}.$$
Lemma 5 The game with the information pattern described above satisfies Assumptions 1 and
2.                                                                                                                  
   Consider a state process whose evolution does not depend on the control actions, that is, the system state evolves as
$$X_{t+1} = f_t(X_t, W_t^0). \qquad (13)$$
The private information evolves as
$$P_{t+1}^i = \xi_{t+1}^i(P_t^i, Y_{t+1}^i), \qquad (15)$$
where $\xi_{t+1}^i$, $i = 1, 2$, are fixed transformations.
Note that while control actions do not affect the state evolution, they still affect the costs.
Lemma 6 The game G1 with an uncontrolled state process described above satisfies Assump-
tions 1 and 2.                                                                                                     
   Consider the case when all observations and actions are available to both controllers, that is, $I_t^1 = I_t^2 = C_t = \{Y_{1:t}^1, Y_{1:t}^2, U_{1:t-1}^1, U_{1:t-1}^2\}$ and there is no private information. The common information based belief in this case is $\pi_t(x_t) = \mathbb{P}^{g_{1:t-1}^1, g_{1:t-1}^2}(X_t = x_t \mid y_{1:t}^1, y_{1:t}^2, u_{1:t-1}^1, u_{1:t-1}^2)$. This $\pi_t$ is the same as the information state in centralized stochastic control, which is known to be control strategy independent and which satisfies an update equation of the form required in Assumption 2 [19]. A related case with perfect state observations is the situation where $I_t^1 = I_t^2 = X_{1:t}$.
   A combination of the previous two scenarios is the situation when the state $X_t$ consists of two independent components: a controlled component $X_t^a$ and an uncontrolled component $X_t^b$. Both components are observed through noisy channels. The observations about the controlled state as well as the past actions are common to both controllers, whereas the information about the uncontrolled state satisfies the model of Section III-E. The common information based conditional belief can then be factored into two independent components, each of which satisfies an update equation of the form required by Assumption 2.
   Our goal in this section is to show that under Assumptions 1 and 2, a class of equilibria of
the game G1 can be characterized in a backward inductive manner that resembles the backward
inductive characterization of Markov perfect equilibria of symmetric information games with
perfect state observation. However, in order to do so, we have to view our asymmetric information
game as a symmetric information game by introducing virtual players that make decisions
based on the common information. This section describes this change of perspective and how it
can be used to characterize a class of Nash equilibria.
   We reconsider the model of game G1. We assume that controller i is replaced by a virtual player i (VP i). The system operates as follows: At time t, the data available to each virtual player is the common information $C_t$. Virtual player i selects a function $\Gamma_t^i$ from $\mathcal{P}_t^i$ to $\mathcal{U}_t^i$ according to a decision rule $\chi_t^i$,
$$\Gamma_t^i = \chi_t^i(C_t).$$
Note that under a given decision rule $\chi_t^i$, $\Gamma_t^i$ is a random function since $C_t$ is a random vector. We will use $\gamma_t^i$ to denote a realization of $\Gamma_t^i$. We will refer to $\Gamma_t^i$ as the prescription selected by virtual player i at time t. Once the virtual player has chosen $\Gamma_t^i$, a control action $U_t^i = \Gamma_t^i(P_t^i)$ is applied to the system. $\chi^i := (\chi_1^i, \chi_2^i, \ldots, \chi_T^i)$ is called the strategy of virtual player i. The
total cost of virtual player i is given as
$$J^i(\chi^1, \chi^2) := \mathbb{E}\left[\sum_{t=1}^T c^i(X_t, U_t^1, U_t^2)\right], \qquad (16)$$
where the expectation on the right hand side of (16) is with respect to the probability measure on the state and action processes induced by the choice of strategies $\chi^1, \chi^2$ on the left hand side of (16). We refer to the game among the virtual players as game G2.
Remark 3 In case there is no private information, the function $\Gamma_t^i$ from $\mathcal{P}_t^i$ to $\mathcal{U}_t^i$ is interpreted as simply a value in the set $\mathcal{U}_t^i$.
Theorem 1 Let $(g^1, g^2)$ be a Nash equilibrium of game G1. Define $\chi^i$ for $i = 1, 2$, $t = 1, 2, \ldots, T$ as
$$\chi_t^i(c_t) := g_t^i(\cdot, c_t), \qquad (17)$$
for each possible realization $c_t$ of common information at time t. Then $(\chi^1, \chi^2)$ is a Nash equilibrium of game G2.
   Proof: It is clear that using (17), any controller strategy profile in game G1 can be transformed to a corresponding virtual player strategy profile in game G2 without altering the behavior of the dynamic system and, in particular, the values of the expected costs. If a virtual player can reduce its costs by unilaterally deviating from $\chi^i$, then such a deviation must also exist for the corresponding controller in G1. Therefore, equilibrium of the controllers' strategies implies equilibrium of the corresponding virtual players' strategies. The converse can be shown using similar arguments.
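The transformation (17) is literally a partial application of the control law. A short Python sketch (names are ours) makes the correspondence explicit.

```python
# Sketch of (17): from a control law g^i_t(p, c) to the virtual player's
# decision rule chi^i_t, which maps common information to a prescription.
def make_decision_rule(g_i_t):
    def chi_i_t(c_t):
        return lambda p_i_t: g_i_t(p_i_t, c_t)   # the prescription gamma^i_t = g^i_t(., c_t)
    return chi_i_t
```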
   The game between the virtual players is a symmetric information game since they both make
their decisions based only on the common information Ct . In the next section, we identify a
Markov state for this symmetric information game and characterize Markov perfect equilibria
for this game.
   We want to establish that the common information based conditional belief $\pi_t$ (defined in (8)) can serve as a Markov state for the game G2. Firstly, note that because of Assumption 2, $\pi_t$ depends only on the common information $C_t$, and since both virtual players know the common information, the belief $\pi_t$ is common knowledge among them. The following lemma shows that $\pi_t$ evolves as a controlled Markov process.

Lemma 7 For any strategies of the virtual players and for any time t,
$$\mathbb{P}(\pi_{t+1} \mid c_t, \pi_{1:t}, \gamma_{1:t}^1, \gamma_{1:t}^2) = \mathbb{P}(\pi_{t+1} \mid \pi_{1:t}, \gamma_{1:t}^1, \gamma_{1:t}^2) = \mathbb{P}(\pi_{t+1} \mid \pi_t, \gamma_t^1, \gamma_t^2). \qquad (19)$$
Lemma 8 If virtual player i is using a decision strategy that selects prescriptions only as a function of the belief $\pi_t$, that is,
$$\Gamma_t^i = \psi_t^i(\pi_t), \qquad t = 1, \ldots, T,$$
then virtual player j can also choose its prescriptions only as a function of the belief $\pi_t$ without any loss of performance.
   Given a Markov perfect equilibrium of G2, we can construct a corresponding Nash equilibrium
of game G1 using Theorem 1. We refer to the class of Nash equilibria of G1 that can be
constructed from the Markov perfect equilibria of G2 as the common information based Markov
perfect equilibria of G1.
Definition 2 A strategy profile $(g^1, g^2)$ of the form $U_t^i = g_t^i(P_t^i, \pi_t)$, $i = 1, 2$, is called a common information based Markov perfect equilibrium for game G1 if the corresponding strategies of game G2 defined as
$$\psi_t^i(\pi_t) := g_t^i(\cdot, \pi_t), \qquad i = 1, 2, \quad t = 1, \ldots, T,$$
form a Markov perfect equilibrium of game G2.
The following theorem provides a necessary and sufficient condition for a strategy profile to be
a Markov perfect equilibrium of G2.
Theorem 2 Consider a strategy pair $(\psi^1, \psi^2)$ such that at each time t, the strategies select prescriptions based only on the realization of the common information based belief $\pi_t$, that is, $\Gamma_t^i = \psi_t^i(\pi_t)$, $i = 1, 2$. Then $(\psi^1, \psi^2)$ is a Markov perfect equilibrium of G2 if and only if the following conditions hold:
   1) For each possible realization $\pi$ of $\pi_T$, define the value function for virtual player 1:
$$V_T^1(\pi) := \min_{\tilde{\gamma}^1} \mathbb{E}\big[c^1(X_T, \Gamma_T^1(P_T^1), \Gamma_T^2(P_T^2)) \,\big|\, \pi_T = \pi, \Gamma_T^1 = \tilde{\gamma}^1, \Gamma_T^2 = \psi_T^2(\pi)\big]. \qquad (20)$$
       Then, $\psi_T^1(\pi)$ must be a minimizing $\tilde{\gamma}^1$ in the definition of $V_T^1(\pi)$. Similarly, define the value function for virtual player 2:
$$V_T^2(\pi) := \min_{\tilde{\gamma}^2} \mathbb{E}\big[c^2(X_T, \Gamma_T^1(P_T^1), \Gamma_T^2(P_T^2)) \,\big|\, \pi_T = \pi, \Gamma_T^1 = \psi_T^1(\pi), \Gamma_T^2 = \tilde{\gamma}^2\big]. \qquad (21)$$
       Then, $\psi_T^2(\pi)$ must be a minimizing $\tilde{\gamma}^2$ in the definition of $V_T^2(\pi)$.
   2) For $t = T-1, \ldots, 1$ and for each possible realization $\pi$ of $\pi_t$, define recursively the value functions for virtual player 1:
$$V_t^1(\pi) := \min_{\tilde{\gamma}^1} \mathbb{E}\big[c^1(X_t, \Gamma_t^1(P_t^1), \Gamma_t^2(P_t^2)) + V_{t+1}^1(\pi_{t+1}) \,\big|\, \pi_t = \pi, \Gamma_t^1 = \tilde{\gamma}^1, \Gamma_t^2 = \psi_t^2(\pi)\big], \qquad (22)$$
       and for virtual player 2:
$$V_t^2(\pi) := \min_{\tilde{\gamma}^2} \mathbb{E}\big[c^2(X_t, \Gamma_t^1(P_t^1), \Gamma_t^2(P_t^2)) + V_{t+1}^2(\pi_{t+1}) \,\big|\, \pi_t = \pi, \Gamma_t^1 = \psi_t^1(\pi), \Gamma_t^2 = \tilde{\gamma}^2\big], \qquad (23)$$
       where $\pi_{t+1} = F_t(\pi, Z_{t+1})$. Then, $\psi_t^1(\pi)$ must be a minimizing $\tilde{\gamma}^1$ in the definition of $V_t^1(\pi)$ and $\psi_t^2(\pi)$ must be a minimizing $\tilde{\gamma}^2$ in the definition of $V_t^2(\pi)$.
Remark 4 Note that if Assumption 2 were not true, then according to Lemma 1, $\pi_{t+1} = F_t(\pi_t, \gamma_t^1, \gamma_t^2, Z_{t+1})$. In this case, the entire prescription $\tilde{\gamma}^i$ will affect the second term in the expectation in (22)-(23), and we could not hope to carry out the minimization over $\tilde{\gamma}^i$ by separately minimizing over $\tilde{\gamma}^i(p^i)$ for all possible $p^i$.
      We can now describe a backward inductive procedure to find a Markov perfect equilibrium
of game G2 using a sequence of one-stage Bayesian games. We proceed as follows:
Algorithm 1:
   1) At the terminal time T, for each realization $\pi$ of the common information based belief at time T, we define a one-stage Bayesian game $SG_T(\pi)$ where
         a) The probability distribution on $(X_T, P_T^1, P_T^2)$ is $\pi$.
         b) Agent i observes $P_T^i$ and chooses action $U_T^i$, $i = 1, 2$. (Agent i can be thought of as the same as controller i; we use a different name here in order to maintain the distinction between games G1 and $SG_T(\pi)$.)
         c) Agent i's cost is $c^i(X_T, U_T^1, U_T^2)$, $i = 1, 2$.
       A Bayesian Nash equilibrium of this game is a pair of strategies $\gamma^i$, $i = 1, 2$, which map the agents' observations $P_T^i$ to their actions $U_T^i$ such that for any realization $p^i$, $\gamma^i(p^i)$ is a solution of the minimization problem
$$\min_{u^i} \mathbb{E}^{\pi}[c^i(X_T, u^i, \gamma^j(P_T^j)) \mid P_T^i = p^i],$$
       where $j \neq i$ and the superscript $\pi$ denotes that the expectation is with respect to the distribution $\pi$. (See [21], [22] for a definition of Bayesian Nash equilibrium.) If a Bayesian Nash equilibrium $\gamma^1, \gamma^2$ of $SG_T(\pi)$ exists, denote the corresponding expected equilibrium costs as $V_T^i(\pi)$, $i = 1, 2$, and define $\psi_T^i(\pi) := \gamma^i$, $i = 1, 2$.
   2) At time t < T, for each realization $\pi$ of the common information based belief at time t, we define the one-stage Bayesian game $SG_t(\pi)$ where
         a) The probability distribution on $(X_t, P_t^1, P_t^2)$ is $\pi$.
         b) Agent i observes $P_t^i$ and chooses action $U_t^i$, $i = 1, 2$.
         c) Agent i's cost is $c^i(X_t, U_t^1, U_t^2) + V_{t+1}^i(F_t(\pi, Z_{t+1}))$, $i = 1, 2$.
       Recall that the belief for the next time step is $\pi_{t+1} = F_t(\pi, Z_{t+1})$ and $Z_{t+1}$ is given by (6). A Bayesian Nash equilibrium of this game is a pair of strategies $\gamma^i$, $i = 1, 2$, for the agents which map their observation $P_t^i$ to their action $U_t^i$ such that for any realization $p^i$, $\gamma^i(p^i)$ is a solution of the minimization problem
$$\min_{u^i} \mathbb{E}^{\pi}[c^i(X_t, u^i, \gamma^j(P_t^j)) + V_{t+1}^i(F_t(\pi, Z_{t+1})) \mid P_t^i = p^i],$$
       where $j \neq i$. If a Bayesian Nash equilibrium $\gamma^1, \gamma^2$ of $SG_t(\pi)$ exists, denote the corresponding expected equilibrium costs as $V_t^i(\pi)$, $i = 1, 2$, and define $\psi_t^i(\pi) := \gamma^i$, $i = 1, 2$.
Theorem 3 The strategies $\psi^1, \psi^2$ defined by the backward inductive procedure of Algorithm 1 (whenever the required Bayesian Nash equilibria exist) form a Markov perfect equilibrium of game G2, and the corresponding strategies $g_t^i(\cdot, \pi_t) := \psi_t^i(\pi_t)$, $i = 1, 2$, form a common information based Markov perfect equilibrium of game G1.
   Proof: To prove the result, we just need to observe that the strategies defined by the backward induction procedure of Algorithm 1 satisfy the conditions of Theorem 2 and hence form a Markov perfect equilibrium of game G2. See Appendix E for a more detailed proof.
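The backward induction of Algorithm 1 only ever needs a solver for one finite one-stage Bayesian game at a time. The following Python sketch (the `model` and `stage_cost` interfaces are our own assumptions) finds a pure-strategy Bayesian Nash equilibrium of a single stage game $SG_t(\pi)$ by enumerating prescription pairs and checking the best-response conditions; in Algorithm 1 it would be invoked for every reachable belief, from $t = T$ down to $t = 1$.

```python
# Sketch of one step of Algorithm 1 (pure strategies only).
from itertools import product

def pure_prescriptions(private_set, action_set):
    """All maps from private information to actions, represented as dicts."""
    return [dict(zip(private_set, choice))
            for choice in product(action_set, repeat=len(private_set))]

def solve_stage(pi, stage_cost, model):
    """stage_cost(i, pi, g1, g2, p): agent i's expected cost given its private
    information p when prescriptions g1, g2 are used; for t < T it is assumed
    to already include the continuation value V^i_{t+1}."""
    G1 = pure_prescriptions(model.P1, model.U1)
    G2 = pure_prescriptions(model.P2, model.U2)
    for g1, g2 in product(G1, G2):
        br1 = all(stage_cost(1, pi, g1, g2, p) ==
                  min(stage_cost(1, pi, {**g1, p: u}, g2, p) for u in model.U1)
                  for p in model.P1)
        br2 = all(stage_cost(2, pi, g1, g2, p) ==
                  min(stage_cost(2, pi, g1, {**g2, p: u}, p) for u in model.U2)
                  for p in model.P2)
        if br1 and br2:
            return g1, g2          # a pure-strategy Bayesian Nash equilibrium
    return None                    # none exists; see the behavioral-strategy case below
```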
   We consider an example of game G1 where the (scalar) state $X_t$ and the (scalar) control actions $U_t^1, U_t^2$ take values in the set $\{0, 1\}$. The state evolves as a controlled Markov chain depending on the two control actions according to the state transition probabilities:
$$\mathbb{P}\{X_{t+1} = 0 \mid X_t = 0, U_t^1 = U_t^2\} = \tfrac{1}{4},$$
$$\mathbb{P}\{X_{t+1} = 0 \mid X_t = 1, U_t^1 = U_t^2\} = \tfrac{1}{2},$$
$$\mathbb{P}\{X_{t+1} = 0 \mid X_t = 0, U_t^1 \neq U_t^2\} = \mathbb{P}\{X_{t+1} = 0 \mid X_t = 1, U_t^1 \neq U_t^2\} = \tfrac{2}{5}. \qquad (24)$$
The initial state is assumed to be equiprobable, i.e., $\mathbb{P}\{X_1 = 0\} = \mathbb{P}\{X_1 = 1\} = 1/2$. The first controller observes the state perfectly, while the second controller observes the state through a binary symmetric channel with probability of error 1/3. Thus,
$$Y_t^1 = X_t, \qquad Y_t^2 = \begin{cases} X_t & \text{with probability } 2/3, \\ 1 - X_t & \text{with probability } 1/3. \end{cases}$$
The controllers share the observations and actions with a delay of one time step. Thus, the common information and private informations at time step t are given as
$$C_t = \{X_{1:t-1}, Y_{1:t-1}^2, U_{1:t-1}^1, U_{1:t-1}^2\}, \qquad P_t^1 = \{X_t\}, \qquad P_t^2 = \{Y_t^2\}.$$
In the equivalent game with virtual players, the decision of the i-th virtual player, $\Gamma_t^i$, is a function that maps $\mathcal{Y}_t^i := \{0, 1\}$ to $\mathcal{U}_t^i := \{0, 1\}$.
   The common information based belief for this case is the belief on $(X_t, Y_t^2)$ given the common information $x_{1:t-1}, y_{1:t-1}^2, u_{1:t-1}^1, u_{1:t-1}^2$, that is,
$$\pi_t(x, y^2) = \mathbb{P}(X_t = x \mid X_{t-1} = x_{t-1}, U_{t-1}^1 = u_{t-1}^1, U_{t-1}^2 = u_{t-1}^2)\left(\tfrac{2}{3}\,1_{\{y^2 = x\}} + \tfrac{1}{3}\,1_{\{y^2 \neq x\}}\right). \qquad (25)$$
The above equation implies that the distribution $\pi_t$ is completely specified by $x_{t-1}, u_{t-1}^1, u_{t-1}^2$. That is,
$$\pi_t = F_{t-1}(x_{t-1}, u_{t-1}^1, u_{t-1}^2). \qquad (26)$$
(Note that $F_{t-1}$ is a vector-valued function whose components are given by (25) for all $x, y^2 \in \{0, 1\}$.) The cost functions $c^i(x, u^1, u^2)$ for various values of state and actions are described by the following matrices:

$x_t = 0$:
                    $u_t^2 = 0$    $u_t^2 = 1$
   $u_t^1 = 0$:       1, 0           0, 1
   $u_t^1 = 1$:       0, 1           0, 0

$x_t = 1$:
                    $u_t^2 = 0$    $u_t^2 = 1$
   $u_t^1 = 0$:       0, 0           1, 1
   $u_t^1 = 1$:       0, 1           1, 0

where the rows in each matrix correspond to controller 1's actions, the columns correspond to controller 2's actions, the first entry in each element of the cost matrix is controller 1's cost, and the second entry is controller 2's cost.
   Applying Algorithm 1:
We now use Algorithm 1 for a two-stage version of the game described above.
   1) At the terminal time step T = 2, for a realization $\pi$ of the common information based belief at time 2, we define a one-stage game $SG_2(\pi)$ where
         a) The probability distribution on $(X_2, Y_2^2)$ is $\pi$.
         b) Agent 1 observes $X_2$ and selects an action $U_2^1$; Agent 2 observes $Y_2^2$ and selects $U_2^2$.
         c) Agent i's cost is $c^i(X_2, U_2^1, U_2^2)$, given by the matrices defined above.
       A Bayesian Nash equilibrium of this game is a pair of strategies $\gamma^1, \gamma^2$ such that
          • For $x = 0, 1$, $\gamma^1(x)$ is a solution of $\min_{u^1} \mathbb{E}^{\pi}[c^1(X_2, u^1, \gamma^2(Y_2^2)) \mid X_2 = x]$.
          • For $y = 0, 1$, $\gamma^2(y)$ is a solution of $\min_{u^2} \mathbb{E}^{\pi}[c^2(X_2, \gamma^1(X_2), u^2) \mid Y_2^2 = y]$.
       It is easy to verify that $\gamma^1(x) = 1$ for $x = 0, 1$ and $\gamma^2(y) = 1$ for $y = 0, 1$ is a Bayesian Nash equilibrium of $SG_2(\pi)$. The expected equilibrium cost for agent i is
$$V_2^i(\pi) = \mathbb{E}^{\pi}[c^i(X_2, 1, 1)] = \begin{cases} \pi(X_2 = 1) & \text{for } i = 1, \\ 0 & \text{for } i = 2, \end{cases} \qquad (27)$$
       where $\pi(X_2 = 1)$ is the probability that $X_2 = 1$ under the distribution $\pi$. From the above Bayesian equilibrium strategies, we define the virtual players' decision rules for time T = 2 as $\psi_2^i(\pi) = \gamma^i$, $i = 1, 2$.
   2) At time t = 1, since there is no common information, the common information based belief $\pi_1$ is simply the prior belief on $(X_1, Y_1^2)$. Since the initial state is equally likely to be 0 or 1,
$$\pi_1(x, y^2) = \tfrac{1}{2}\left(\tfrac{2}{3}\,1_{\{y^2 = x\}} + \tfrac{1}{3}\,1_{\{y^2 \neq x\}}\right).$$
       We define the one-stage Bayesian game $SG_1(\pi_1)$ where
         a) The probability distribution on $(X_1, Y_1^2)$ is $\pi_1$.
         b) Agent 1 observes $X_1$ and selects an action $U_1^1$; Agent 2 observes $Y_1^2$ and selects $U_1^2$.
         c) Agent i's cost is given by $c^i(X_1, U_1^1, U_1^2) + V_2^i(F_1(X_1, U_1^1, U_1^2))$, where $F_1$, defined by (26) and (25), gives the common information belief at time 2 as a function of $X_1, U_1^1, U_1^2$, and $V_2^i$, defined in (27), gives the expected equilibrium cost for time 2 as a function of the common information belief at time 2.
             For example, if $U_1^1 \neq U_1^2$, then (25), (26) and (27) imply $V_2^1(F_1(X_1, U_1^1, U_1^2)) = 3/5$. Similarly, if $U_1^1 = U_1^2$, then (25), (26) and (27) imply $V_2^1(F_1(0, U_1^1, U_1^2)) = 3/4$ and $V_2^1(F_1(1, U_1^1, U_1^2)) = 1/2$. Also, (27) implies that $V_2^2$ is identically 0.
       A Bayesian Nash equilibrium of this game is a pair of strategies $\gamma^1, \gamma^2$ such that
          • For $x = 0, 1$, $\gamma^1(x)$ is a solution of
$$\min_{u^1} \mathbb{E}^{\pi_1}[c^1(X_1, u^1, \gamma^2(Y_1^2)) + V_2^1(F_1(X_1, u^1, \gamma^2(Y_1^2))) \mid X_1 = x].$$
          • For $y = 0, 1$, $\gamma^2(y)$ is a solution of
$$\min_{u^2} \mathbb{E}^{\pi_1}[c^2(X_1, \gamma^1(X_1), u^2) + V_2^2(F_1(X_1, \gamma^1(X_1), u^2)) \mid Y_1^2 = y].$$
       It is easy to verify that $\gamma^1(0) = 1$, $\gamma^1(1) = 0$ and $\gamma^2(0) = 1$, $\gamma^2(1) = 0$ is a Bayesian Nash equilibrium of $SG_1(\pi_1)$, with expected equilibrium costs $V_1^1(\pi_1) = 47/60$ and $V_1^2(\pi_1) = 1/3$. From the above Bayesian equilibrium strategies, we define the virtual players' decision rules for time t = 1 as $\psi_1^i(\pi_1) = \gamma^i$, $i = 1, 2$.
       Since we now know the equilibrium decision rules $\psi_t^i$, $i = 1, 2$, $t = 1, 2$, for the virtual players, we can construct the corresponding control laws for the controllers using Theorem 3. Thus, a common information based Markov perfect equilibrium for the game in
       this example is given by the strategies:
$$g_1^1(x_1, \pi_1) = \begin{cases} 1 & \text{if } x_1 = 0, \\ 0 & \text{if } x_1 = 1, \end{cases} \qquad g_1^2(y_1^2, \pi_1) = \begin{cases} 1 & \text{if } y_1^2 = 0, \\ 0 & \text{if } y_1^2 = 1, \end{cases}$$
and
$$g_2^1(x_2, \pi_2) = 1, \qquad g_2^2(y_2^2, \pi_2) = 1,$$
for all realizations $x_2, y_2^2$ and $\pi_2$.
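The numbers in this example are easy to check numerically. The short Python script below (variable names are ours) encodes the model data (24)-(27), verifies the stage-1 best-response conditions for the strategies given above, and reproduces $V_1^1(\pi_1) = 47/60$ and $V_1^2(\pi_1) = 1/3$.

```python
from fractions import Fraction as Fr

# Transition kernel P(X_{t+1} = 0 | X_t = x, U^1 = u1, U^2 = u2), from (24).
def p_next0(x, u1, u2):
    if u1 != u2:
        return Fr(2, 5)
    return Fr(1, 4) if x == 0 else Fr(1, 2)

# Cost matrices c^i(x, u1, u2).
c1 = {(0,0,0):1, (0,0,1):0, (0,1,0):0, (0,1,1):0,
      (1,0,0):0, (1,0,1):1, (1,1,0):0, (1,1,1):1}
c2 = {(0,0,0):0, (0,0,1):1, (0,1,0):1, (0,1,1):0,
      (1,0,0):0, (1,0,1):1, (1,1,0):1, (1,1,1):0}

# Stage-2 equilibrium: both agents play action 1, so by (27)
# V_2^1(pi) = pi(X_2 = 1) and V_2^2(pi) = 0; by (26) the belief at t = 2
# is determined by (x1, u1, u2).
def V21(x1, u1, u2):
    return 1 - p_next0(x1, u1, u2)

# Binary symmetric channel for Y^2: correct with probability 2/3.
def p_y2(y, x):
    return Fr(2, 3) if y == x else Fr(1, 3)

# Candidate stage-1 equilibrium from the text.
gamma1 = lambda x: 1 - x
gamma2 = lambda y: 1 - y

def cost1(x, u1):   # agent 1's expected total cost given X_1 = x
    return sum(p_y2(y, x) * (c1[(x, u1, gamma2(y))] + V21(x, u1, gamma2(y)))
               for y in (0, 1))

def cost2(y, u2):   # agent 2's expected cost given Y_1^2 = y (V_2^2 is 0)
    post = {x: Fr(1, 2) * p_y2(y, x) for x in (0, 1)}
    tot = sum(post.values())
    return sum(post[x] / tot * c2[(x, gamma1(x), u2)] for x in (0, 1))

assert all(cost1(x, gamma1(x)) == min(cost1(x, 0), cost1(x, 1)) for x in (0, 1))
assert all(cost2(y, gamma2(y)) == min(cost2(y, 0), cost2(y, 1)) for y in (0, 1))

V11 = sum(Fr(1, 2) * cost1(x, gamma1(x)) for x in (0, 1))
V12 = sum(Fr(1, 2) * p_y2(y, x) * cost2(y, gamma2(y)) for x in (0, 1) for y in (0, 1))
print(V11, V12)   # 47/60 1/3
```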
   The results of Theorems 2 and 3 provide sufficient conditions for a pair of strategies to be an equilibrium of game G2. Neither of these results addresses the question of existence of equilibrium. In particular, the result of Theorem 3 states that the (pure strategy) Bayesian Nash equilibria of the one-stage Bayesian games $SG_t(\pi)$, $t = T, \ldots, 1$, may be used to find a Markov perfect equilibrium of game G2 and hence a common information based Markov perfect equilibrium of G1. However, the games $SG_t(\pi)$ may not have any (pure strategy) Bayesian Nash equilibrium.
   As is common in finite games, we need to allow for behavioral strategies in order to ensure the existence of equilibria. Toward that end, we now reconsider the model of game G1. At each time t, each controller is now allowed to select a probability distribution $D_t^i$ over the (finite) set of actions $\mathcal{U}_t^i$, $i = 1, 2$, according to a control law of the form:
$$D_t^i = g_t^i(P_t^i, C_t).$$
The rest of the model is the same as in Section II. We denote the set of probability distributions over $\mathcal{U}_t^i$ by $\Delta(\mathcal{U}_t^i)$.
   Following exactly the same arguments as in Section IV, we can define an equivalent game where virtual players select prescriptions that are functions from the set of private information $\mathcal{P}_t^i$ to the set $\Delta(\mathcal{U}_t^i)$ and establish the result of Theorem 1 for this case. A sufficient condition for Markov perfect equilibrium of this game is given by Theorem 2, where $\tilde{\gamma}^i$ are now interpreted as mappings from $\mathcal{P}_t^i$ to $\Delta(\mathcal{U}_t^i)$ (instead of mappings from $\mathcal{P}_t^i$ to $\mathcal{U}_t^i$). Given a Markov perfect equilibrium $(\psi^1, \psi^2)$ of the virtual players' game, the equivalent strategies $g_t^i(\cdot, \pi_t) := \psi_t^i(\pi_t)$ form a common information based Markov perfect equilibrium of game G1 in behavioral strategies.
   Further, we can follow a backward induction procedure identical to the one used in section IV-C
(Algorithm 1), but now consider mixed strategy Bayesian Nash equilibria of the one-stage
Bayesian games SGt () constructed there. We proceed as follows:
Algorithm 2:
   1) At the terminal time T, for each realization $\pi$ of the common information based belief at time T, consider the one-stage Bayesian game $SG_T(\pi)$ defined in Algorithm 1. A mixed strategy $\gamma^i$ for the game $SG_T(\pi)$ is a mapping from $\mathcal{P}_T^i$ to $\Delta(\mathcal{U}_T^i)$. A mixed strategy Bayesian Nash equilibrium of this game is a pair of strategies $\gamma^1, \gamma^2$ such that for any realization $p^i$, $\gamma^i(p^i)$ assigns zero probability to any action that is not a solution of the minimization problem
$$\min_{u^i} \mathbb{E}^{\pi}[c^i(X_T, u^i, U_T^j) \mid P_T^i = p^i],$$
       where $U_T^j$ is distributed according to $\gamma^j(P_T^j)$. Since $SG_T(\pi)$ is a finite Bayesian game, a mixed strategy equilibrium is guaranteed to exist [22]. For any mixed strategy Bayesian Nash equilibrium $\gamma^1, \gamma^2$ of $SG_T(\pi)$, denote the expected equilibrium costs as $V_T^i(\pi)$ and define $\psi_T^i(\pi) := \gamma^i$, $i = 1, 2$.
   2) At time t < T, for each realization $\pi$ of the common information based belief at time t, consider the one-stage Bayesian game $SG_t(\pi)$ defined in Algorithm 1. A mixed strategy Bayesian Nash equilibrium of this game is a pair of strategies $\gamma^1, \gamma^2$ such that for any realization $p^i$, $\gamma^i(p^i)$ assigns zero probability to any action that is not a solution of the minimization problem
$$\min_{u^i} \mathbb{E}^{\pi}[c^i(X_t, u^i, U_t^j) + V_{t+1}^i(F_t(\pi, Z_{t+1})) \mid P_t^i = p^i],$$
       where $U_t^j$ is distributed according to $\gamma^j(P_t^j)$ and $Z_{t+1}$ is the increment in common information generated according to (6), (2) and (1) when control action $U_t^i = u^i$ is used and $U_t^j$ is distributed according to $\gamma^j(P_t^j)$. Since $SG_t(\pi)$ is a finite Bayesian game, a mixed strategy equilibrium is guaranteed to exist [22]. For any mixed strategy Bayesian Nash equilibrium $\gamma^1, \gamma^2$ of $SG_t(\pi)$, denote the expected equilibrium costs as $V_t^i(\pi)$ and define $\psi_t^i(\pi) := \gamma^i$, $i = 1, 2$.
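For implementation, the one-stage Bayesian games in Algorithm 2 are typically handled by converting them to their induced normal form, after which any finite bimatrix-game solver (Lemke-Howson, support enumeration, etc.) returns the required mixed equilibrium. The sketch below (interfaces are our own, and it assumes every private-information realization has positive probability under $\pi$) builds the two payoff matrices indexed by pure maps from private information to actions.

```python
# Sketch: induced normal form of the one-stage Bayesian game SG_t(pi).
from itertools import product

def induced_normal_form(pi, cost, P1, P2, U1, U2):
    """pi: dict (x, p1, p2) -> probability.
    cost(i, x, p1, p2, u1, u2): agent i's cost for this stage; for t < T it is
    assumed to include the expected continuation value V^i_{t+1}(F_t(pi, Z_{t+1}))
    averaged over the common-information increment."""
    S1 = [dict(zip(P1, c)) for c in product(U1, repeat=len(P1))]
    S2 = [dict(zip(P2, c)) for c in product(U2, repeat=len(P2))]
    A = [[0.0] * len(S2) for _ in S1]   # expected costs of agent 1
    B = [[0.0] * len(S2) for _ in S1]   # expected costs of agent 2
    for (a, s1), (b, s2) in product(enumerate(S1), enumerate(S2)):
        for (x, p1, p2), prob in pi.items():
            u1, u2 = s1[p1], s2[p2]
            A[a][b] += prob * cost(1, x, p1, p2, u1, u2)
            B[a][b] += prob * cost(2, x, p1, p2, u1, u2)
    return S1, S2, A, B
```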
   We can now state the following theorem.
Theorem 4 For the finite game G1, a common information based Markov perfect equilibrium in behavioral strategies always exists. Further, this equilibrium can be found by first constructing strategies $\psi^1, \psi^2$ according to the backward inductive procedure of Algorithm 2 and then defining behavioral strategies $g^1, g^2$ in G1 as
$$g_t^i(\cdot, \pi_t) := \psi_t^i(\pi_t), \qquad i = 1, 2, \quad t = 1, 2, \ldots, T.$$
VI. DISCUSSION
A. Importance of Assumption 2
   The most restrictive assumption in our analysis of game G1 is Assumption 2 which states
that the common information based belief is independent of control strategies. It is instructive to
consider why our analysis does not work in the absence of this assumption. Let us consider the
model of Section II with Assumption 1 as before but without Assumption 2. Lemma 1, which
follows from Assumption 1, is still true. For this version of game G1 without Assumption 2, we
can construct an equivalent game with virtual players similar to game G2. Further, it is easy to
show that Theorem 1 which relates equilibria of G2 to those of G1 is still true.
   The key result for our analysis of game G2 in Section IV was Lemma 8, which allowed us to use $\pi_t$ as a Markov state and to define and characterize Markov perfect equilibria for the game G2. Lemma 8 essentially states that the set of Markov decision strategy pairs (that is, strategies that select prescriptions as a function of $\pi_t$) is closed with respect to the best response mapping.
In other words, if we start with any pair of Markov strategies $(\psi^1, \psi^2)$ for the virtual players and define $\hat{\psi}^i$ to be the best response of virtual player i to $\psi^j$, then, for at least one choice of best response strategies, the pair $(\hat{\psi}^1, \hat{\psi}^2)$ belongs to the set of Markov strategy pairs. This is true not just for strategies $(\psi^1, \psi^2)$ that form an equilibrium but for any choice of Markov strategies.
   Recall that due to Lemma 1, the belief t evolves as
                                                         1      2
                                      t = Ft1 (t1 , t1 , t1 , zt ).
Thus, in order to evaluate the current realization of t , a virtual player must know the prescrip-
tions used by both virtual players. However, the virtual players do not observe each others past
prescriptions since the only data they have available is ct . Thus, a virtual player cannot evaluate
the belief t without knowing (or assuming) how the other player selects its prescriptions.
   Consider now decision strategies $(\psi^1, \psi^2)$ for the two virtual players which operate as follows: At each time t, the prescriptions chosen by the virtual players are
$$\gamma_t^i = \psi_t^i(\pi_t), \qquad i = 1, 2, \qquad (29)$$
where the belief $\pi_t$ is computed as
$$\pi_t = F_{t-1}(\pi_{t-1}, \psi_{t-1}^1(\pi_{t-1}), \psi_{t-1}^2(\pi_{t-1}), z_t). \qquad (30)$$
Assume that the above strategies are not a Nash equilibrium for the virtual players' game.
Therefore, one virtual player, say virtual player 2, can benefit by deviating from its strategy.
Given that virtual player 1 continues to operate according to (29) and (30), is it possible for
virtual player 2 to reduce its cost by using a non-Markov strategy, that is, a strategy that selects
prescriptions based on more data than just $\pi_t$? Consider any time t; if virtual player 2 has deviated to some other choice of Markov decision rules $\tilde{\psi}_{1:t-1}^2$ in the past, then the true belief on state and private information given the common information,
$$\mu_t = \mathbb{P}^{\psi_{1:t-1}^1, \tilde{\psi}_{1:t-1}^2}(x_t, p_t^1, p_t^2 \mid c_t),$$
is different from the belief $\pi_t$ evaluated by the first player according to (30). (Note that since past prescriptions are not observed and virtual player 1's operation is fixed by (29) and (30), virtual player 1 continues to use $\pi_t$ evolving according to (30) as its belief.) Even though $\pi_t$ is no longer the true belief, virtual player 2 can still track its evolution using (30). Using arguments similar to those in the proofs of Lemmas 7 and 8, it can be established that an optimal strategy for virtual player 2, given that virtual player 1 operates according to (29) and (30), is of the form $\Gamma_t^2 = \tilde{\psi}_t^2(\mu_t, \pi_t)$, where $\mu_t$ is the true conditional belief on state and private information given the common information whereas $\pi_t$ is given by (30). Thus, the best response of player
2 may not necessarily be a Markov strategy and hence Lemma 8 may no longer hold. Without Lemma 8, we cannot define a Markov perfect equilibrium of game G2 using $\pi_t$ as the state.
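The two-belief structure of this best response can be made concrete with a short sketch that reuses the `belief_update` function given after Lemma 1 (names are ours): a deviating virtual player 2 keeps updating the belief $\pi_t$ that player 1 is using, via (30), alongside the true belief $\mu_t$ induced by the prescriptions actually applied.

```python
# Sketch: tracking both pi_t (player 1's belief under the assumed Markov
# strategies) and the true belief mu_t after player 2 deviates.
def track_beliefs(pi_t, mu_t, psi1, psi2, gamma2_actual, z_next, model):
    gamma1 = psi1(pi_t)                     # player 1 bases its prescription on pi_t
    pi_next = belief_update(pi_t, gamma1, psi2(pi_t), z_next, model)       # eq. (30)
    mu_next = belief_update(mu_t, gamma1, gamma2_actual, z_next, model)    # true belief
    return pi_next, mu_next
```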
   The game G1 is referred to as a team problem if the two controllers have the same cost functions, that is, $c^1(\cdot) = c^2(\cdot) = c^{\mathrm{team}}(\cdot)$. Nash equilibrium strategies can then be interpreted
as person-by-person optimal strategies [23]. Clearly, the results of sections IV and V apply to
person-by-person optimal strategies for team problems as well.
   For team problems, our results can be strengthened in two ways. Firstly, we can find globally
optimal strategies for the controllers in the team using the virtual player approach and secondly,
we no longer need to make Assumption 2. Let us retrace our steps in section IV for the team
problem without Assumption 2:
   1) We can once again introduce virtual players that observe the common information and select prescriptions for the controllers. The two virtual players have the same cost function, so game G2 is now a team problem and we will refer to it as T2. It is straightforward to establish that globally optimal strategies for the virtual players can be translated to globally optimal strategies for the controllers in the team in a manner identical to Theorem 1.
   2) Since we are no longer making Assumption 2, the common information belief evolves according to
$$\pi_t = F_{t-1}(\pi_{t-1}, \gamma_{t-1}^1, \gamma_{t-1}^2, z_t). \qquad (31)$$
       Virtual player 1 does not observe $\gamma_{t-1}^2$, so it cannot carry out the update described in (31). However, we will now increase the information available to the virtual players and assume that each virtual player can indeed observe all past prescriptions $\gamma_{1:t-1}^1, \gamma_{1:t-1}^2$. We refer to this team with expanded information for the virtual players as $\widehat{T2}$.
       It should be noted that the globally optimal expected cost for $\widehat{T2}$ can be no larger than the globally optimal cost of T2 since we have only added information in going from T2 to $\widehat{T2}$. We will later show that the globally optimal strategies we find for $\widehat{T2}$ can be translated to equivalent strategies for T2 with the same expected cost.
   3) For $\widehat{T2}$, since all past prescriptions are observed, both virtual players can evaluate $\pi_t$ using (31) without knowing the past decision rules. We can now repeat the arguments in the proof of Lemma 7 to show that an analogous result is true for team $\widehat{T2}$ as well. The team problem for the virtual players is now a Markov decision problem with $\pi_t$ evolving according to (31) as the Markov state and the prescription pair $(\Gamma_t^1, \Gamma_t^2)$ as the decision. We can then write a dynamic program for this Markov decision problem.
       Theorem 5 For the team problem $\widehat{T2}$ with virtual players, for each realization $\pi$ of $\pi_t$, the optimal prescriptions are the minimizers in the following dynamic program:
$$V_T^{\mathrm{team}}(\pi) := \min_{\tilde{\gamma}^1, \tilde{\gamma}^2} \mathbb{E}\big[c^{\mathrm{team}}(X_T, \Gamma_T^1(P_T^1), \Gamma_T^2(P_T^2)) \,\big|\, \pi_T = \pi, \Gamma_T^1 = \tilde{\gamma}^1, \Gamma_T^2 = \tilde{\gamma}^2\big], \qquad (32)$$
$$V_t^{\mathrm{team}}(\pi) := \min_{\tilde{\gamma}^1, \tilde{\gamma}^2} \mathbb{E}\big[c^{\mathrm{team}}(X_t, \Gamma_t^1(P_t^1), \Gamma_t^2(P_t^2)) + V_{t+1}^{\mathrm{team}}(\pi_{t+1}) \,\big|\, \pi_t = \pi, \Gamma_t^1 = \tilde{\gamma}^1, \Gamma_t^2 = \tilde{\gamma}^2\big], \qquad (33)$$
       where $\pi_{t+1} = F_t(\pi_t, \gamma_t^1, \gamma_t^2, Z_{t+1})$.
   4) Let $\psi_t^i(\pi)$, $i = 1, 2$, be the minimizing prescriptions in the right hand side of the definition of $V_t^{\mathrm{team}}(\pi)$ in the above dynamic program. The globally optimal virtual players' operation can be described as: At each t, evaluate
$$\pi_t = F_{t-1}(\pi_{t-1}, \gamma_{t-1}^1, \gamma_{t-1}^2, z_t), \qquad (34)$$
$$\gamma_t^i = \psi_t^i(\pi_t), \qquad i = 1, 2. \qquad (35)$$
       Now, instead of operating according to (34) and (35), assume that the virtual players operate as follows: At each t, evaluate
$$\pi_t = F_{t-1}(\pi_{t-1}, \psi_{t-1}^1(\pi_{t-1}), \psi_{t-1}^2(\pi_{t-1}), z_t), \qquad (36)$$
$$\gamma_t^i = \psi_t^i(\pi_t), \qquad i = 1, 2. \qquad (37)$$
       It should be clear that virtual players operating according to (36) and (37) will achieve the same globally optimal performance as virtual players operating according to (34) and (35). Furthermore, the virtual players in T2 can follow (36) and (37) and thus achieve the same globally optimal performance as in $\widehat{T2}$.
Thus, to find globally optimal strategies for the team of virtual players in the absence of Assumption 2, we first increased their information to include past prescriptions and then mapped the globally optimal strategies with increased information to equivalent strategies with the original information.
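A compact way to see the difference from the game case is that in Theorem 5 the minimization is joint over the prescription pair. The Python sketch below (its interfaces are our own assumptions) shows one stage of the dynamic program (32)-(33).

```python
# Sketch: one stage of the team dynamic program, joint minimization over
# prescription pairs (dicts from private information to actions).
from itertools import product

def team_stage(pi, expected_cost, P1, P2, U1, U2):
    """expected_cost(pi, g1, g2): E[c_team + V^team_{t+1}(pi_{t+1}) | pi_t = pi]
    under prescriptions g1, g2 (for t = T there is no continuation term)."""
    G1 = [dict(zip(P1, c)) for c in product(U1, repeat=len(P1))]
    G2 = [dict(zip(P2, c)) for c in product(U2, repeat=len(P2))]
    value, best1, best2 = min(((expected_cost(pi, g1, g2), g1, g2)
                               for g1, g2 in product(G1, G2)),
                              key=lambda triple: triple[0])
    return value, best1, best2   # V^team_t(pi), psi^1_t(pi), psi^2_t(pi)
```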
   For the game G2 in the absence of Assumption 2, we cannot follow the above approach of first increasing the virtual players' information to include past prescriptions, finding an equilibrium with the added information and then mapping the equilibrium strategies to equivalent strategies with the original information. To see the reason, let us denote the virtual player operation given by (34) and (35) by the strategy $\sigma^i$, $i = 1, 2$, and the virtual player operation given by (36) and (37) by the strategy $\hat{\sigma}^i$, $i = 1, 2$. Then, while it is true that $J^i(\sigma^1, \sigma^2) = J^i(\hat{\sigma}^1, \hat{\sigma}^2)$, $i = 1, 2$, for some other strategies $\beta^1, \beta^2$ it is not necessarily true that $J^i(\sigma^i, \beta^j) = J^i(\hat{\sigma}^i, \beta^j)$, $i, j = 1, 2$, $i \neq j$. Therefore, the equilibrium conditions for $\sigma^1, \sigma^2$,
$$J^1(\sigma^1, \sigma^2) \leq J^1(\beta^1, \sigma^2) \quad \text{and} \quad J^2(\sigma^1, \sigma^2) \leq J^2(\sigma^1, \beta^2) \quad \text{for all } \beta^1, \beta^2, \qquad (38)$$
are not equivalent to the corresponding equilibrium conditions for $\hat{\sigma}^1, \hat{\sigma}^2$,
$$J^1(\hat{\sigma}^1, \hat{\sigma}^2) \leq J^1(\beta^1, \hat{\sigma}^2) \quad \text{and} \quad J^2(\hat{\sigma}^1, \hat{\sigma}^2) \leq J^2(\hat{\sigma}^1, \beta^2) \quad \text{for all } \beta^1, \beta^2. \qquad (39)$$
Remark 5 Our dynamic program for the team problem is similar to the dynamic program for
teams obtained in [24] using a slightly different but conceptually similar approach.                             
   The class of common information based Markov perfect equilibria for asymmetric information
games bears conceptual similarities with Markov perfect equilibria of symmetric information
games with perfect state observation. In symmetric information games with perfect state obser-
vation, a controller may be using past state information only because the other controller is using
that information. Therefore, if one controller restricts to Markov strategies, the other controller
can do the same. This observation provides the justification for focusing only on Markov perfect
equilibria for such games. Our results show that a similar observation can be made in our model
of games with asymmetric information. A controller may be using the entire common information
only because the other controller is using that information. If one controller chooses to use only the
common information based belief on the state and private information, the other controller can
do the same. Thus, it is reasonable to focus on the class of common information based Markov
perfect equilibria for our model of games with asymmetric information.
   Further, for zero-sum games, the uniqueness of the value of the game implies that the equi-
librium cost of a common information based Markov perfect equilibrium is the same as the
equilibrium cost of any other Nash equilibrium [21].
   For finite games, it is always possible to find pure strategy Nash equilibria (if they exist) by a brute force search of the set of possible strategy profiles. The number of strategy choices for controller i is $|\mathcal{U}_1^i|^{|\mathcal{P}_1^i \times \mathcal{C}_1|} \times \cdots \times |\mathcal{U}_T^i|^{|\mathcal{P}_T^i \times \mathcal{C}_T|}$. For simplicity, assume that the set of possible realizations of private information $\mathcal{P}_t^i$ does not change with time. However, because the common information is required to be increasing with time (see Assumption 1), the cardinality of the set of possible realizations of common information $\mathcal{C}_t$ increases exponentially with time. Thus, the number of possible control strategies exhibits a double exponential growth with time.
   Algorithm 1 provides an alternative way of finding an equilibrium by solving a succession of one-stage Bayesian games. But how many such games need to be solved? At each time t, we need to solve a Bayesian game for each possible realization of the belief $\pi_t$. Let $\mathcal{R}_t$ denote the set of possible realizations of the belief $\pi_t$. Since the belief is simply a function of the common information, we must have $|\mathcal{R}_t| \leq |\mathcal{C}_t|$. Thus, the total number of one-stage games that need to be solved is no larger than $\sum_{t=1}^T |\mathcal{C}_t|$. Recalling the exponential growth of $|\mathcal{C}_t|$, the number of one-stage games to solve shows an exponential growth with time. This is clearly better than the double exponential growth for the brute force search.
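As a rough illustration of the two growth rates, consider binary observations and actions with one-step delayed sharing, so that $|\mathcal{P}_t^i| = 2$ and four new binary variables enter the common information at each step, i.e. $|\mathcal{C}_t| = 16^{t-1}$. The toy Python script below (the parameter choices are ours) prints the exponent of the brute-force strategy count against the number of one-stage games.

```python
# Toy comparison: double-exponential strategy count vs. exponential number
# of one-stage Bayesian games (binary model, one-step delayed sharing).
for T in range(1, 6):
    log2_strategies = 0     # log2 of the number of strategies of one controller
    num_stage_games = 0     # upper bound sum_t |C_t| used by Algorithm 1
    for t in range(1, T + 1):
        C_t = 16 ** (t - 1)                 # |C_t|
        log2_strategies += 2 * C_t          # |U|^(|P_t^i| * |C_t|) = 2^(2 |C_t|)
        num_stage_games += C_t
    print(f"T={T}: strategies per controller = 2^{log2_strategies}, "
          f"one-stage games <= {num_stage_games}")
```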
   Two possible reasons may further reduce the complexity of Algorithm 1. Firstly, the set $\mathcal{R}_t$ may not grow exponentially with time (as in the case of the information structure in Section IV-D, where $|\mathcal{R}_t| = 3$ for all $t > 1$). Secondly, the one-stage games $SG_t(\pi)$ at time t may possess enough structure that it is possible to find an equilibrium for a generic $\pi$ that can be used to construct an equilibrium for all choices of $\pi$. For finite games, it is not clear what additional features need to be present in game G1 such that the resulting one-stage games $SG_t(\pi)$ can be solved for a generic $\pi$. In the sequel to this paper we will extend the approach used here to linear quadratic Gaussian games and show that in these games it is possible to solve the one-stage games for a generic belief $\pi$.
   Conceptually, the approach adopted in this paper can be extended to infinite time horizon
games with discounted costs under suitable stationarity conditions. However, in infinite horizon
games, the number of possible realizations of the common information based belief would, in
general, be infinite. Establishing the existence of common information based Markov perfect
equilibria for infinite horizon games would be an interesting direction for future work in this
area.
VIII. ACKNOWLEDGMENTS
   This work was supported in part by the AFOSR MURI Grant FA9550-10-1-0573. The second
author thanks Bharti Center for Telecommunications, IIT Bombay for infrastructural support and
Research Internship in Science and Engineering program of Indo-US Science and Technology
Forum for supporting the visit to Indian Institute of Technology Bombay.
APPENDIX A
PROOF OF LEMMA 1
The joint conditional distribution of $(X_t, P^1_t, P^2_t, X_{t+1}, P^1_{t+1}, P^2_{t+1}, Z_{t+1})$ given the common information $c_t$ and the prescriptions $\gamma^1_{1:t}, \gamma^2_{1:t}$ can be written as
$$
\sum_{y^1_{t+1}, y^2_{t+1}, u^1_t, u^2_t} \mathbb{1}\{\zeta_{t+1}(p^1_t, p^2_t, u^1_t, u^2_t, y^1_{t+1}, y^2_{t+1}) = z_{t+1}\}\, \mathbb{1}\{\xi^1_{t+1}(p^1_t, u^1_t, y^1_{t+1}) = p^1_{t+1}\}\, \mathbb{1}\{\xi^2_{t+1}(p^2_t, u^2_t, y^2_{t+1}) = p^2_{t+1}\}
$$
$$
\times\, P(y^1_{t+1}, y^2_{t+1} \mid x_{t+1})\, P(x_{t+1} \mid x_t, u^1_t, u^2_t)\, \mathbb{1}\{\gamma^1_t(p^1_t) = u^1_t\}\, \mathbb{1}\{\gamma^2_t(p^2_t) = u^2_t\}\, \pi_t(x_t, p^1_t, p^2_t). \quad (40)
$$
Note that, in addition to the arguments on the left side of the conditioning in (40), we only need
$\pi_t$ and $\gamma^1_t, \gamma^2_t$ to evaluate the right hand side of (40). That is, the joint conditional distribution
on $(X_t, P^1_t, P^2_t, X_{t+1}, P^1_{t+1}, P^2_{t+1}, Z_{t+1})$ depends only on $\pi_t$, $\gamma^1_t$ and $\gamma^2_t$, with no dependence on the
control strategies.
   We can now consider the common information based belief at time $t+1$: by (40) and Bayes' rule, $\pi_{t+1}$ is obtained from $\pi_t$, $\gamma^1_t$, $\gamma^2_t$ and $z_{t+1}$ through a transformation that does not depend on the control strategies.
APPENDIX B
PROOF OF LEMMA 7
   Consider a realization $c_t$ of the common information at time $t$ and realizations $\pi_{1:t}$, $\gamma^1_{1:t}$, $\gamma^2_{1:t}$ of
the beliefs and prescriptions till time $t$. Because of (10) in Assumption 2, we have
$$
\pi_{t+1} = F_t(\pi_t, Z_{t+1}).
$$
Recall that
$$
Z_{t+1} = \zeta_{t+1}(P^1_t, P^2_t, U^1_t, U^2_t, Y^1_{t+1}, Y^2_{t+1}) = \zeta_{t+1}(P^1_t, P^2_t, \gamma^1_t(P^1_t), \gamma^2_t(P^2_t), Y^1_{t+1}, Y^2_{t+1}),
$$
where we used the fact that the control actions are simply the prescriptions evaluated at the
private information. Therefore,
$$
P(Z_{t+1} = z \mid c_t, \pi_{1:t}, \gamma^1_{1:t}, \gamma^2_{1:t})
= \sum_{x_t, x_{t+1}, y^1_{t+1}, y^2_{t+1}, p^1_t, p^2_t} P(Z_{t+1} = z, x_t, x_{t+1}, y^1_{t+1}, y^2_{t+1}, p^1_t, p^2_t \mid c_t, \pi_{1:t}, \gamma^1_{1:t}, \gamma^2_{1:t})
$$
$$
= \sum_{x_t, x_{t+1}, y^1_{t+1}, y^2_{t+1}, p^1_t, p^2_t} \mathbb{1}\{\zeta_{t+1}(p^1_t, p^2_t, \gamma^1_t(p^1_t), \gamma^2_t(p^2_t), y^1_{t+1}, y^2_{t+1}) = z\}\, P(y^1_{t+1}, y^2_{t+1} \mid x_{t+1})
$$
$$
\qquad \times\, P(x_{t+1} \mid x_t, \gamma^1_t(p^1_t), \gamma^2_t(p^2_t))\, P(x_t, p^1_t, p^2_t \mid c_t, \pi_{1:t}, \gamma^1_{1:t}, \gamma^2_{1:t})
$$
$$
= \sum_{x_t, x_{t+1}, y^1_{t+1}, y^2_{t+1}, p^1_t, p^2_t} \mathbb{1}\{\zeta_{t+1}(p^1_t, p^2_t, \gamma^1_t(p^1_t), \gamma^2_t(p^2_t), y^1_{t+1}, y^2_{t+1}) = z\}\, P(y^1_{t+1}, y^2_{t+1} \mid x_{t+1})
$$
$$
\qquad \times\, P(x_{t+1} \mid x_t, \gamma^1_t(p^1_t), \gamma^2_t(p^2_t))\, \pi_t(x_t, p^1_t, p^2_t), \quad (44)
$$
where we used the fact that $P(x_t, p^1_t, p^2_t \mid c_t, \pi_{1:t}, \gamma^1_{1:t}, \gamma^2_{1:t}) = P(x_t, p^1_t, p^2_t \mid c_t)$, since $\pi_{1:t}, \gamma^1_{1:t}, \gamma^2_{1:t}$
are all functions of $c_t$, and the fact that $P(x_t, p^1_t, p^2_t \mid c_t) =: \pi_t(x_t, p^1_t, p^2_t)$. The right hand side in
(44) depends only on $\pi_t$ and $\gamma^1_t, \gamma^2_t$. Thus, the conditional probability of $Z_{t+1} = z$ conditioned
on $c_t, \pi_{1:t}, \gamma^1_{1:t}, \gamma^2_{1:t}$ depends only on $\pi_t$ and $\gamma^1_t, \gamma^2_t$. This establishes (42) and hence the lemma.
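   The sum in (44) can be evaluated directly once the belief and the prescriptions are fixed. The sketch below computes $P(Z_{t+1} = z \mid \pi_t, \gamma^1_t, \gamma^2_t)$ by that sum; all array names and shapes, and the increment map zeta standing in for $\zeta_{t+1}$, are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

# P(Z_{t+1} = z | pi_t, gamma1, gamma2) as in (44), for finite sets.
# pi_t[x, p1, p2] : common information based belief, trans[x, u1, u2, xn] : state kernel,
# obs[xn, y1, y2] : joint observation kernel, zeta(...) : common information increment map.
def z_transition_prob(z, pi_t, gamma1, gamma2, trans, obs, zeta):
    nx, np1, np2 = pi_t.shape
    ny1, ny2 = obs.shape[1], obs.shape[2]
    prob = 0.0
    for x in range(nx):
        for p1 in range(np1):
            for p2 in range(np2):
                u1, u2 = gamma1[p1], gamma2[p2]       # actions fixed by the prescriptions
                for xn in range(nx):
                    for y1 in range(ny1):
                        for y2 in range(ny2):
                            if zeta(p1, p2, u1, u2, y1, y2) == z:
                                prob += obs[xn, y1, y2] * trans[x, u1, u2, xn] * pi_t[x, p1, p2]
    return prob

# Tiny example: the increment is (u1, u2, y1, y2); all probabilities sum to 1 over z.
rng = np.random.default_rng(0)
trans = rng.random((2, 2, 2, 2)); trans /= trans.sum(axis=-1, keepdims=True)
obs = rng.random((2, 2, 2)); obs /= obs.reshape(2, -1).sum(axis=-1)[:, None, None]
pi_t = np.full((2, 2, 2), 1 / 8)
zeta = lambda p1, p2, u1, u2, y1, y2: (u1, u2, y1, y2)
total = sum(z_transition_prob((u1, u2, y1, y2), pi_t, [0, 1], [1, 0], trans, obs, zeta)
            for u1 in range(2) for u2 in range(2) for y1 in range(2) for y2 in range(2))
print(total)  # 1.0
```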
APPENDIX C
PROOF OF LEMMA 8
If virtual player 2 selects the prescription $\gamma^2_t$, its expected instantaneous cost is
$$
\mathbb{E}[c^2(X_t, U^1_t, U^2_t) \mid c_t] = \mathbb{E}[c^2(X_t, \gamma^1_t(P^1_t), \gamma^2_t(P^2_t)) \mid c_t]
= \sum_{x_t, p^1_t, p^2_t} c^2(x_t, \gamma^1_t(p^1_t), \gamma^2_t(p^2_t))\, P(x_t, p^1_t, p^2_t \mid c_t)
$$
$$
= \sum_{x_t, p^1_t, p^2_t} c^2(x_t, \gamma^1_t(p^1_t), \gamma^2_t(p^2_t))\, \pi_t(x_t, p^1_t, p^2_t) =: \tilde{c}^2(\pi_t, \gamma^2_t). \quad (45)
$$
Thus, given the fixed strategy of virtual player 1, the instantaneous expected cost for virtual
player 2 depends only on the belief $\pi_t$ and the prescription selected by virtual player 2. Given
the controlled Markov nature of $\pi_t$, it follows that virtual player 2's optimization problem is a
Markov decision problem with $\pi_t$ as the state; hence virtual player 2 can optimally select its
prescription as a function of $\pi_t$. This completes the proof of the lemma.
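   The expectation in (45) is a finite sum and can be computed directly. The following minimal sketch evaluates it for given arrays; the names and sizes (pi, cost, the prescriptions) are hypothetical placeholders for the quantities in (45).

```python
import numpy as np

# Expected stage cost (45): sum over (x, p1, p2) of cost[x, gamma1[p1], gamma2[p2]] * pi[x, p1, p2].
def stage_cost(pi, gamma1, gamma2, cost):
    nx, np1, np2 = pi.shape
    # pick the cost incurred at each (x, p1, p2) once the prescriptions choose the actions
    c = cost[np.arange(nx)[:, None, None], gamma1[None, :, None], gamma2[None, None, :]]
    return float((pi * c).sum())

pi = np.full((2, 2, 2), 1 / 8)                     # uniform belief over (x, p1, p2)
cost = np.arange(8, dtype=float).reshape(2, 2, 2)  # cost[x, u1, u2]
print(stage_cost(pi, np.array([0, 1]), np.array([1, 0]), cost))  # 3.5
```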
APPENDIX D
PROOF OF THEOREM 2
   Consider a pair of strategies that satisfies the conditions of the theorem. For any $1 \le
k \le T$ and any realization $c_k$ of the common information at time $k$, we want to show that the
strategies form a Nash equilibrium of the sub-game starting from time $k$ with the costs given as
$$
\mathbb{E}\Big[\sum_{t=k}^{T} c^i(X_t, U^1_t, U^2_t) \,\Big|\, c_k\Big]. \quad (46)
$$
APPENDIX E
PROOF OF THEOREM 3
   Consider any realization $\pi$ of the common information based belief and consider a Bayesian
Nash equilibrium $\gamma^{1*}, \gamma^{2*}$ of the game $SG_T(\pi)$. We will show that $\gamma^{1*}, \gamma^{2*}$ satisfy the value
function conditions for time $T$ in Theorem 2. By definition of Bayesian Nash equilibrium, for
every realization $p^1$ of $P^1_T$ and any choice of $\gamma^1$,
$$
\mathbb{E}^{\pi}[c^1(X_T, \gamma^{1*}(P^1_T), \gamma^{2*}(P^2_T)) \mid P^1_T = p^1] \le \mathbb{E}^{\pi}[c^1(X_T, \gamma^{1}(P^1_T), \gamma^{2*}(P^2_T)) \mid P^1_T = p^1], \quad (47)
$$
$$
\implies \mathbb{E}^{\pi}[c^1(X_T, \gamma^{1*}(P^1_T), \gamma^{2*}(P^2_T))] \le \mathbb{E}^{\pi}[c^1(X_T, \gamma^{1}(P^1_T), \gamma^{2*}(P^2_T))], \quad (48)
$$
where all the expectations are with respect to the belief $\pi$ on $(X_T, P^1_T, P^2_T)$. Similarly,
$$
\mathbb{E}^{\pi}[c^2(X_T, \gamma^{1*}(P^1_T), \gamma^{2*}(P^2_T))] \le \mathbb{E}^{\pi}[c^2(X_T, \gamma^{1*}(P^1_T), \gamma^{2}(P^2_T))], \quad (49)
$$
for any choice of $\gamma^2$. Thus, the time-$T$ mappings that assign the prescriptions $\gamma^{1*}, \gamma^{2*}$ to the
belief realization $\pi$ satisfy the conditions in (20) and (21) when $\Pi_T = \pi$.
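   For finite action, state and private information sets, the conditions above can be verified by direct enumeration of prescription pairs. The sketch below searches for a pure-strategy equilibrium of a one-stage game with the structure of $SG_t(\pi)$; it checks ex-ante best responses, which the argument above shows are implied by the interim conditions. All names, sizes and the random cost data are illustrative assumptions.

```python
import itertools
import numpy as np

def expected_cost(pi, g1, g2, cost):
    # ex-ante expected cost under belief pi[x, p1, p2] when prescriptions g1, g2 are used
    nx, np1, np2 = pi.shape
    return sum(pi[x, p1, p2] * cost[x, g1[p1], g2[p2]]
               for x in range(nx) for p1 in range(np1) for p2 in range(np2))

def find_pure_bne(pi, cost1, cost2, nU1, nU2):
    nx, np1, np2 = pi.shape
    G1 = list(itertools.product(range(nU1), repeat=np1))  # all prescriptions of player 1
    G2 = list(itertools.product(range(nU2), repeat=np2))  # all prescriptions of player 2
    for g1, g2 in itertools.product(G1, G2):
        best1 = all(expected_cost(pi, g1, g2, cost1) <= expected_cost(pi, d1, g2, cost1) + 1e-12
                    for d1 in G1)
        best2 = all(expected_cost(pi, g1, g2, cost2) <= expected_cost(pi, g1, d2, cost2) + 1e-12
                    for d2 in G2)
        if best1 and best2:
            return g1, g2     # a pure-strategy Bayesian Nash equilibrium
    return None               # a pure-strategy equilibrium need not exist

rng = np.random.default_rng(1)
pi = rng.random((2, 2, 2)); pi /= pi.sum()
print(find_pure_bne(pi, rng.random((2, 2, 2)), rng.random((2, 2, 2)), nU1=2, nU2=2))
```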
   Similarly, for any time $t < T$, consider any realization $\pi$ of the common information based
belief at $t$ and a Bayesian Nash equilibrium $\gamma^{1*}, \gamma^{2*}$ of the game $SG_t(\pi)$. Then, by
definition of Bayesian Nash equilibrium, for every realization $p^1$ and any choice of $\gamma^1$, the expression
$$
\mathbb{E}^{\pi}\big[c^1(X_t, \gamma^{1*}(P^1_t), \gamma^{2*}(P^2_t)) + V^1_{t+1}(F_t(\pi, Z_{t+1})) \,\big|\, P^1_t = p^1\big]
$$
(where $Z_{t+1}$ is the increment in common information generated according to (6), (2) and (1)
when the control actions $U^1_t = \gamma^{1*}(p^1)$ and $U^2_t = \gamma^{2*}(P^2_t)$ are used) can be no larger than
$$
\mathbb{E}^{\pi}\big[c^1(X_t, \gamma^{1}(P^1_t), \gamma^{2*}(P^2_t)) + V^1_{t+1}(F_t(\pi, Z_{t+1})) \,\big|\, P^1_t = p^1\big]
$$
(where $Z_{t+1}$ is the increment in common information generated according to (6), (2) and (1)
when the control actions $U^1_t = \gamma^{1}(p^1)$ and $U^2_t = \gamma^{2*}(P^2_t)$ are used). Similar conditions hold for
virtual player 2.
REFERENCES
 [1] L. S. Shapley, "Stochastic games," Proc. Natl. Acad. Sci. USA, vol. 39, pp. 1095–1100, 1953.
 [2] M. J. Sobel, "Noncooperative stochastic games," The Annals of Mathematical Statistics, vol. 42, no. 6, pp. 1930–1935, 1971. [Online]. Available: http://www.jstor.org/stable/2240119
 [3] D. Fudenberg and J. Tirole, Game Theory. MIT Press, 1991.
 [4] T. Basar and G. J. Olsder, Dynamic Non-cooperative Game Theory. SIAM Series in Classics in Applied Mathematics, Philadelphia, 1999.
 [5] J. Filar and K. Vrieze, Competitive Markov Decision Processes. Springer, 1996.
 [6] R. Behn and Y.-C. Ho, "On a class of linear stochastic differential games," IEEE Trans. Autom. Contr., vol. 13, no. 3, pp. 227–240, Jun. 1968.
 [7] I. Rhodes and D. Luenberger, "Differential games with imperfect state information," IEEE Trans. Autom. Contr., vol. 14, no. 1, pp. 29–38, Feb. 1969.
 [8] W. Willman, "Formal solutions for a class of stochastic pursuit-evasion games," IEEE Trans. Autom. Contr., vol. 14, no. 5, pp. 504–509, Oct. 1969.
 [9] Y. C. Ho, "On the minimax principle and zero-sum stochastic differential games," Journal of Optimization Theory and Applications, vol. 13, no. 3, pp. 343–361, 1974.
[10] T. Basar and M. Mintz, "A multistage pursuit-evasion game that admits a Gaussian random process as a maximin control policy," Stochastics, vol. 1, no. 1-4, pp. 25–69, 1973.
[11] T. Basar, "Two-criteria LQG decision problems with one-step delay observation sharing pattern," Information and Control, vol. 38, pp. 21–50, 1978.
[12] T. Basar, "Decentralized multicriteria optimization of linear stochastic systems," IEEE Trans. Autom. Contr., vol. 23, no. 2, pp. 233–243, Apr. 1978.
[13] E. Altman, V. Kambley, and A. Silva, "Stochastic games with one step delay sharing information pattern with application to power control," in Proceedings of the International Conference on Game Theory for Networks (GameNets 09), May 2009, pp. 124–129.
[14] J. Hespanha and M. Prandini, "Nash equilibria in partial-information games on Markov chains," in Proc. of the 40th IEEE Conference on Decision and Control, 2001, pp. 2102–2107.
[15] T. Basar, "On the saddle-point solution of a class of stochastic differential games," Journal of Optimization Theory and Applications, vol. 33, no. 4, pp. 539–556, 1981.
[16] H. Cole and N. Kocherlakota, "Dynamic games with hidden actions and hidden states," Journal of Economic Theory, vol. 98, no. 1, pp. 114–126, 2001.
[17] A. Nayyar, A. Mahajan, and D. Teneketzis, "Optimal control strategies in delayed sharing information structures," IEEE Transactions on Automatic Control, vol. 57, no. 7, pp. 1606–1620, July 2011.
[18] A. Nayyar and T. Basar, "Dynamic stochastic games with asymmetric information," accepted in the 51st IEEE Conference on Decision and Control, 2012.
[19] P. R. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification and Adaptive Control. Prentice Hall, Englewood Cliffs, NJ, 1986.
[20] E. Maskin and J. Tirole, "Markov perfect equilibrium: I. Observable actions," Journal of Economic Theory, vol. 100, no. 2, pp. 191–219, 2001. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0022053100927856
[21] M. J. Osborne and A. Rubinstein, A Course in Game Theory. MIT Press, 1994.
[22] R. B. Myerson, Game Theory: Analysis of Conflict. Harvard University Press, Cambridge, MA, 1997.
[23] Y.-C. Ho, "Team decision theory and information structures," Proc. IEEE, vol. 68, no. 6, pp. 644–654, 1980.
[24] A. Nayyar, A. Mahajan, and D. Teneketzis, "Decentralized stochastic control with partial sharing information structures: A common information approach," IEEE Transactions on Automatic Control, submitted, Dec. 2011.
Supplementary Material
APPENDIX F
PROOF OF LEMMA 2
      Proof: It is straightforward to verify that the structure of common and private information
satisfies Assumption 1. We focus on the proof for Assumption 2. For a realization $y^1_{1:t}, y^2_{1:t}, u^1_{1:t}, u^2_{1:t}$
of the common information at time $t+1$, the common information based belief can be written
as
$$
\pi_{t+1}(x_{t+1}, y^1_{t+1}, y^2_{t+1}) = P^{g^1_{1:t}, g^2_{1:t}}(X_{t+1} = x_{t+1}, Y^1_{t+1} = y^1_{t+1}, Y^2_{t+1} = y^2_{t+1} \mid y^1_{1:t}, y^2_{1:t}, u^1_{1:t}, u^2_{1:t})
$$
$$
= P(Y^1_{t+1} = y^1_{t+1} \mid X_{t+1} = x_{t+1})\, P(Y^2_{t+1} = y^2_{t+1} \mid X_{t+1} = x_{t+1})\, P^{g^1_{1:t}, g^2_{1:t}}(X_{t+1} = x_{t+1} \mid y^1_{1:t}, y^2_{1:t}, u^1_{1:t}, u^2_{1:t})
$$
$$
= P(Y^1_{t+1} = y^1_{t+1} \mid X_{t+1} = x_{t+1})\, P(Y^2_{t+1} = y^2_{t+1} \mid X_{t+1} = x_{t+1}) \sum_{x_t} \Big[ P(X_{t+1} = x_{t+1} \mid X_t = x_t, u^1_t, u^2_t)\, P^{g^1_{1:t}, g^2_{1:t}}(X_t = x_t \mid y^1_{1:t}, y^2_{1:t}, u^1_{1:t}, u^2_{1:t}) \Big], \quad (50)
$$
where we used the dynamics and observation model to get the expression in (50). It can now be
argued that, in the last term in (50), we can remove $u^1_t, u^2_t$ from the conditioning, since
they are functions of the remaining terms $y^1_{1:t}, y^2_{1:t}, u^1_{1:t-1}, u^2_{1:t-1}$ in the conditioning. The last term
in (50) would then be
$$
P^{g^1_{1:t}, g^2_{1:t}}(X_t = x_t \mid y^1_{1:t}, y^2_{1:t}, u^1_{1:t-1}, u^2_{1:t-1}),
$$
which is known to be independent of the choice of control laws $g^1_{1:t}, g^2_{1:t}$ [19]. Thus, $\pi_{t+1}$ is independent
of the choice of control laws. For the sake of completeness, we provide a more detailed
argument below.
      The last term in (50) can be written as
$$
P^{g^1_{1:t}, g^2_{1:t}}(X_t = x_t \mid y^1_{1:t}, y^2_{1:t}, u^1_{1:t}, u^2_{1:t})
= P^{g^1_{1:t}, g^2_{1:t}}(X_t = x_t \mid y^1_{1:t}, y^2_{1:t}, u^1_{1:t-1}, u^2_{1:t-1})
$$
$$
= \frac{P^{g^1_{1:t}, g^2_{1:t}}(X_t = x_t, Y^1_t = y^1_t, Y^2_t = y^2_t \mid y^1_{1:t-1}, y^2_{1:t-1}, u^1_{1:t-1}, u^2_{1:t-1})}{P^{g^1_{1:t}, g^2_{1:t}}(Y^1_t = y^1_t, Y^2_t = y^2_t \mid y^1_{1:t-1}, y^2_{1:t-1}, u^1_{1:t-1}, u^2_{1:t-1})}
= \frac{\pi_t(x_t, y^1_t, y^2_t)}{\sum_{x'_t} \pi_t(x'_t, y^1_t, y^2_t)}. \quad (51)
$$
   Combining (50) and (51) establishes that $\pi_{t+1}$ is a function only of $\pi_t$ and $z_{t+1} = (y^1_t, y^2_t, u^1_t, u^2_t)$.
Further, the transformation from $(\pi_t, z_{t+1})$ to $\pi_{t+1}$ does not depend on the choice of control
strategies.
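   The transformation obtained by combining (50) and (51) is straightforward to evaluate numerically. The following minimal sketch implements it for finite sets; the array names and shapes (pi_t, trans, obs1, obs2) and the encoding of the increment z are illustrative assumptions about the model data, not notation from the paper.

```python
import numpy as np

# Strategy-independent belief update F_t(pi_t, z_{t+1}) from (50)-(51):
#   pi_t[x, y1, y2]          : belief over (X_t, Y^1_t, Y^2_t) given the common information,
#   trans[x, u1, u2, x_next] : P(X_{t+1} = x_next | X_t = x, u1, u2),
#   obs1[x_next, y1], obs2[x_next, y2] : P(Y^i_{t+1} = y^i | X_{t+1} = x_next),
#   z = (y1_t, y2_t, u1_t, u2_t) : increment in common information.
def belief_update(pi_t, z, trans, obs1, obs2):
    y1_t, y2_t, u1_t, u2_t = z
    # conditional distribution of X_t given the newly shared (y1_t, y2_t), as in (51)
    p_xt = pi_t[:, y1_t, y2_t]
    p_xt = p_xt / p_xt.sum()
    # predict X_{t+1}, then attach the likelihoods of the next private observations, as in (50)
    p_xnext = trans[:, u1_t, u2_t, :].T @ p_xt
    return obs1[:, :, None] * obs2[:, None, :] * p_xnext[:, None, None]

# Tiny example: 2 states, binary observations and actions, uniform initial belief.
rng = np.random.default_rng(0)
trans = rng.random((2, 2, 2, 2)); trans /= trans.sum(axis=-1, keepdims=True)
obs1 = rng.random((2, 2)); obs1 /= obs1.sum(axis=-1, keepdims=True)
obs2 = rng.random((2, 2)); obs2 /= obs2.sum(axis=-1, keepdims=True)
pi_t = np.full((2, 2, 2), 1 / 8)
print(belief_update(pi_t, (0, 1, 1, 0), trans, obs1, obs2).sum())  # sums to 1
```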
APPENDIX G
PROOF OF LEMMA 3
      Proof: It is straightforward to verify that the structure of common and private information
satisfies Assumption 1. We focus on the proof for Assumption 2. For a realization $y^1_{1:t+1}, y^2_{1:t},
u^1_{1:t}, u^2_{1:t}$ of the common information at time $t+1$, the common information based belief can be
written as
$$
\pi_{t+1}(x_{t+1}, y^2_{t+1}) = P^{g^1_{1:t}, g^2_{1:t}}(X_{t+1} = x_{t+1}, Y^2_{t+1} = y^2_{t+1} \mid y^1_{1:t+1}, y^2_{1:t}, u^1_{1:t}, u^2_{1:t})
$$
$$
= P(Y^2_{t+1} = y^2_{t+1} \mid X_{t+1} = x_{t+1})\, P^{g^1_{1:t}, g^2_{1:t}}(X_{t+1} = x_{t+1} \mid y^1_{1:t+1}, y^2_{1:t}, u^1_{1:t}, u^2_{1:t})
$$
$$
= P(Y^2_{t+1} = y^2_{t+1} \mid X_{t+1} = x_{t+1})\, \frac{P^{g^1_{1:t}, g^2_{1:t}}(X_{t+1} = x_{t+1}, Y^1_{t+1} = y^1_{t+1} \mid y^1_{1:t}, y^2_{1:t}, u^1_{1:t}, u^2_{1:t})}{\sum_{x} P^{g^1_{1:t}, g^2_{1:t}}(X_{t+1} = x, Y^1_{t+1} = y^1_{t+1} \mid y^1_{1:t}, y^2_{1:t}, u^1_{1:t}, u^2_{1:t})}. \quad (52)
$$
The numerator in the second term in (52) can be written as
$$
P(Y^1_{t+1} = y^1_{t+1} \mid X_{t+1} = x_{t+1})\, P^{g^1_{1:t}, g^2_{1:t}}(X_{t+1} = x_{t+1} \mid y^1_{1:t}, y^2_{1:t}, u^1_{1:t}, u^2_{1:t})
$$
$$
= P(Y^1_{t+1} = y^1_{t+1} \mid X_{t+1} = x_{t+1}) \sum_{x_t} \Big[ P(X_{t+1} = x_{t+1} \mid X_t = x_t, u^1_t, u^2_t)\, P^{g^1_{1:t}, g^2_{1:t}}(X_t = x_t \mid y^1_{1:t}, y^2_{1:t}, u^1_{1:t-1}, u^2_{1:t-1}) \Big]
$$
$$
= P(Y^1_{t+1} = y^1_{t+1} \mid X_{t+1} = x_{t+1}) \sum_{x_t} \Big[ P(X_{t+1} = x_{t+1} \mid X_t = x_t, u^1_t, u^2_t)\, \frac{P^{g^1_{1:t}, g^2_{1:t}}(X_t = x_t, y^2_t \mid y^1_{1:t}, y^2_{1:t-1}, u^1_{1:t-1}, u^2_{1:t-1})}{P^{g^1_{1:t}, g^2_{1:t}}(y^2_t \mid y^1_{1:t}, y^2_{1:t-1}, u^1_{1:t-1}, u^2_{1:t-1})} \Big]
$$
$$
= P(Y^1_{t+1} = y^1_{t+1} \mid X_{t+1} = x_{t+1}) \sum_{x_t} \Big[ P(X_{t+1} = x_{t+1} \mid X_t = x_t, u^1_t, u^2_t)\, \frac{\pi_t(x_t, y^2_t)}{\sum_{x'_t} \pi_t(x'_t, y^2_t)} \Big]. \quad (53)
$$
Similar expressions can be obtained for the denominator of the second term in (52) to get
$$
\pi_{t+1}(x_{t+1}, y^2_{t+1}) = P(Y^2_{t+1} = y^2_{t+1} \mid X_{t+1} = x_{t+1})\,
\frac{P(Y^1_{t+1} = y^1_{t+1} \mid X_{t+1} = x_{t+1}) \sum_{x_t} \big[ P(X_{t+1} = x_{t+1} \mid X_t = x_t, u^1_t, u^2_t)\, \pi_t(x_t, y^2_t) \big]}{\sum_{x} P(Y^1_{t+1} = y^1_{t+1} \mid X_{t+1} = x) \sum_{x_t} \big[ P(X_{t+1} = x \mid X_t = x_t, u^1_t, u^2_t)\, \pi_t(x_t, y^2_t) \big]}
$$
$$
=: F_t(\pi_t, y^1_{t+1}, y^2_t, u^1_t, u^2_t) = F_t(\pi_t, z_{t+1}). \quad (54)
$$
APPENDIX H
PROOF OF LEMMA 4
$$
\pi_t(x_t, y^2_{t-d+1:t}) = P^{g^1_{1:t-1}}(X_t = x_t, Y^2_{t-d+1:t} = y^2_{t-d+1:t} \mid y^1_{1:t}, y^2_{1:t-d}, u^1_{1:t-1})
$$
$$
= \sum_{x_{t-d:t-1}} \Big[ P(Y^2_{t-d+1:t} = y^2_{t-d+1:t} \mid X_{t-d+1:t-1} = x_{t-d+1:t-1}, X_t = x_t)\, P^{g^1_{1:t-1}}(X_t = x_t, X_{t-d:t-1} = x_{t-d:t-1} \mid y^1_{1:t}, y^2_{1:t-d}, u^1_{1:t-1}) \Big]. \quad (55)
$$
The first term in (55) depends only on the noise statistics. To see that the second term in
(55) is strategy independent, consider a centralized stochastic control problem with controller
1 as the only controller, where the state process is $\tilde{X}_t := X_{t-d:t}$ and the observation process is
$\tilde{Y}_t := (Y^1_t, Y^2_{t-d})$. The second term in (55) is simply the information state $P(\tilde{X}_t \mid \tilde{y}_{1:t}, u^1_{1:t-1})$
of this centralized stochastic control problem, which is known to be strategy independent and
satisfies an update equation of the form required by Lemma 4 [19].
   Case B: Using arguments similar to those in Case A, the common information based belief
$\pi_t$ for a realization $y^1_{1:t-1}, y^2_{1:t-d}, u^1_{1:t-1}$ of the common information can be written as
$$
\pi_t(x_t, y^1_t, y^2_{t-d+1:t}) = \sum_{x_{t-d:t-1}} \Big[ P(Y^1_t = y^1_t, Y^2_{t-d+1:t} = y^2_{t-d+1:t} \mid X_{t-d+1:t-1} = x_{t-d+1:t-1}, X_t = x_t)\, P^{g^1_{1:t-1}}(X_t = x_t, X_{t-d:t-1} = x_{t-d:t-1} \mid y^1_{1:t-1}, y^2_{1:t-d}, u^1_{1:t-1}) \Big]. \quad (56)
$$
APPENDIX I
PROOF OF LEMMA 5
   For a realization $y^0_{1:t+1}, u^1_{1:t}, u^2_{1:t}$ of the common information at time $t+1$, the belief $\pi_{t+1}$ is
given as
$$
\pi_{t+1}(x^0, x^1, x^2) = P^{g^1_{1:t}, g^2_{1:t}}(X^0_{t+1} = x^0, X^1_{t+1} = x^1, X^2_{t+1} = x^2 \mid y^0_{1:t+1}, u^1_{1:t}, u^2_{1:t}) \quad (58)
$$
$$
= \frac{P^{g^1_{1:t}, g^2_{1:t}}(X^0_{t+1} = x^0, X^1_{t+1} = x^1, X^2_{t+1} = x^2, Y^0_{t+1} = y^0_{t+1} \mid y^0_{1:t}, u^1_{1:t}, u^2_{1:t})}{P^{g^1_{1:t}, g^2_{1:t}}(Y^0_{t+1} = y^0_{t+1} \mid y^0_{1:t}, u^1_{1:t}, u^2_{1:t})}
$$
$$
= \frac{P(Y^0_{t+1} = y^0_{t+1} \mid X^0_{t+1} = x^0)\, P^{g^1_{1:t}, g^2_{1:t}}(X^0_{t+1} = x^0, X^1_{t+1} = x^1, X^2_{t+1} = x^2 \mid y^0_{1:t}, u^1_{1:t}, u^2_{1:t})}{\sum_{x} P(Y^0_{t+1} = y^0_{t+1} \mid X^0_{t+1} = x)\, P^{g^1_{1:t}, g^2_{1:t}}(X^0_{t+1} = x \mid y^0_{1:t}, u^1_{1:t}, u^2_{1:t})}. \quad (59)
$$
The control strategy dependent term in the numerator in (59) can be written as
$$
P^{g^1_{1:t}, g^2_{1:t}}(X^0_{t+1} = x^0, X^1_{t+1} = x^1, X^2_{t+1} = x^2 \mid y^0_{1:t}, u^1_{1:t}, u^2_{1:t})
$$
$$
= \sum_{x'} \Big[ P(X^0_{t+1} = x^0, X^1_{t+1} = x^1, X^2_{t+1} = x^2 \mid X^0_t = x', u^1_t, u^2_t)\, P^{g^1_{1:t-1}, g^2_{1:t-1}}(X^0_t = x' \mid y^0_{1:t}, u^1_{1:t-1}, u^2_{1:t-1}) \Big]
$$
$$
= \sum_{x'} P(X^0_{t+1} = x^0, X^1_{t+1} = x^1, X^2_{t+1} = x^2 \mid X^0_t = x', u^1_t, u^2_t)\, \pi_t(x'). \quad (60)
$$
Similarly, the control strategy dependent term in the denominator in (59) can be written as
$$
P^{g^1_{1:t}, g^2_{1:t}}(X^0_{t+1} = x \mid y^0_{1:t}, u^1_{1:t}, u^2_{1:t}) = \sum_{x'} P(X^0_{t+1} = x \mid X^0_t = x', u^1_t, u^2_t)\, \pi_t(x'). \quad (61)
$$
APPENDIX J
PROOF OF LEMMA 6
Note that, in addition to the arguments on the left side of the conditioning in (62), we only need $\pi_t$
to evaluate the right hand side of (62).
   We can now consider the common information based belief at time $t+1$.