Part A: Regression and causality
A2: Potential outcomes and RCTs
            Kirill Borusyak
      ARE 213 Applied Econometrics
         UC Berkeley, Fall 2024
Outline
1   The concept of potential outcomes
2   Causal parameters and their identification via RCTs
3   Limitations of the Rubin causal model and alternatives
4   Causality or prediction?
Rubin causal model
   Consider some population of units i
   Each unit is observed in one of several treatment conditions Di ∈ D
     ▶   E.g. D ∈ {0, 1}: untreated and treated
   Suppose we can imagine each unit under all possible conditions (in the same
   period)
     ▶   Causality always requires specifying alternatives
     ▶   Corresponding potential outcomes are {Yi (d) : d ∈ D}
           ⋆   e.g. (Yi (0), Yi (1)) (equivalently written as (Y0i , Y1i ))
           ⋆   e.g. demand function
     ▶   Causal effects Yi (d′ ) − Yi (d) are defined by this abstraction
     ▶   Writing Yi (d) encodes a possibility that Di impacts Yi
     ▶   Realized outcome: Yi = Yi (Di )
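The notation can be sketched numerically. This is a minimal simulation with hypothetical data (a constant effect of 3, chosen only for illustration): every unit carries a full vector of potential outcomes, but only Yi = Yi(Di) is realized.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
y0 = rng.normal(10.0, 2.0, n)     # Y_i(0), outcome if untreated
y1 = y0 + 3.0                     # Y_i(1), outcome if treated (constant effect of 3)
d = rng.integers(0, 2, n)         # realized treatment D_i
y = np.where(d == 1, y1, y0)      # realized outcome Y_i = Y_i(D_i)

# The causal effect y1 - y0 is defined by the abstraction; for each unit
# only one of (y0, y1) is observed, the other is a missing counterfactual.
print(np.column_stack([d, np.round(y, 2)]))
```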
What can be a cause/treatment?
Is it meaningful to say “She did not get this position because she is a woman” (example
from Imbens 2020)?
What can be a cause/treatment?
Imagining each unit under all possible conditions is non-trivial:
           “No causation without [imagining] manipulation” (Holland & Rubin)
 1. “She did not get this position because she is a woman” ✗
       ▶   Gender is an attribute, not a cause; same for race
       ▶   “She got an orchestra job because of a gender-blind audition” (cf. Goldin and
           Rouse 2000) ✓
 2. “She did well on the exam because she was coached by her teacher” (Holland
    1986) ✓
       ▶   “She did well on the exam because she studied for it” (Holland 1986) ✗
SUTVA (1)
In writing Yi (di ) we implicitly imposed SUTVA (“stable unit treatment value
assumption”)
    Most common meaning: no unmodeled interference
       ▶   I.e., treatment statuses of other units, d−i , do not affect Yi
       ▶   Frequently violated: e.g. vaccines and infectious disease; information and
           technology adoption; equilibrium effects via prices
    Allowing for interference, we’d write Yi (d1 , . . . , dN ) for the population of size N
       ▶   We may be interested in own-treatment effects Yi (d′i , d−i ) − Yi (di , d−i ) and various
           spillover effects, e.g. Yi (di , 1, . . . , 1) − Yi (di , 0, . . . , 0)
    No interference is an exclusion restriction: Yi (di , d−i ) = Yi (di , d′−i ) ≡ Yi (di ),
    ∀di , d−i , d′−i
     Intermediate case: e.g. Yi (d⃗i ) for exposure mapping d⃗i = (di , ∑k∈Friends(i) dk )
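The intermediate case can be made concrete with a small simulation (hypothetical network and treatment draws, not from the slides): each unit's exposure is the pair (own treatment, number of treated friends).

```python
import numpy as np

# Sketch of the exposure-mapping idea: the effective treatment of unit i
# is (d_i, sum of d_k over friends of i).
rng = np.random.default_rng(5)
d = rng.integers(0, 2, 6)                       # own treatment statuses
friends = np.array([[0, 1, 1, 0, 0, 0],         # symmetric adjacency matrix
                    [1, 0, 0, 1, 0, 0],
                    [1, 0, 0, 0, 1, 0],
                    [0, 1, 0, 0, 0, 1],
                    [0, 0, 1, 0, 0, 1],
                    [0, 0, 0, 1, 1, 0]])
exposure = np.column_stack([d, friends @ d])    # rows: (d_i, sum_{k in Friends(i)} d_k)
print(exposure)
```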
SUTVA (2)
   Additional meaning of SUTVA: D summarizes everything about the intervention
   that is relevant for the outcome
   Example 1: “She got a high wage because she studied for many years”
     ▶   In writing Yi (d), we implicitly assume that school quality does not matter
     ▶   To think through violations, we could start from Yi (years, quality)
   Example 2: D = Herfindahl index of migration origins in a destination region,
   capturing migrant diversity
     ▶   Assumes that this index summarizes everything about exposure to migration
   Defining treatment variables is imposing a causal model. Don’t take it lightly!
Effects of causes vs. causes of effects
Statistical analysis focuses on effects of causes (treatments) rather than causes of
effects (outcomes)
     Causes are not clearly defined
                                  (Holland 1986, p.959)
Outline
1   The concept of potential outcomes
2   Causal parameters and their identification via RCTs
3   Limitations of the Rubin causal model and alternatives
4   Causality or prediction?
Common causal parameters (1)
   We cannot learn the causal effect Yi (1) − Yi (0) for any particular unit
     ▶   “Fundamental problem of causal inference”: multiple potential outcomes are never
         observed at once
     ▶   ... but we can sometimes learn some averages
   Average treatment/causal effect: ATE = E [Yi (1) − Yi (0)]
     ▶   ATE = E [Yi (1)] − E [Yi (0)]
     ▶   Yi (1) − Yi (0) is never observed but Yi (1) and Yi (0) are: for some but not all units
     ▶   Causal inference can be understood as imputing missing data: e.g. from
         E [Yi (1) | Di = 1] we try to learn E [Yi (1) | Di = 0] and thus E [Yi (1)]
   Conditional average treatment effect E [Yi (1) − Yi (0) | Xi = x], for predetermined
   covariates Xi
Common causal parameters (2)
   Average effect on the treated: ATT = E [Yi (1) − Yi (0) | Di = 1] (a.k.a. TOT, TT)
     ▶   Parameter depends on how selection into treatment happened
     ▶   Yields the aggregate effect of the treatment: Pop Size · P(Di = 1) · ATT
   Average effect on the untreated: ATU = E [Yi (1) − Yi (0) | Di = 0]
   All these parameters follow from the distribution of (Y(1), Y(0), D). But are they
   identified from data on (Y, D)?
Identifying ATT & ATE
          ATT = E [Y1 | D = 1] − E [Y0 | D = 1]
                = (E [Y1 | D = 1] − E [Y0 | D = 0])                  (Difference in means)
                  − (E [Y0 | D = 1] − E [Y0 | D = 0])                      (Selection bias)
   Thus, βOLS = E [Y | D = 1] − E [Y | D = 0] = ATT + Selection bias
     ▶   Selection bias = 0 iff Y0 is mean-independent of D
   ATE = ATT iff (Y1 − Y0 ) is mean-independent of D
      ▶   Simple regression identifies ATE and ATT in a randomized controlled trial
          (RCT) where (Y0 , Y1 ) ⊥⊥ D by design
     ▶   Regression with any (fixed set of) predetermined controls X also identifies ATE by
         FWL or OVB logic
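The decomposition above holds exactly in any sample. A simulation with a hypothetical selection-on-Y0 data-generating process (my own choice of parameters, not from the slides) makes this concrete: the naive difference in means splits into ATT plus selection bias.

```python
import numpy as np

# Hypothetical DGP: take-up depends on Y0, so the naive difference in
# means decomposes exactly into ATT + selection bias.
rng = np.random.default_rng(1)
n = 200_000
y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + 2.0                                       # constant treatment effect of 2
d = (y0 + rng.normal(0.0, 1.0, n) > 0).astype(int)  # selection on Y0
y = np.where(d == 1, y1, y0)

dim = y[d == 1].mean() - y[d == 0].mean()        # difference in means
att = (y1 - y0)[d == 1].mean()                   # ATT (= 2 by construction)
sel = y0[d == 1].mean() - y0[d == 0].mean()      # E[Y0|D=1] - E[Y0|D=0]

print(dim, att + sel)   # identical: DiM = ATT + selection bias
```

Here selection bias is large and positive (high-Y0 units select into treatment), so the naive comparison overstates the ATT.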
Connecting to linear models
   With a binary treatment, the potential outcomes model implies
                       Yi = Y0i (1 − Di ) + Y1i Di = β0 + β1i Di + εi
   where β0 = E [Y0 ], β1i = Y1i − Y0i and εi = Y0i − E [Y0 ]
   With homogeneous effects, Yi = β0 + β1 Di + εi becomes a causal model where
   Y1i − Y0i ≡ β1 (regardless of whether εi ⊥⊥ Di ; think IV)
   With heterogeneous effects, we can rederive our result about RCTs: if
   (εi , β1i ) ⊥⊥ Di and denoting µ = E [Di ],
       βOLS = E [(Di − µ) Yi ] / Var [Di ] = Cov [Di , εi ] / Var [Di ] + E [Di (Di − µ) β1i ] / Var [Di ]
            = E [β1i ] ≡ ATE
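A quick numerical check of this claim, under a hypothetical DGP of my own choosing: with (εi, β1i) drawn independently of a randomized Di, the OLS slope of Y on D lands on E[β1i] even though effects are heterogeneous.

```python
import numpy as np

# Sketch: (eps_i, beta_1i) independent of D_i, as in an RCT, so OLS of Y
# on D recovers E[beta_1i] = ATE despite heterogeneous effects.
rng = np.random.default_rng(2)
n = 500_000
beta1 = rng.normal(2.0, 1.0, n)          # heterogeneous effects, ATE = 2
eps = rng.normal(0.0, 1.0, n)
d = rng.integers(0, 2, n).astype(float)  # randomized treatment
y = 1.0 + beta1 * d + eps

beta_ols = np.cov(d, y)[0, 1] / np.var(d, ddof=1)
print(beta_ols)   # close to ATE = 2
```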
RCT with ordered or continuous treatments
Consider an RCT where D takes more than two values (e.g. different dosages)
     D ⊥⊥ {Y(d)}d∈D =⇒ E [Y | D = d] = E [Y(d) | D = d] = E [Y(d)]
    A saturated regression of Y on dummies for all values of D (or a nonparametric
    regression with continuous D) traces the average structural function E [Y(d)]
    A simple regression of Y on D identifies a convexly-weighted average of
    ∂E [Y(d)] /∂d (or its discrete version):
          βOLS = ∫ ω(d̃) · ∂E [Y(d̃)] /∂ d̃ dd̃,          with ω(d̃) = Cov [1[D ≥ d̃], D] / Var [D]
     or   βOLS = ∑k=1..K ωk · E [Y(dk ) − Y(dk−1 )] / (dk − dk−1 ),
                                         with ωk = (dk − dk−1 ) Cov [1[D ≥ dk ], D] / Var [D]
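The discrete weighting formula can be verified numerically. This sketch uses the example means E[Y(0)] = 0, E[Y(1)] = 3, E[Y(2)] = 4 from the next slide, plus hypothetical noise of my own choosing; the sample OLS slope matches the convexly weighted average of adjacent slopes, and the weights sum to one.

```python
import numpy as np

# Check: OLS of Y on randomized D in {0,1,2} equals the weighted average of
# adjacent-dose slopes with weights w_k = (d_k - d_{k-1}) Cov[1[D>=d_k], D] / Var[D].
rng = np.random.default_rng(3)
n = 1_000_000
ey = np.array([0.0, 3.0, 4.0])                 # average structural function E[Y(d)]
d = rng.integers(0, 3, n)                      # D randomized uniformly on {0, 1, 2}
y = ey[d] + rng.normal(0.0, 1.0, n)

d = d.astype(float)
var_d = np.var(d, ddof=1)
beta_ols = np.cov(d, y)[0, 1] / var_d

w, slopes = [], []
for k in (1, 2):                               # adjacent dosage pairs (d_{k-1}, d_k)
    cov_k = np.cov((d >= k).astype(float), d)[0, 1]   # Cov[1[D >= d_k], D]
    w.append((k - (k - 1)) * cov_k / var_d)
    slopes.append((ey[k] - ey[k - 1]) / (k - (k - 1)))

print(sum(w))                        # the weights sum to one
print(beta_ols, np.dot(w, slopes))   # OLS = convexly weighted average of slopes
```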
The importance of convex weighting
   Imagine E [Y(0)] = 0, E [Y(1)] = 3, E [Y(2)] = 4
     ▶   Higher dosage is always good (on average)
    OLS of Y on D in an RCT will produce a coefficient between
    E [(Y(2) − Y(1)) / (2 − 1)] = 1 and E [(Y(1) − Y(0)) / (1 − 0)] = 3
   An estimator without convex weighting may not: e.g.
                      2 · E [Y(2) − Y(1)] − 1 · E [Y(1) − Y(0)] = −1,
   as if higher treatment is bad
     ▶   Convex weighting avoids sign reversals. Defines a weakly causal estimand.
Distribution of gains
Heckman, Lalonde, Smith (1999) list some other interesting parameters:
 1. How widely are the gains distributed?
      a. The proportion of people taking the program who benefit from it:
         P(Y1 > Y0 | D = 1)
      b. Median gains among participants (and other quantiles)
 2. Does the program help the lower tail?
      a. Distribution of gains by untreated value: e.g. E [Y1 − Y0 | Y0 = ȳ, D = 1]
      b. Increase in % above a threshold due to a policy:
         P(Y1 > ȳ | D = 1) − P(Y0 > ȳ | D = 1)
Distribution of gains: Identification
Does an RCT identify these other parameters, e.g. median gains?
    Not without extra restrictions!
    E.g. imagine an RCT where Y takes values 0, 1, 2 with equal prob. in both treated
    and control groups
    This is consistent with (Y0 , Y1 ) taking values (0, 0) , (1, 1) , (2, 2) with equal prob.
        ▶   No causal effect for anyone. Median gain = 0
    Or with (Y0 , Y1 ) taking values (0, 1) , (1, 2) , (2, 0) with equal prob.
       ▶   Median gain = 1
Exception: P(Y1 > ȳ | D = 1) − P(Y0 > ȳ | D = 1) is identified — how?
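The slide's example can be written out in code: the two joint distributions of (Y0, Y1) below have identical marginals, so an RCT cannot tell them apart, yet they imply different median gains.

```python
import numpy as np

# Two joint distributions of (Y0, Y1) with the same marginals
# (hence the same RCT data) but different median gains.
joint_a = np.array([[0, 0], [1, 1], [2, 2]])   # (Y0, Y1) pairs, equal probability
joint_b = np.array([[0, 1], [1, 2], [2, 0]])

for joint in (joint_a, joint_b):
    y0, y1 = joint[:, 0], joint[:, 1]
    print("marginals:", sorted(y0.tolist()), sorted(y1.tolist()),
          "| median gain:", np.median(y1 - y0))
```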
Outline
1   The concept of potential outcomes
2   Causal parameters and their identification via RCTs
3   Limitations of the Rubin causal model and alternatives
4   Causality or prediction?
Criticisms by Heckman and Vytlacil 2007
 1. Estimated effects cannot be transferred to new environments (limited external
    validity) and to new programs never previously implemented
      ▶   Interventions are black boxes, with little attempt to unbundle their components
      ▶   Mechanisms cannot be pinned down
      ▶   Knowledge does not cumulate across studies (contrast with estimates of a labor
          supply elasticity — a structural parameter)
            ⋆   Counterpoint from Angrist and Pischke (2010): “Empirical evidence on any given
                causal effect is always local, derived from a particular time, place, and research
                design. Invocation of a superficially general structural framework does not make the
                underlying variation more representative. Economic theory often suggests general
                principles, but extrapolation to new settings is always speculative. A constructive
                response to the specificity of a given research design is to look for more evidence, so
                that a more general picture begins to emerge.”
Criticisms by Heckman and Vytlacil 2007 (cont.)
 2. Estimands need not be relevant even to analyze the observed policy
       ▶   Informative on whether to throw out the program entirely (ATT) and whether to
           extend it by forcing it on everyone not yet covered (ATU)
      ▶   But not whether to extend/shrink it on the margin (Heckman et al. 1999, Sec.3.4)
      ▶   Or a policy change that affects the assignment mechanism, e.g. available options
      ▶   No analysis from the social planner’s point of view, e.g. accounting for externalities
      ▶   No analysis of causal parameters other than means, e.g. median gains
    Optional exercise: read Heckman-Vytlacil’s Sec. 4.4. Do you agree with
    everything?
Roy model
Alternative “structural” approach: to model self-selection explicitly
     Original Roy (1951) model: self-selection based on outcome comparison
       ▶   D = choice of occupation (e.g. agriculture vs not) or education level
       ▶   Y(d) = earnings for a given occupation/education
       ▶   People vary by occupational productivities/returns to education, known to them
       ▶   They choose based on them: D = arg maxd∈D Yi (d), perhaps with homogeneous
           costs
     Extended Roy model: costs are heterogeneous but fully determined by observables
       ▶   which may or may not affect the outcome at the same time
     Generalized Roy model: self-selection based on unobserved preferences
       ▶   D = arg maxd∈D Ri (d) where e.g. Ri (d) = Yi (d) − Ci (d) for costs Ci (d)
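The original Roy logic can be simulated with hypothetical parameters of my own choosing: workers know their potential wages and choose the sector paying them more. Even with ATE = 0 by construction, self-selection makes naive comparisons misleading.

```python
import numpy as np

# Sketch of the original Roy model: D_i = argmax_d Y_i(d),
# with homogeneous costs normalized to zero.
rng = np.random.default_rng(4)
n = 200_000
y0 = rng.normal(1.0, 1.0, n)          # wage in sector 0
y1 = rng.normal(1.0, 1.0, n)          # wage in sector 1; ATE = 0 by construction
d = (y1 > y0).astype(int)             # self-selection on the outcomes themselves
y = np.where(d == 1, y1, y0)

ate = (y1 - y0).mean()                # ~ 0
att = (y1 - y0)[d == 1].mean()        # > 0: switchers are exactly those who gain
print(ate, att)
print(y[d == 1].mean(), y1.mean())    # E[Y1 | D=1] > E[Y1]: positive selection
```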
Roy model: Identification
What does this structure buy us?
   No free lunch: “for general skill distributions [i.e., without parametric restrictions],
   the [original Roy] model is not identified [from a single cross-section] and has no
   empirical content” (Heckman and Honore 1990)
      But with more data and restrictions one can identify the ATE and even the
      distribution of (Y0 , Y1 , R1 − R0 ) =⇒ distribution of gains
     Assumptions are often parametric: e.g. Heckman correction via normality of
     potential outcomes
       ▶   Not living up to the goal of using economic theory for identification?
     Can do better with cost shifters that shift selection but not outcomes
       ▶   Value over traditional IV methods is not so clear?
Another alternative: Directed acyclic graphs (DAGs)
Directed acyclic graphs of Judea Pearl represent causal relationships graphically: e.g.
                               X = soil treatment (fumigation)
                               Y = crop yield
                               Z1 = eelworm population before the treatment
                               Z2 = eelworm population after the treatment
                               Z3 = eelworm population at the end of season
                               Z0 = eelworm population last season (unobserved)
                               B = bird population (unobserved)
      “Do-calculus” allows one to verify whether the average total effect of X on Y is
      identified from observing (X, Y, Z1 , Z2 , Z3 )
     Popular in epidemiology but not in economics. Why?
Some limitations of DAGs
Imbens (JEL 2020) lists some pitfalls of DAGs relative to potential outcomes:
 1. Economists avoid complex models with many variables
 2. Randomization and manipulability have no special value in DAGs
 3. Too nonparametric:
      a. Not possible to incorporate additional assumptions, such as continuity (important
         in RDD) and monotonicity (important in IV) (see Maiti, Plecko, Bareinboim 2024)
      b. Too much focus on identification, relative to estimation and inference
 4. Difficult to model interference
 5. Clunky to model simultaneity, e.g. demand and supply
Outline
1   The concept of potential outcomes
2   Causal parameters and their identification via RCTs
3   Limitations of the Rubin causal model and alternatives
4   Causality or prediction?
Causality vs. prediction
     Economists obsess over causality, but sometimes prediction is the relevant goal
    The choice should be guided by the ultimate goal: decision making
    Two scenarios (see Kleinberg et al., 2015):
 1. The action/policy D ∈ {0, 1} affects the outcome Y, and the payoff (i.e., utility) π
    depends on Y
      ▶   E.g. D = rain dance in a drought, Y = it rains
                 π(d) = aY(d) − bd     =⇒    E [π(1) − π(0)] = aE [Y(1) − Y(0)] − b
      ▶   Optimal decision: D = 1 [E [Y(1) − Y(0)] ≥ b/a]
      ▶   This is a causal problem. Running an RCT is very helpful
      ▶   Better knowledge of heterogeneous causal effects E [Y(1) − Y(0) | X] based on
          observed covariates X also yields better decisions D(X)
Causality vs. prediction (2)
 2. Y is unaffected by D but the marginal payoff of actions, ∂π/∂D, depends on Y
      ▶   E.g. D = take an umbrella, Y = it rains
                       π(d) = aY · d − bd    =⇒     E [π(1) − π(0)] = aE [Y] − b
      ▶   Optimal decision: D = 1 [E [Y] ≥ b/a]
      ▶   This is a prediction problem. Running an RCT is not helpful
      ▶   Better prediction E [Y | X] yields better decisions D(X)
    Note: This scenario can also be recast as a causal problem:
      ▶   D affects Ỹ(D) = you get wet = Y · (1 − D)
      ▶   But we know potential outcome Ỹ(1) = 0
      ▶   And we have data on Ỹ(0) = Y to make a prediction of Ỹ(1) − Ỹ(0)
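The two scenarios can be summarized with toy numbers (hypothetical, not from the slides): both decisions use the same threshold rule D = 1[· ≥ b/a], but the quantity compared to the threshold is causal in scenario 1 and purely predictive in scenario 2.

```python
# Toy decision rules for the two scenarios.
a, b = 10.0, 3.0                   # payoff scale and cost of the action

# Scenario 1 (rain dance): the decision needs the causal effect E[Y(1) - Y(0)]
effect = 0.05                      # assumed average effect of the dance on rain
do_dance = effect >= b / a         # the dance does not pay

# Scenario 2 (umbrella): the decision needs only the prediction E[Y]
p_rain = 0.6                       # assumed probability of rain
take_umbrella = p_rain >= b / a    # prediction alone settles the decision

print(do_dance, take_umbrella)
```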
Policy-relevant prediction problems: Examples
 1. Eliminating futile hip and knee replacement surgeries
      ▶   Surgery has costs: monetary + painful recovery
      ▶   Benefits depend on life expectancy
       ▶   Kleinberg et al. (2015) show that 10% (1%) of patients have a predictable
           probability of dying within a year of 24% (44%), for reasons unrelated to this
           surgery
 2. Improving admissions by predicting college success
      ▶   Geiser and Santelices (2007) show that high-school GPA is a better predictor of
          performance at UC colleges than SAT
      ▶   If UC had to reduce admissions, rejecting applicants with marginal GPAs would
          result in losing fewer good students than rejecting marginal SAT applicants
 3. See Kleinberg et al. “Human Decisions and Machine Predictions” (2018) for a
    more subtle example on bail decisions by judges