Development planning and project analysis II
Chapter five
                                   5. Impact assessment basics
                                         5.1. Introduction
The rationale for a program that draws on public resources is to improve a selected outcome over
what it would have been without the program. The evaluator’s main problem is to measure the
impact or effect of an intervention so that policy makers can decide whether the program
intervention is worth supporting and whether the program should be continued, expanded, or
disbanded.
Impact evaluation is an effort to understand whether the changes in well-being are indeed due to
project or program intervention. Specifically, impact evaluation tries to determine whether it is
possible to identify the program effect and to what extent the measured effect can be attributed to
the program and not to some other causes. Impact evaluation focuses on the later stages of the
M&E logical framework (logframe), namely outcomes and impacts.
Impact evaluation is not imperative for each and every project. Impact evaluation is time and
resource intensive and should therefore be applied selectively. Policy makers may decide
whether to carry out an impact evaluation on the basis of the following criteria:
    The program intervention is innovative and of strategic importance.
    The impact evaluation exercise helps fill the knowledge gap about what works and what
     does not. (Data availability and quality are fundamental requirements for this exercise.)
An operational evaluation, by contrast, seeks to understand whether implementation of a program
unfolded as planned; an impact evaluation asks whether the program actually changed the selected
outcomes relative to what they would have been without it.
                       5.1.1. Qualitative versus Quantitative Impact Assessments
Governments, donors, and other practitioners in the development community are keen to
determine the effectiveness of programs with far-reaching goals such as lowering poverty or
increasing employment. Answering such policy questions is often possible only through impact
evaluations based on hard evidence from survey data or through related quantitative approaches.
This chapter focuses on quantitative impact methods rather than on qualitative impact
assessments. Qualitative information, such as an understanding of the local socio-cultural and
institutional context as well as program and participant details, is nonetheless essential to a sound
quantitative assessment. A qualitative assessment on its own, however, cannot assess outcomes against
relevant alternatives or counterfactual outcomes. That is, it cannot really indicate what might
happen in the absence of the program. Quantitative analysis is also important in addressing
potential statistical bias in program impacts. A mixture of qualitative and quantitative methods (a
mixed-methods approach) might therefore be useful in gaining a comprehensive view of the
program’s effectiveness.
         5.1.2. Quantitative Impact Assessment: Ex post versus ex ante Impact Evaluation
There are two types of quantitative impact evaluations: ex post and ex ante. An ex ante impact
evaluation attempts to measure the intended impacts of future programs and policies, given a
potentially targeted area’s current situation, and may involve simulations based on assumptions
about how the economy works. Many times, ex ante evaluations are based on structural models
of the economic environment facing potential participants. These models predict program
impacts.
Ex post evaluations, in contrast, measure actual impacts accrued by the beneficiaries that are
attributable to program intervention. Ex post evaluations have immediate benefits and reflect
reality. These evaluations, however, sometimes miss the mechanisms underlying the program’s
impact on the population, which structural models aim to capture and which can be very
important in understanding program effectiveness (particularly in future settings). Ex post
evaluations can also be much more costly than ex ante evaluations because they require
collecting data on actual outcomes for participant and nonparticipant groups, as well as on other
accompanying social and economic factors that may have determined the course of the
intervention. An added cost in the ex post setting is the possibility that the intervention fails, a
failure that might have been predicted through ex ante analysis.
                                 5.2. Methodologies in impact evaluation
                                 5.2.1. The Problem of the Counterfactual
The main challenge of an impact evaluation is to determine what would have happened to the
beneficiaries if the program had not existed. That is, one has to determine what the outcome of
interest, for example the per capita household income of beneficiaries, would have been in the
absence of the intervention. A beneficiary’s outcome in the absence of the intervention is its counterfactual.
A program or policy intervention seeks to change the well-being of intended
beneficiaries. Ex post, one observes outcomes of this intervention on intended beneficiaries, such
as employment or expenditure. Does this change relate directly to the intervention? Has this
intervention caused expenditure or employment to grow? Not necessarily. In fact, with only a
point observation after treatment, it is impossible to reach a conclusion about the impact. At best
one can say whether the objective of the intervention was met. But the result after the
intervention cannot be attributed to the program itself.
The problem of evaluation is that while the program’s impact (independent of other factors) can
truly be assessed only by comparing actual and counterfactual outcomes, the counterfactual is
not observed. So the challenge of an impact assessment is to create a convincing and reasonable
comparison group for beneficiaries in light of this missing data. Ideally, one would like to
compare how the same household or individual would have fared with and without an
intervention or “treatment.” But one cannot do so because at a given point in time a household or
an individual cannot have two simultaneous existences—a household or an individual cannot be
in the treated and the control groups at the same time. Finding an appropriate counterfactual
constitutes the main challenge of an impact evaluation. How about a comparison between treated
and non-treated groups when both are eligible to be treated? How about a comparison of
outcomes of treated groups before and after they are treated? These potential comparison groups
can be “counterfeit” counterfactuals.
Looking for a Counterfactual: With-and-Without Comparisons
Suppose that the Amhara Credit and Savings Institution (ACSI) provided credit to poor women in
the hope of improving their per capita consumption. If post-intervention data show that the per
capita income of participants is lower than that of non-participants, does that mean ACSI has
failed? Not necessarily. A comparison of treated and non-treated outcomes alone does not provide
the full picture. Assume now that after the credit intervention the income of the treated group is Y4
and that of the non-treated group is Y3, as shown in figure 2.2. The with-and-without comparison
measures the program effect as Y4 − Y3. Such a comparison could be deceptive, because incomes
could have differed across the control and treated groups even before the intervention. If one knew
that (Y0, Y2) was the counterfactual income path of participants, the real estimate of the program
effect would be Y4 − Y2.
Looking for a Counterfactual: Before-and-After Comparisons
Another counterfeit counterfactual could be a comparison between the pre- and post-program
outcomes of participants. As shown in figure 2.3, one then has two points of observations for the
beneficiaries of an intervention: pre-intervention income (Y0) and post-intervention income (Y2).
Accordingly, the program’s effect might be estimated as (Y2 − Y0). The literature refers to this
approach as the reflexive method of impact evaluation, in which participants’ outcomes before the
intervention serve as the comparison or control outcomes.
Does this method offer a realistic estimate of the program’s effect? Probably not. Indeed, such a
simple difference method would not be an accurate assessment because many other factors
(outside of the program) may have changed over the period. Not controlling for those other
factors means that one would falsely take Y0 as the participants’ outcome in the absence of the
program, when it might in fact have been Y1.
Reflexive comparisons may be useful in evaluations of full-coverage interventions such as
nationwide policies and programs in which the entire population participates and there is no
scope for a control group.
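To make the two “counterfeit” counterfactuals concrete, the following minimal Python sketch compares them with the true program effect. All income values are hypothetical and chosen only for illustration; they are not taken from figures 2.2 or 2.3.

# Hypothetical incomes, assumed purely for illustration.
participant_before = 100       # participants' income before the program
participant_after = 140        # participants' income after the program
nonparticipant_after = 120     # non-participants' income after the program
counterfactual_after = 125     # what participants would have earned without the program (unobserved)

# With-and-without comparison: treated vs. non-treated after the intervention.
with_without = participant_after - nonparticipant_after      # 20

# Before-and-after (reflexive) comparison: participants' own change over time.
before_after = participant_after - participant_before        # 40

# True program effect: observed outcome minus the unobserved counterfactual.
true_effect = participant_after - counterfactual_after       # 15

print(with_without, before_after, true_effect)   # both naive estimates miss the true effect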
       5.2.2. The Problem of Selection Bias
An impact evaluation is essentially a problem of missing data, because one cannot observe the
outcomes of program participants had they not been beneficiaries. Without information on the
counterfactual, the next best alternative is to compare outcomes of treated individuals or
households with those of a comparison group that has not been treated. In doing so, one attempts
to pick a comparison group that is very similar to the treated group, such that those who received
treatment would have had outcomes similar to those in the comparison group in absence of
treatment.
Successful impact evaluations hinge on finding a good comparison group. There are two broad
approaches that researchers resort to in order to mimic the counterfactual of a treated group:
   (a) Create a comparator group through a statistical design, or
   (b) Modify the targeting strategy of the program itself to wipe out differences that would
       have existed between the treated and non-treated groups before comparing outcomes
       across the two groups.
Equation 2.1 presents the basic evaluation problem comparing outcomes Y across treated and
non-treated individuals i:
                                 Yi = αXi + βTi + εi ……………………(2.1)
Here, T is a dummy equal to 1 for those who participate and 0 for those who do not participate.
X is a set of other observed characteristics of the individual and perhaps of his or her household
and local environment. Finally, ε is an error term reflecting unobserved characteristics that also
affect Y. Equation 2.1 reflects an approach commonly used in impact evaluations, which is to
measure the direct effect of the program T on outcomes Y.
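As a minimal sketch of how a regression like equation 2.1 might be estimated in practice, the Python snippet below runs an OLS regression of an outcome on a treatment dummy and an observed control using statsmodels. The data are simulated and the variable names (Y, T, x1) and coefficient values are purely illustrative; an intercept is included for estimation.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated, purely illustrative data: outcome Y, treatment dummy T, observed control x1.
rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "T": rng.integers(0, 2, n),          # participation dummy (1 = participant)
    "x1": rng.normal(size=n),            # an observed household characteristic
})
df["Y"] = 1.0 + 2.0 * df["T"] + 0.5 * df["x1"] + rng.normal(size=n)

# Equation 2.1: regress the outcome on the treatment dummy and observed characteristics.
fit = smf.ols("Y ~ T + x1", data=df).fit()
print(fit.params["T"])                   # the estimated program effect (beta)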
The problem with estimating equation 2.1 is that treatment assignment is often not random,
because of the following factors:
   (a) purposive program placement and
   (b) self-selection into the program.
That is, programs are placed according to the need of the communities and individuals, who in
turn self-select given program design and placement. Self-selection could be based on observed
characteristics, unobserved factors, or both. In the case of unobserved factors, the error term in
the estimating equation will contain variables that are also correlated with the treatment dummy
T. One cannot measure—and therefore account for—these unobserved characteristics in equation
2.1, which leads to unobserved selection bias. That is, cov (T, ε) ≠0 implies the violation of one
of the key assumptions of ordinary least squares in obtaining unbiased estimates: independence
of regressors from the disturbance term ε. The correlation between T and ε naturally biases the
other estimates in the equation, including the estimate of the program effect β.
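The following simulation sketch illustrates the point: an unobserved factor (here labelled “ability”, a hypothetical variable) drives both participation and the outcome, so cov(T, ε) ≠ 0 and a naive OLS regression of Y on T overstates the program effect. All data-generating values are assumed for illustration only.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 10_000
ability = rng.normal(size=n)                            # unobserved characteristic
# Self-selection: units with higher unobserved "ability" are more likely to participate.
T = (ability + rng.normal(size=n) > 0).astype(float)
# The true program effect is 2, but ability also raises the outcome directly,
# so ability ends up in the error term and cov(T, e) != 0.
Y = 1.0 + 2.0 * T + 1.5 * ability + rng.normal(size=n)

beta_hat = sm.OLS(Y, sm.add_constant(T)).fit().params[1]
print(beta_hat)                                         # clearly above the true effect of 2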
This problem can also be represented in a more conceptual framework. Suppose one is
evaluating an antipoverty program, such as a credit intervention, aimed at raising household
incomes. Let Yi represent the income per capita for household i. For participants, Ti= 1, and the
value of Yi under treatment is represented as Yi (1). For nonparticipants, Ti= 0, and Yi can be
represented as Yi (0). If Yi (0) is used across nonparticipating households as a comparison
outcome for participant outcomes Yi (1), the average effect of the program might be represented
as follows:

                  D = E(Yi(1) | Ti = 1) − E(Yi(0) | Ti = 0) ……………………(2.2)
The problem is that the treated and non-treated groups may not be the same prior to the
intervention, so the expected difference between those groups may not be due entirely to
program intervention. If, in equation 2.2, one then adds and subtracts the expected outcome for
nonparticipants had they participated in the program—E(Yi(0) | Ti = 1), which is another way to
specify the counterfactual—one gets

   D = E(Yi(1) | Ti = 1) − E(Yi(0) | Ti = 1) + [E(Yi(0) | Ti = 1) − E(Yi(0) | Ti = 0)] ……………(2.3)

                                 D = ATE + B ……………………(2.4)
In these equations, ATE is the average treatment effect. It is the average gain in outcomes of
participants relative to nonparticipants, as if nonparticipating households were also treated. The
ATE corresponds to a situation in which a randomly chosen household from the population is
assigned to participate in the program, so participating and nonparticipating households have an
equal probability of receiving the treatment T.
The term B is the extent of selection bias that arises when D is used as an estimate of the ATE.
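A small numeric example of the decomposition D = ATE + B, with group means assumed only to make the arithmetic visible:

# Hypothetical group means, assumed purely for illustration.
mean_Y1_treated = 150      # E[Y(1) | T = 1]: observed mean outcome of participants
mean_Y0_untreated = 120    # E[Y(0) | T = 0]: observed mean outcome of nonparticipants
mean_Y0_treated = 130      # E[Y(0) | T = 1]: unobserved counterfactual mean for participants

D = mean_Y1_treated - mean_Y0_untreated     # naive difference in observed means = 30
ATE = mean_Y1_treated - mean_Y0_treated     # average gain for participants      = 20
B = mean_Y0_treated - mean_Y0_untreated     # selection bias                     = 10
assert D == ATE + B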
                                 5.3. Approaches to ex post impact evaluation
There are different evaluation approaches to ex post impact evaluation. Each of these methods
carries its own assumptions about the nature of potential selection bias in program targeting and
participation, and the assumptions are crucial to developing the appropriate model to determine
program impacts. These methods, each of which will be discussed in detail throughout the
following chapters, include
     1. Randomized evaluations
     2. Matching methods, specifically propensity score matching (PSM)
     3. Double-difference (DD) methods
     4. Instrumental variable (IV) methods
     5. Regression discontinuity (RD) design and pipeline methods
     6. Distributional impacts
     7. Structural and other modeling approaches
These methods vary by their underlying assumptions regarding how to resolve selection bias in
estimating the program treatment effect. Randomized evaluations involve a randomly allocated
initiative across a sample of subjects (communities or individuals, for example); the progress of
treatment and control subjects exhibiting similar preprogram characteristics is then tracked
over time. Randomized experiments have the advantage of avoiding selection bias at the level of
randomization. In the absence of an experiment, PSM methods compare treatment effects across
participant and matched nonparticipant units, with the matching conducted on a range of
observed characteristics. PSM methods therefore assume that selection bias is based only on
observed characteristics; they cannot account for unobserved factors affecting participation.
    1) Randomized evaluations
Setting the counterfactual
Finding a proper counterfactual to treatment is the main challenge of impact evaluation. The
counterfactual indicates what would have happened to participants of a program had they not
participated. However, the same person cannot be observed in two distinct situations—being
treated and untreated at the same time.
The main conundrum, therefore, is how researchers formulate counterfactual states of the world
in practice. In some disciplines, such as medical science, evidence about counterfactuals is
generated through randomized trials, which ensure that outcomes in the control group really do
capture the counterfactual for a treatment group.
Figure 3.1 illustrates the case of randomization graphically. Consider a random distribution of
two “similar” groups of households or individuals—one group is treated and the other group is
not treated. They are similar or “equivalent” in that both groups prior to a project intervention are
observed to have the same level of income (in this case, Y0). After the treatment is carried out,
the observed income of the treated group is found to be Y2 while the income level of the control
group is Y1. Therefore, the effect of program intervention can be described as (Y2− Y1), as
indicated in figure 3.1. Extreme care must be taken in selecting the control group to ensure
comparability.
In practice, however, it can be very difficult to ensure that a control group is very similar to
project areas, that the treatment effects observed in the sample are generalizable, and that the
effects themselves are a function of only the program itself.
Statisticians have proposed a two-stage randomization approach to address these concerns. In the
first stage, a sample of potential participants is selected randomly from the relevant population.
This sample should be representative of the population, within a certain sampling error. This
stage ensures external validity of the experiment. In the second stage, individuals in this sample
are randomly assigned to treatment and comparison groups, ensuring internal validity in that
subsequent changes in the outcomes measured are due to the program instead of other factors.
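A minimal sketch of the two-stage procedure; the population size, sample size, and equal split are assumed for illustration.

import numpy as np

rng = np.random.default_rng(42)
population = np.arange(100_000)                 # identifiers for the relevant population (assumed size)

# Stage 1: draw a random, representative sample of potential participants (external validity).
sample = rng.choice(population, size=2_000, replace=False)

# Stage 2: randomly split that sample into treatment and comparison groups (internal validity).
shuffled = rng.permutation(sample)
treatment_group, comparison_group = shuffled[:1_000], shuffled[1_000:]
print(len(treatment_group), len(comparison_group))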
Calculating Treatment Effects
Randomization can correct for the selection bias by randomly assigning individuals or groups to
treatment and control groups. Let the treatment, Ti, be equal to 1 if subject i is treated and 0 if
not. Let Yi (1) be the outcome under treatment and Yi (0) if there is no treatment. Strictly
speaking, the treatment effect for unit i is Yi (1) – Yi (0), and the ATE is ATE = E [Yi (1) – Yi
(0)], or the difference in outcomes from being in a project relative to control area for a person or
unit i randomly drawn from the population. This formulation assumes, for example, that
everyone in the population has an equally likely chance of being targeted.
Generally, however, only E [Yi (1) |Ti= 1], the average outcomes of the treated, conditional on
being in a treated area, and E [Yi (0) |Ti= 0], the average outcomes of the untreated, conditional
on not being in a treated area, are observed. With nonrandom targeting and observations on only
a subsample of the population, E[Yi(1)] is not necessarily equal to E[Yi(1)|Ti= 1], and E[Yi(0)] is
not necessarily equal to E[Yi(0)|Ti= 0].
Treatment Effect with Pure Randomization
Randomization can be set up in two ways: pure randomization and partial randomization. If
treatment were conducted purely randomly following the two-stage procedure outlined
previously, then treated and untreated households would have the same expected outcome in the
absence of the program. Then, E [Yi (0) |Ti= 1] is equal to E [Yi (0)|Ti= 0]. Because treatment
would be random, and not a function of unobserved characteristics (such as personality or other
tastes) across individuals, outcomes would not be expected to have varied for the two groups had
the intervention not existed. Thus, selection bias becomes zero under the case of randomization.
Consider the case of pure randomization, where a sample of individuals or households is
randomly drawn from the population of interest. The experimental sample is then divided
randomly into two groups: (a) the treatment group that is exposed to the program intervention
and (b) the control group that does not receive the program. In terms of a regression, this
exercise can be expressed as
                                 Yi = α + βTi + εi ……………………(3.5)
where Ti is the treatment dummy equal to 1 if unit i is randomly treated and 0 otherwise. As
above, Yi is defined as
                 Yi≡ [Yi (1)*Ti] + [Yi (0)*(1 – Ti)]. ………………………….(3.6)
If treatment is random (so that T and ε are independent), equation 3.5 can be estimated by
ordinary least squares (OLS), and the OLS estimate of the treatment effect is the difference in mean
outcomes between the treated and the control groups. If a randomized evaluation is correctly designed
and implemented, an unbiased estimate of the impact of a program can be found.
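As a companion to the earlier selection-bias simulation, the sketch below assumes treatment is assigned purely at random; the same OLS regression of equation 3.5 then recovers the true effect, since T is independent of the error term. All numbers are illustrative.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 10_000
ability = rng.normal(size=n)                     # unobserved characteristic, as before
T = rng.integers(0, 2, n).astype(float)          # treatment now assigned purely at random
Y = 1.0 + 2.0 * T + 1.5 * ability + rng.normal(size=n)

# Random assignment makes T independent of the error term (which absorbs ability),
# so OLS on equation 3.5 recovers the treatment effect without bias.
beta_hat = sm.OLS(Y, sm.add_constant(T)).fit().params[1]
print(beta_hat)                                  # close to the true effect of 2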
Concerns with randomization
Several concerns warrant consideration with a randomization design, including ethical issues,
external validity, partial or lack of compliance, selective attrition, and spillovers.
Withholding a particular treatment from a random group of people and providing access to
another random group of people may be simply unethical. Carrying out a randomized design is
often politically infeasible because it is hard to justify such a design to people who might benefit
from the program but are denied access to it. Consequently, convincing potential partners to carry
out randomized designs is difficult.
External validity is another concern. A project of small-scale job training may not affect overall
wage rates, whereas a large-scale project might. That is, impact measured by the pilot project
may not be an accurate guide of the project’s impact on a national scale. The problem is how to
generalize and replicate the results obtained through randomized evaluations.
Compliance may also be a problem with randomization, which arises when a fraction of the
individuals who are offered the treatment do not take it. Conversely, some members of the
comparison group may receive the treatment. This situation is referred to as partial (or imperfect)
compliance. To be valid and to prevent selection bias, an analysis needs to focus on groups
created by the initial randomization. The analysis cannot exclude subjects or cut the sample
according to behavior that may have been affected by the random assignment. More generally,
interest often lies in the effect of a given treatment, but the randomization affects only the
probability that the individual is exposed to the treatment, rather than the treatment itself.
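A minimal sketch of this principle: outcomes are compared across the groups created by the initial random assignment (sometimes called an intention-to-treat comparison), not across those who actually took up the treatment. The take-up rates and effect size below are assumed for illustration only.

import numpy as np

rng = np.random.default_rng(3)
n = 5_000
assigned = rng.integers(0, 2, n)                 # initial random assignment (1 = offered the program)
# Partial compliance, with assumed rates: 80% of those offered take up the program,
# and 10% of the comparison group obtains the treatment anyway.
takeup = np.where(assigned == 1, rng.random(n) < 0.8, rng.random(n) < 0.1).astype(float)
Y = 1.0 + 2.0 * takeup + rng.normal(size=n)      # outcome responds to actual take-up

# Compare outcomes across the groups created by the initial randomization,
# not across those who actually received the treatment.
effect_by_assignment = Y[assigned == 1].mean() - Y[assigned == 0].mean()
print(effect_by_assignment)                      # diluted by noncompliance, but free of selection bias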
Also, potential spillover effects arise when treatment helps the control group as well as the
sample participants, thereby confounding the estimates of program impact. For example, people
outside the sample may move into a village where health clinics have been randomly established,
thus contaminating program effects.
   2) Matching Methods- Propensity score matching (PSM)
Despite these concerns with implementation, randomization remains, in theory, the ideal impact
evaluation method. When a treatment cannot be randomized, the next best thing to do is to try to
mimic randomization—that is, to construct an observational
analogue of a randomized experiment. With matching methods, one tries to develop a
counterfactual or control group that is as similar to the treatment group as possible in terms of
observed characteristics. The idea is to find, from a large group of nonparticipants, individuals
who are observationally similar to participants in terms of characteristics not affected by the
program (these can include preprogram characteristics, for example, because those clearly are
not affected by subsequent program participation). Each participant is matched with an
observationally similar nonparticipant, and then the average difference in outcomes across the
two groups is compared to get the program treatment effect. If one assumes that differences in
participation are based solely on differences in observed characteristics, and if enough
nonparticipants are available to match with participants, the corresponding treatment effect can
be measured even if treatment is not random.
The problem is to credibly identify groups that look alike. Identification is a problem because
even if households are matched along a vector, X, of different characteristics, one would rarely
find two households that are exactly similar to each other in terms of many characteristics.
Because many possible characteristics exist, a common way of matching households is
propensity score matching. In PSM, each participant is matched to a nonparticipant on the basis
of a single propensity score, reflecting the probability of participating conditional on their
different observed characteristics X. PSM therefore avoids the “curse of dimensionality”
associated with trying to match participants and nonparticipants on every possible characteristic
when X is very large.
PSM constructs a statistical comparison group by modeling the probability of participating in the
program on the basis of observed characteristics unaffected by the program. The average
treatment effect of the program is then calculated as the mean difference in outcomes across
these two groups. On its own, PSM is useful when only observed characteristics are believed to
affect program participation. This assumption hinges on the rules governing the targeting of the
program, as well as any factors driving self-selection of individuals or households into the
program. Ideally, if available, pre-program baseline data on participants and nonparticipants can
be used to calculate the propensity score and to match the two groups on the basis of the
propensity score.
The PSM approach tries to capture the effects of different observed covariates X on participation
in a single propensity score or index. Then, outcomes of participating and nonparticipating
households with similar propensity scores are compared to obtain the program effect.
Households for which no match is found are dropped because no basis exists for comparison.
To calculate the program treatment effect, one must first calculate the propensity score P(X) on
the basis of all observed covariates X that jointly affect participation and the outcome of interest.
The aim of matching is to find the closest comparison group from a sample of nonparticipants to
the sample of program participants. “Closest” is measured in terms of observable characteristics
not affected by program participation.
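The sketch below illustrates one simple way the PSM steps described above might be implemented: a logit model of participation on observed X gives the propensity score, each participant is matched to the nonparticipant with the nearest score, and the treatment effect on the treated is the mean outcome gap across matched pairs. The data are simulated, and the matching rule (single nearest neighbour, no caliper) is a deliberate simplification.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 3_000
X = rng.normal(size=(n, 2))                                   # observed characteristics
# Participation depends only on observed X (the key PSM assumption), plus noise.
T = (X @ np.array([0.8, -0.5]) + rng.normal(size=n) > 0).astype(float)
Y = 1.0 + 2.0 * T + X @ np.array([1.0, 0.5]) + rng.normal(size=n)

# Step 1: propensity score = predicted probability of participation given X (logit model).
pscore = sm.Logit(T, sm.add_constant(X)).fit(disp=0).predict(sm.add_constant(X))

# Step 2: match each participant to the nonparticipant with the closest propensity score.
treated, control = np.where(T == 1)[0], np.where(T == 0)[0]
nearest = np.abs(pscore[control][None, :] - pscore[treated][:, None]).argmin(axis=1)
matched_controls = control[nearest]

# Step 3: the treatment effect on the treated is the mean outcome gap across matched pairs.
att = (Y[treated] - Y[matched_controls]).mean()
print(att)                                                    # should be close to the true effect of 2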
Critiquing the PSM Method
The main advantage (and drawback) of PSM is that it relies on the degree to which observed
characteristics drive program participation. If selection bias from unobserved characteristics is
likely to be negligible, then PSM may provide a good comparison with randomized estimates. To
the degree that the observed variables driving participation are incomplete, the PSM results can be
suspect. This condition is, as mentioned earlier, not directly testable; it requires careful examination
of the factors driving program participation (through surveys, for example).
Another advantage of PSM is that it does not necessarily require a baseline or panel survey.