Step 4: Manage development of the evaluation design

The design of a program evaluation sets out the research methods that will be used to provide evidence for the key evaluation questions. The design defines the data that is needed for the evaluation, and when and how it will be collected.

The evaluation design needs to ensure that the evaluation will be as rigorous and systematic as possible, while meeting needs for utility, feasibility and ethics.

This section outlines some evaluation design issues for process, outcome and economic evaluations.

Who should develop the evaluation design?

Good evaluation design is critical to an evaluation's overall:

credibility
defensibility
utility.

A program team doing smaller-scale evaluations could have the expertise to develop designs that address descriptive questions. Larger designs may need evaluation expertise from:

evaluation units in the agency
elsewhere in government
external evaluation providers.

The manager doing the evaluation can get advice on the evaluation design, including who should develop it, from:

government evaluation units
the steering committee
the advisory group.

These groups can also review the quality of the proposed evaluation design.

Specialist expertise might be needed to gather data about hard-to-measure outcomes or from hard-to-reach populations. Specialists can also develop an evaluation design that adequately addresses causal attribution in outcome and/or economic evaluations. Experience is needed to:

understand the feasibility of applying particular designs within the context of a program, and
be sensitive to likely ethical and cultural issues.

If you use external providers to develop the design, they might provide an initial evaluation design in their response to the Request for Tender (RFT). This will be reviewed and revised in developing the workplan for the evaluation.

Alternatively, the development of the design may be commissioned as a separate project, so that the design becomes part of the information included in the RFT.

What type of evaluation should be used?

The key evaluation questions will influence the type of evaluation and the methods for data collection and analysis.

Different evaluation types

The evaluation project may use one or more types of evaluation, including process, outcome or economic. The data and findings from individual types of evaluation should inform the other types.

For example, a process evaluation may consider interim outcome data to assess program implementation. An outcome evaluation may rely on evidence from implementation gathered during a process evaluation, to better understand how program delivery contributed to program outcomes. Findings from process and outcome evaluation may inform the design of economic evaluations.

Quantitative, qualitative or mixed methods

The methods for data collection and analysis should be appropriate to the purpose and scope of the evaluation. Most program evaluations will collect both quantitative data (numbers) and qualitative data (text, images) in a mixed methods design to produce a more complete understanding of a program. A combination of qualitative and quantitative data can improve a program evaluation by ensuring that the limitations of one type of data are balanced by the strengths of another. It is important to plan in advance how these will be combined.

Quantitative methods are used to measure the extent and pattern of outcomes across a program using:

surveys
outcome measures
administrative data.

Qualitative methods use:

observation
in-depth interviews
focus groups.

Qualitative methods explore, in detail, the behaviour of people and organisations and enrich quantitative findings. Qualitative methods help to understand the 'how and why', including whether the program is likely to be the cause of any measured change. In programs where outcomes are not achieved, qualitative data can help to understand whether this is because of program failure or implementation failure.

Balancing rigour, utility, feasibility and ethics

Part of evaluation design is investigating questions of rigour, utility, feasibility and ethical safeguards. These questions lead to a final design that is as rigorous as possible while delivering a useful, practical evaluation that protects participants from harm.

The evaluation design needs to balance these four elements, so design is often an iterative process. For example, there may be a trade-off between rigour and utility. A very accurate and comprehensive evaluation might not be completed in time to inform key decisions. In this case it might be better to include both short-term and long-term outcomes, so that the initial assessment of whether a program is working can be followed up by a more comprehensive assessment. However, this would require greater resources, especially for tracking client outcomes over time if these data are not already being collected, which might reduce feasibility.

There may also be a trade-off between rigour and feasibility. Decision-makers may be interested in the effectiveness of a new program and seek an outcome evaluation. This may not be feasible in the first two years of the program while program processes are still being developed and rolled out. A feasible scenario may be a process evaluation followed by a longer time frame for an outcome evaluation.

Designs for process evaluation

Process evaluations explore evaluation questions about program implementation. They may:

describe implementation processes and the pattern of uptake of or engagement with services
check whether a program is being implemented as expected
differentiate bad design (theory failure) from poor implementation (implementation failure).

Process evaluations can be used periodically to undertake cycles of program improvement by informing adjustments to delivery or testing alternative program delivery processes. For pilots, new programs and innovations within a program, process evaluations document how the program is being implemented.

Key evaluation question	Evidence required	Possible methods or data sources
How well has the program been established?	Description of program development compared with client needs, timeframes. Quality of governance, relationships. Influence of different factors and contexts. Initial evidence of uptake.	Program reports, key informant interviews, consultations with managers or service providers, program forums.
How is the program being implemented?	Description of implementation processes by different providers and in different circumstances. The extent that implementation processes met milestones and targets for outputs, timeliness, cost, participation and immediate outcomes. The quality of outputs and immediate outcomes measured against standards and targets. The pattern of outputs, uptake and immediate outcomes, by different sub-groups or in different contexts. Client or customer satisfaction.	Program monitoring data and other program records. Observation including photography and video. Interviews, surveys or focus groups with managers, staff, program clients, referring agencies. Consultations with managers or service providers.
Is the program being implemented well?	As above plus information about good practice in implementation processes.	Expert review of program documents, or observations during site visits.

Process evaluations are often designed using the program logic to collect evidence that describes the outputs and immediate outcomes. This may cover:

program reach and uptake across intended target groups
actual implementation processes
participant satisfaction
standards of implementation such as quality, efficiency and cost
the influence of different contexts and other factors on implementation.

A rigorous and systematic process evaluation should bring together evidence from different data sources to answer the evaluation questions. The design for a process evaluation will depend upon:

the size of the program
the scale of the evaluation
the extent to which data on program implementation and uptake is collected through the program's monitoring system.

Process evaluation uses quantitative and qualitative data collection and analysis methods. Quantitative methods typically involve analysing program reach or staff/consumer experiences using surveys and administrative data. Qualitative methods include:

observation studies
interviews
group processes
audits
expert reviews
case studies.

Designs for outcomes evaluation

Outcome evaluation aims to determine whether the program caused demonstrable effects on the defined target outcomes. Most significant programs should seek a rigorous outcome evaluation to demonstrate that:

the investment in the program is worthwhile
there are no major unintended consequences.

Outcome evaluation may also be called impact or results evaluation.

An outcome evaluation should identify:

the pattern of outcomes achieved (for whom, in what ways, and in what circumstances)
any unintended impacts (positive and negative).

Outcome evaluation should examine the ways the program contributed to outcomes, and the influence of other factors.

Depending on the scale and maturity of the program, it may be possible to build in strong evaluation designs when the program itself is being designed. This is ideal as it can facilitate a more rigorous outcome evaluation after the program has become operational.

Before embarking on the design for an outcome evaluation, it may be helpful to work through the following key questions:

What are the outcomes the program aims to achieve?
Are there suitable existing data that actually measure the outcomes of interest? Consider an evaluability assessment.
If not, would it be possible to collect data on outcomes?
Can the counterfactual be estimated in some way? Is there scope to use data from a comparison group? If not, what alternative approach to causal inference should be used?

Key aspects of an outcome evaluation are:

measuring or describing the outcomes (and other important variables)
explaining whether the intervention was the cause of observed outcomes.

Measuring or describing the outcomes (and other important variables)

An outcome evaluation relies on valid and systematic evidence for program outcomes. It is useful to identify any data already available from existing sources, such as:

program monitoring data
relevant statistics
previous evaluation and research projects.

Additional data can be gathered to fill in gaps or improve the quality of existing data using methods such as:

interviews (individual and group, structured, semi-structured or unstructured)
questionnaires
direct measurement.

Descriptions of outcomes should not only report the average effect, but also how varied the results were, especially the patterns for key variables of interest, such as different participant characteristics. It is important to show:

in which contexts the program is more effective
which target groups benefit most
what environmental settings influence the outcomes.

An outcome evaluation may rely on evidence from a process evaluation about program implementation and experiences to gain a better understanding of the drivers affecting program outcomes. Information is also needed about the different contexts in which the program was implemented to understand if a program only works in particular situations.

Explaining whether the intervention was the cause of observed outcomes

An outcome evaluation not only gathers evidence of outcomes, but seeks to assess and understand the program's role in producing the outcomes. The program is rarely the sole cause of changes. The program usually works in combination with other programs or activities and other environmental factors. Therefore, 'causal attribution' does not usually refer to total attribution (that is, the program was the only cause), but to partial attribution or to analysing the program's contribution. This is sometimes referred to as 'plausible contribution'.

In agricultural research, for example, outcomes in terms of improved productivity can be due to a combination of:

basic and applied research
product development
communication programs.

An investment in any one of the above might not be solely responsible for the productivity outcomes. Each investment might be essential, but would not have been able to produce the outcomes without the other programs. In other words, any one program may have been necessary but not sufficient to bring about that outcome.

Three approaches to investigating causal attribution or plausible contribution are:

the counterfactual – comparing the outcomes with an estimate of what would have happened in the absence of the program
the factual – analysis of the patterns of outcomes, and comparing how actual results match what was expected
alternative explanations – investigate and rule out other explanations.

In some cases, all 3 approaches to causal attribution can be included in the same evaluation design. In complex situations, it might not be possible to estimate a counterfactual, and causal analysis will rely on other approaches. Selecting an outcome evaluation design involves systematically deciding between the options.

Designs for economic evaluation

When economic evaluation is used for program evaluation, it addresses questions of efficiency by standardising outcomes in terms of their dollar value. This approach is sometimes referred to as assessing value for money.

Economic evaluation is used to assess whether the program has been cost-effective or whether the benefits exceed the costs, drawing upon the findings of outcome evaluation. Economic evaluation is also used with a formative purpose during the program design stage to compare different potential options, using modelling of the likely outputs and outcomes. This is referred to as ex ante evaluation.

Economic evaluation stands between program evaluation and economic appraisal, and the concepts and terms are sometimes used differently in the two fields. These differences are set out in a recent paper by the Productivity Commission.

The main forms of economic evaluation used in program evaluation are:

efficiency analysis
cost-effectiveness analysis
cost-benefit analysis.

Each form of economic evaluation relies on costing or valuation studies to assign monetary costs to the range of program inputs. The forms differ in what they compare these costs with: measures of outputs, outcomes or monetised benefits.

Efficiency analysis

Efficiency analysis focuses on the relationship between inputs and outputs. It can bring useful insights into delivery processes and point to opportunities for cost-optimisation. For example, a program designed to reduce recidivism through a number of different clinical support models could use cost-efficiency analysis to compare the cost per person assisted for each of the support models.

Efficiency analysis can explore the factors associated with these differences in costs and establish benchmarks to monitor future costs for different delivery situations.
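The cost-per-person comparison described above can be sketched in a few lines of Python. This is a minimal illustration only: the support model names and all figures are hypothetical and are not drawn from any real program.

```python
# Hypothetical figures for illustration only; not drawn from any real program.
support_models = {
    "model_a": {"total_cost": 120_000.0, "people_assisted": 80},
    "model_b": {"total_cost": 150_000.0, "people_assisted": 120},
}


def cost_per_person(total_cost: float, people_assisted: int) -> float:
    """Return the cost per person assisted (an input-to-output ratio)."""
    if people_assisted <= 0:
        raise ValueError("people_assisted must be positive")
    return total_cost / people_assisted


# Unit cost for each support model, keyed by model name.
unit_costs = {
    name: cost_per_person(**figures) for name, figures in support_models.items()
}

for name, cost in unit_costs.items():
    print(f"{name}: ${cost:,.2f} per person assisted")
```

In this made-up case the model with the higher total cost has the lower cost per person, which is the kind of difference an efficiency analysis would then try to explain.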

Cost-effectiveness analysis

Cost-effectiveness analysis extends the analysis to intended outcomes. It is used where the outcomes are not readily measurable in monetary terms, for example in areas of:

health
education
social welfare.

It can be used to:

compare the cost-effectiveness of different programs with the same outcomes, or
determine the most cost-effective delivery options within the same program.

For example, a program designed to reduce recidivism through a number of different clinical support models could use cost-effectiveness analysis to compare the cost of service delivery to the reduction in recidivism for each of the support models.
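As a rough sketch of the recidivism example, the Python below relates delivery cost to the estimated number of reoffences avoided for each support model. The class, field names, rates and costs are all hypothetical. The baseline rate is an assumption standing in for whatever estimate an outcome evaluation would supply.

```python
from dataclasses import dataclass


@dataclass
class SupportModel:
    name: str
    delivery_cost: float   # total cost of service delivery (dollars)
    participants: int      # people who received the support model
    baseline_rate: float   # assumed recidivism rate without the model
    observed_rate: float   # recidivism rate observed among participants


def cost_per_reoffence_avoided(model: SupportModel) -> float:
    """Delivery cost divided by the estimated number of reoffences avoided."""
    avoided = (model.baseline_rate - model.observed_rate) * model.participants
    if avoided <= 0:
        raise ValueError(f"{model.name}: no measured reduction in recidivism")
    return model.delivery_cost / avoided


# Hypothetical figures for illustration only.
models = [
    SupportModel("model_a", 200_000.0, 100, 0.40, 0.30),
    SupportModel("model_b", 300_000.0, 100, 0.40, 0.20),
]

ratios = {m.name: cost_per_reoffence_avoided(m) for m in models}
most_cost_effective = min(ratios, key=ratios.get)
```

With these invented numbers the dearer model is the more cost-effective one, because it avoids twice as many reoffences for 1.5 times the cost.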

Cost benefit analysis

Cost benefit analysis is the most comprehensive of the economic appraisal techniques. It quantifies in money terms all the major costs and benefits of a program to see if the benefits exceed the costs, and if so, by how much (shown as a ratio of benefits to costs). It compares the present value of the program's costs with the present value of its benefits, using a discount rate to convert future costs and benefits to today's values. The difference between the two is the net present value (NPV).
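The discounting arithmetic can be illustrated with a short Python sketch. The yearly cost and benefit streams and the 5% discount rate are assumptions chosen for the example, not recommended values. Agencies typically prescribe the discount rate to use.

```python
def present_value(amounts: list[float], discount_rate: float) -> float:
    """Discount a stream of yearly amounts (year 0 first) to today's value."""
    return sum(
        amount / (1 + discount_rate) ** year for year, amount in enumerate(amounts)
    )


# Hypothetical three-year program; all figures are illustrative assumptions.
costs = [100_000.0, 20_000.0, 20_000.0]
benefits = [0.0, 80_000.0, 80_000.0]
rate = 0.05  # assumed discount rate

pv_costs = present_value(costs, rate)
pv_benefits = present_value(benefits, rate)

net_present_value = pv_benefits - pv_costs    # positive if benefits exceed costs
benefit_cost_ratio = pv_benefits / pv_costs   # above 1 if benefits exceed costs

print(f"Net present value: ${net_present_value:,.0f}")
print(f"Benefit-cost ratio: {benefit_cost_ratio:.2f}")
```

Because benefits in this example arrive later than most of the costs, a higher discount rate would reduce the ratio, so the result should be tested against more than one rate.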

Cost benefit analysis is more readily applied to programs:

producing outputs that generate revenue (for example water supply and electricity), or
where the major benefits can be quantified fairly readily (for example roads).

One form of cost benefit analysis that is being used more commonly is a measure of the social return on investment.

Synthesising evidence into an evaluative judgment

In any type of evaluation it is important to bring together all the relevant data and analysis to answer each evaluation question. It is rare to base the overall evaluative judgment on a single performance measure. It usually requires synthesising evidence about performance across different dimensions.

Methods of synthesis include:

weighted scale
global assessment scale or rubric
evaluative argument.

A weighted scale is where a percentage of the overall performance rating is based on each evaluative criterion. However, a numeric weighted scale often has problems, including arbitrary weights and lack of attention to essential elements.
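A minimal Python sketch of a numeric weighted scale follows. The criteria, weights and ratings are hypothetical, chosen only to show how the arithmetic works and how it can hide a weak result on an essential criterion.

```python
# Hypothetical criteria, weights and ratings (1 = poor, 5 = excellent).
weights = {"reach": 0.3, "quality": 0.5, "cost": 0.2}
ratings = {"reach": 4, "quality": 2, "cost": 5}


def weighted_score(ratings: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-criterion ratings into one score using percentage weights."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if set(ratings) != set(weights):
        raise ValueError("ratings and weights must cover the same criteria")
    return sum(ratings[criterion] * weight for criterion, weight in weights.items())


overall = weighted_score(ratings, weights)
print(f"Overall weighted rating: {overall:.1f} out of 5")
```

Here the overall rating sits above the midpoint of the scale even though quality, the most heavily weighted criterion, is rated poorly. This is one way a weighted scale can give too little attention to an essential element.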

A global assessment scale or rubric can be developed with intended users and then used to synthesise evidence transparently. A rubric sets out clear criteria and standards for assessing different levels of performance. The scale must include a label for each point (for example, "unsuccessful," "somewhat successful," "very successful") and a description of what each of these looks like.

A more general method of synthesis is evaluative argument. Evaluative argument is a reasoned approach to reaching conclusions about specific evaluation questions by weighing up the strength of evidence in line with the program theory to show the causal links, and describing the degree of certainty of these conclusions.

A related method is contribution analysis, which is a systematic approach to developing a contribution story where a counterfactual has not been used. Evaluative argument is suited to synthesising the findings of more complicated programs, where there are either:

a number of external factors to consider, or
a mix of evidence and different degrees of certainty.

Research design issues

Research design refers to when and how data will be collected to address key evaluation questions. Research design is critical to the rigour of the findings, and the feasibility of the methods for data collection. Two key issues are:

sampling
timing.

Sampling

In some cases it may not be possible, desirable or appropriate to collect data from all sites, all people and all time periods. In these cases a systematic approach to sampling may be needed, so that findings from the sample can be appropriately generalised. For outcome evaluations, the sample needs to be large enough for the results to be statistically valid. Power calculations provide an indication of the minimum sample size needed to assess the impact of a program.
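As an illustration of a power calculation, the Python sketch below uses the standard normal approximation for comparing the means of two equally sized groups. The function name and the default significance level and power are assumptions made for the example. A real evaluation would choose a method suited to its outcome measure and design, usually with statistical advice.

```python
import math
from statistics import NormalDist


def sample_size_per_group(
    effect_size: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Approximate minimum participants per group for a two-group comparison.

    effect_size is the expected difference between group means divided by
    the standard deviation of the outcome (a standardised effect size).
    Uses the normal approximation with a two-sided significance test.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for d in (0.2, 0.5):
    print(f"Effect size {d}: about {sample_size_per_group(d)} per group")
```

The required sample grows quickly as the expected effect shrinks, which is why the likely size of the program's effect needs to be considered before data collection is planned.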

Timing

An important aspect of research design is when data are collected. Approaches include:

Snapshot – collecting data at one point in time. It doesn't allow for analysis of changes over time, except by asking people to report these retrospectively.
Before and after – comparing baseline data to a later stage, such as health indicators before and after treatment, or program performance measures before and after a policy change. While this can provide evidence that a change has occurred, by itself it doesn't answer questions about the effect of a program. Without a comparison or counterfactual, there is no way of knowing whether changes would have occurred anyway.
Time series – collecting data at multiple points over time.
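The limitation of a before-and-after comparison can be shown with a small Python sketch. It contrasts the simple change in a program group with a difference-in-differences estimate, which is one possible way of using a comparison group and is not prescribed by this guidance. All figures are hypothetical. The estimate relies on the assumption that, without the program, both groups would have followed the same trend.

```python
def before_after_change(before: float, after: float) -> float:
    """Change in an outcome measure between baseline and follow-up."""
    return after - before


def difference_in_differences(
    program_before: float,
    program_after: float,
    comparison_before: float,
    comparison_after: float,
) -> float:
    """Program group change minus comparison group change.

    Assumes both groups would have followed the same trend without the program.
    """
    return before_after_change(program_before, program_after) - before_after_change(
        comparison_before, comparison_after
    )


# Hypothetical average outcome scores for illustration only.
naive_change = before_after_change(50.0, 62.0)
estimated_effect = difference_in_differences(50.0, 62.0, 50.0, 58.0)

print(f"Before-and-after change in program group: {naive_change}")
print(f"Change after subtracting comparison group trend: {estimated_effect}")
```

In this invented example most of the improvement also appears in the comparison group, so the before-and-after figure alone would overstate the program's effect.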

Ethical and cultural issues

Ethics in program evaluation refers to the potential risk of harm to people participating in the evaluation, either informants or evaluators. The types of harm vary and include:

loss of privacy or benefits to program participants
damage to vulnerable groups
physical or mental harm to informants or researchers.

Ethics in program evaluation comes under the broader topic of ethics in human research.

The Australasian Evaluation Society has produced Guidelines for the Ethical Conduct of Evaluation.

The potential risk of harm varies with different evaluation designs, and is an important consideration for the quality and ethics of the evaluation project.

During the evaluation design step, it is critical to identify:

whether external ethics review is required, and whether the agency has a policy or guidelines relating to external ethics review
whether vulnerable or culturally distinct groups are involved
whether linked data is involved, with different consent and privacy issues.

An application for an external ethics review can involve substantial work and time.

Research involving animal or human participants requires approval from a recognised ethics committee.

Data collection may be considered "continuous improvement" rather than research, in which case it may not require external ethics approval.

It is always important to consider the potential benefits, as well as how any ethics approval process will impact the cost and timeframe.

A related issue is the cultural appropriateness of an evaluation, particularly for services and programs for minority or vulnerable groups such as Aboriginal people or refugees. For example, there are guides and standards for involving Aboriginal communities in research projects that cover:

community participation
culturally appropriate methods
providing suitable feedback to the community.

You should look at guides and standards within your agency and from peak organisations in the relevant policy field. You should also consider engaging an Aboriginal or culturally and linguistically diverse (CALD) consultant to be involved in the evaluation at an appropriate level, including:

evaluation design
planning
data collection
facilitating or co-facilitating consultations.

Taking account of cultural issues can influence the rigour and feasibility of an evaluation project. There may be additional logistics, such as using interpreters, or time spent working face-to-face with remote communities.

Sources of advice for evaluations to meet ethical and cultural standards for working with Aboriginal communities include: