Education Endowment Foundation: Evaluation glossary

Evaluation glossary

Key evaluation terms and their definitions.

Attrition, also known as dropout, occurs when participants fail to complete a post-test or leave a study after they have been assigned to an experimental group. It can lead to a biased estimate of the effect size because those that drop out are likely to be different from those that stay in. For example, less motivated pupils or schools might be more likely to drop out of a treatment. A technique used to counter this potential for bias is “intention to treat” analysis, where even those that drop out of the treatment are included in the final analysis.

A study is biased if its impact estimate differs systematically from the true impact. Such bias can arise from weaknesses in the design or implementation of the evaluation.

For example, bias can be introduced if participants themselves decide whether to join the treatment or control groups. This ability to “self-select” could mean that schools with a particularly proactive head teacher or lots of funding make their way into the treatment group, while schools with less motivated head teachers or less money will end up in the control group. When this happens, differences in the outcomes of the two groups may be due to these pre-existing features (e.g. more money or more proactive head teachers) not the intervention, and the estimate of the effect size will suffer from bias.

There are many other potential sources of bias, including measurement bias, which is avoided by ‘blinding’ test delivery and marking, and attrition, which is discussed above.

Blinding is where information about the assignment of participants to their experimental group (e.g. control or treatment) is concealed from the evaluator, the participants, or other people involved in the study until it is complete.

Blinding can be introduced at various points in an evaluation. In EEF-funded evaluations the following are blinded:

  • Randomisation. The person carrying out the randomisation does not know any information that could be used to identify the participants being randomised.
  • Analysis. The person carrying out the analysis does not know any information that could be used to find out which participants are in which experimental group.
  • Provision of the test. Ideally, the person administering the test does not know whether participants are in the treatment or control group.
  • Marking. The marker of the tests does not know whether the test paper belongs to a pupil from the treatment or control group.

Failure to blind can introduce bias. For example, an exam marker may behave differently if they know that their examinees are receiving an intervention. If they do not like the intervention, they may subconsciously mark the intervention group lower than the control group. Even if a marker does their best to remain fair and objective, their own preconceptions of an intervention can still affect their marking and introduce bias, without them realising.
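
As a rough illustration of blinding at the randomisation stage, the sketch below (hypothetical Python, with made-up pupil names) has the school hold the key linking names to codes, so the person carrying out the randomisation only ever sees anonymised identifiers:

```python
import random
import secrets

# Hypothetical pupil names, held by the school and never shared with the evaluator.
pupil_names = ["Alice", "Bilal", "Carmen", "Dev"]

# The school keeps the key linking anonymised codes back to names.
id_key = {secrets.token_hex(4): name for name in pupil_names}

# The person carrying out the randomisation sees only the anonymised codes.
codes = list(id_key)
random.shuffle(codes)
half = len(codes) // 2
allocation = {code: "treatment" for code in codes[:half]}
allocation.update({code: "control" for code in codes[half:]})
print(allocation)
```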

Sometimes called a “comparison group”, this group does not receive the intervention being evaluated and allows the evaluator to estimate what would have happened if the treatment group had not received the intervention. The control group should be as similar to the treatment group as possible before the intervention is applied. This can be achieved through random assignment or, if randomisation is not possible, matching. There are several types of control group:

  • ‘Business-as-usual’ control group, which does not receive any intervention and continues to operate as usual.
  • Waitlist control group, which receives the intervention being evaluated at a later date.
  • Active control group, which receives a different intervention.

Where possible, EEF’s evaluators ensure that a permanent control group is in place so that the long-term impact of an intervention can be estimated.

The outcome for the treatment group if it had not received the intervention is called the counterfactual. If a control group is constructed correctly, it can be used to estimate the counterfactual.

An efficacy trial tests whether an intervention worked under ideal conditions.

In practice, EEF efficacy trials aim to test whether an intervention worked under developer-led conditions (with the intervention developer closely involved in delivery) in a number of schools, usually fifty or more. A quantitative impact evaluation is used to assess the impact of the intervention on student outcomes, including attainment. An implementation and process evaluation is used to understand how different aspects of the intervention and its implementation can contribute to successful outcomes. An indicative cost of the intervention is also calculated.

An effectiveness trial tests whether an intervention worked under real-world conditions.

In practice, EEF effectiveness trials aim to test a scalable model of an intervention under everyday conditions (where the developer cannot be closely involved in delivery because of the scale) in a large number of schools, usually 100 or more and usually across at least three different geographical regions. A quantitative impact evaluation is used to assess the impact of the intervention on student outcomes, including attainment. An implementation and process evaluation is used to understand how different aspects of the intervention and its implementation can contribute to successful outcomes at scale and in varying contexts. The cost of the intervention at this scale is also calculated.

An effect size is an estimate of the size and direction of a change caused by an intervention. It is calculated by dividing the difference between the mean scores of the intervention and control groups by the standard deviation of the scores (a measure of how spread out they are).
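
A common way to express this is the standardised mean difference (a sketch; EEF analyses typically use a variant such as Hedges' g, with adjustments for clustering):

```latex
\text{effect size} = \frac{\bar{X}_{\text{intervention}} - \bar{X}_{\text{control}}}{SD_{\text{pooled}}}
```

where SD_pooled is the pooled standard deviation of the outcome scores across the two groups. An effect size of 0.2, for example, indicates that the intervention group scored, on average, 0.2 standard deviations higher than the control group.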

A research design where the treatment and control groups are identical before the intervention is applied. This is usually achieved through random assignment and allows the evaluator to assume that any change in outcomes is due to the intervention, not any pre-existing characteristics.

Describes the extent to which the results of an evaluation apply to another context. For example, a study which finds that an intervention is effective in primary schools may have poor external validity in secondary schools, because the children will be at a different level of educational development.

Refers to whether an intervention is being implemented as intended by the developer. If there is low fidelity (teachers, students or schools do not follow the programme closely), it is difficult to know whether an intervention is effective or not.

Sometimes called “observer effects”, the Hawthorne effect is the phenomenon where participants change their behaviour due to the knowledge that they are being studied. For example, children’s behaviour may improve or a teacher might work harder when an evaluator is observing the lesson. The presence of Hawthorne effects can lead to biased estimation of the effect size. One way of avoiding the Hawthorne effect is to have an active control group.

A project’s impact is the difference between the outcomes achieved by the children who received the intervention and the outcomes of those who did not receive it. Impact evaluation is concerned with identifying the magnitude of this difference (the effect size) and therefore requires quantitative research.

Intention-to-treat (ITT) analysis can prevent non-compliance and attrition from biasing a study. Analysis is carried out on the groups as they were formed immediately after randomisation, regardless of what happened afterwards. For example, if a participant in the intervention group does not comply with the programme, they are still included in the final analysis as if they had received the intervention. ITT analysis prevents bias from creeping into the analysis and gives a credible estimate of how effective the intervention is in a real-world setting.
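
A minimal sketch of the idea, using hypothetical data (pandas assumed; the column names are illustrative only):

```python
import pandas as pd

# Hypothetical trial data: every randomised pupil appears once, with the group
# they were assigned to and whether they actually received the intervention.
pupils = pd.DataFrame({
    "assigned_group": ["treatment", "treatment", "control", "control"],
    "received_intervention": [True, False, False, False],  # one non-complier
    "post_test_score": [68, 55, 60, 57],
})

# Intention to treat: compare the groups as randomised, ignoring compliance.
treated_mean = pupils.loc[pupils["assigned_group"] == "treatment", "post_test_score"].mean()
control_mean = pupils.loc[pupils["assigned_group"] == "control", "post_test_score"].mean()
print(f"ITT estimate of the raw score difference: {treated_mean - control_mean:.1f}")
```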

A study has internal validity if the estimate it produces is unbiased.

Any programme, policy or practice being evaluated.

A review of the academic literature on a particular topic.

A method used to construct a comparison group, matching allows evaluators to control for characteristics such as attainment, age, or family income.

Matching is often used to create a control group when randomisation is impossible. Participants in the treatment group are matched to others who are not receiving the treatment according to characteristics thought to be relevant to the outcome measured by the evaluation. For example, pupils receiving an intervention can be matched with a similar group of pupils who have not received it using the National Pupil Database, which holds information on the background characteristics and prior attainment of all pupils in England.

Matching allows the evaluator to assume that any differences in the post-test are not due to pre-existing differences in the matched characteristics. For example, if pupils are matched on their previous attainment, it is reasonable to assume that previous attainment will not account for differences between the primary outcomes of the experimental groups. However, matching can only be done on observable characteristics. Some characteristics are unobservable (e.g. genetic predisposition, or the interaction between family, environment and pupil) and cannot easily be taken into account when matching.
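
As a rough sketch of matching on a single observable characteristic (hypothetical pupils and scores; greedy nearest-neighbour matching is only one of several possible methods):

```python
import pandas as pd

# Hypothetical pupils: those who received the intervention, and a larger pool
# of pupils who did not, with a prior-attainment score to match on.
treated = pd.DataFrame({"pupil": ["T1", "T2", "T3"], "prior_score": [45, 60, 72]})
pool = pd.DataFrame({"pupil": ["C1", "C2", "C3", "C4", "C5"],
                     "prior_score": [40, 47, 58, 71, 90]})

# Greedy nearest-neighbour matching on prior attainment, without replacement:
# each treated pupil is matched to the closest unused pupil from the pool.
matches = []
available = pool.copy()
for _, t in treated.iterrows():
    idx = (available["prior_score"] - t["prior_score"]).abs().idxmin()
    matches.append((t["pupil"], available.loc[idx, "pupil"]))
    available = available.drop(idx)

print(matches)  # [('T1', 'C2'), ('T2', 'C3'), ('T3', 'C4')]
```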

Matching can also be used within an RCT to ensure that the groups are balanced. For example, participants can be paired on the basis of prior attainment, with one pupil from each pair randomly assigned to the treatment group and the other to the control group.
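
A minimal sketch of matched-pairs random assignment (hypothetical prior-attainment scores):

```python
import random

# Hypothetical pupils with prior-attainment scores.
pupils = [(f"pupil_{i:02d}", score)
          for i, score in enumerate([41, 43, 50, 52, 60, 61, 70, 72], start=1)]

# Sort by prior attainment and form adjacent pairs.
pupils.sort(key=lambda p: p[1])
pairs = [pupils[i:i + 2] for i in range(0, len(pupils), 2)]

# Within each pair, randomly send one pupil to treatment and the other to control.
assignment = {}
for pair in pairs:
    treated, control = random.sample(pair, 2)
    assignment[treated[0]] = "treatment"
    assignment[control[0]] = "control"

print(assignment)
```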

A meta-analysis is the systematic analysis of several pre-existing studies of one intervention in order to produce a quantitative estimate of effect size. Meta-analyses also use the techniques of systematic review to decide which studies are included in the analysis. By combining several studies, the evaluator can gain a more accurate estimate of an intervention’s impact.
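
A minimal sketch of the pooling step, using a simple fixed-effect (inverse-variance weighted) average over hypothetical study results; real meta-analyses typically also consider random-effects models and heterogeneity between studies:

```python
# Hypothetical effect sizes and standard errors from three studies of one intervention.
studies = [
    {"name": "Study A", "effect": 0.15, "se": 0.08},
    {"name": "Study B", "effect": 0.25, "se": 0.10},
    {"name": "Study C", "effect": 0.05, "se": 0.06},
]

# Fixed-effect pooling: weight each study by the inverse of its variance,
# so more precise studies contribute more to the combined estimate.
weights = [1 / s["se"] ** 2 for s in studies]
pooled_effect = sum(w * s["effect"] for w, s in zip(weights, studies)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5

print(f"Pooled effect size: {pooled_effect:.2f} (SE {pooled_se:.2f})")
```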

A study where the assignment of participants to the treatment and comparison groups is not controlled by the evaluator.

The students, teachers or schools taking part in the trial.

Pilot studies are conducted to refine an intervention that is at an early or exploratory stage of development. Pilots usually run in a small number (three or more) of schools and are used to establish an intervention’s feasibility. Qualitative research is used to develop and refine the approach and test its feasibility in schools, and initial indicative data is collected to assess its potential to raise attainment.

The test or exam taken after the intervention, which provides the data used to establish an effect size. Tests should ideally be administered under exam-like conditions, so that pupils cannot be helped, and marked by someone who is ‘blind’ to the group allocation.

The power of a study refers to how likely it is to detect an effect of a given size, if one genuinely exists. Before starting a study, evaluators estimate the effect size they expect to find. They use this figure to undertake power calculations and estimate the sample size required for an adequately powered study.
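
As an illustration, the sketch below (statsmodels assumed; the numbers are hypothetical) estimates the sample size needed per group for a simple two-arm, individually randomised trial:

```python
from statsmodels.stats.power import TTestIndPower

# Sketch of a power calculation for a simple two-arm trial with individual
# randomisation (a real trial randomising whole schools would also need to
# account for clustering, which increases the required sample size).
n_per_group = TTestIndPower().solve_power(
    effect_size=0.20,  # smallest effect size the study should detect
    power=0.80,        # desired probability of detecting it
    alpha=0.05,        # significance level
)
print(f"Pupils needed per group: {n_per_group:.0f}")
```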

A test that is carried out before the intervention is introduced.

The primary outcome is the outcome that determines whether or not an intervention is considered effective. It should be decided before the trial starts and needs to be stated in the trial registration document. The primary outcome in EEF-funded evaluations is usually academic attainment.

Process evaluation seeks to understand how an intervention was implemented and to understand the views of key stakeholders (e.g. teachers, pupils, intervention staff). It often involves both quantitative and qualitative research.

Qualitative research is concerned with description. It attempts to explore, describe or explain the social world using language.

Quantitative research attempts to establish quantities and magnitude. It attempts to explore, describe or explain the social world using a numerical scale.

An impact evaluation design used when an experimental design is not feasible because the evaluators are not able to control assignment to experimental groups. Quasi-experimental designs use statistical techniques to create treatment and control groups that are as close as possible to identical in all respects before the application of the intervention to the treatment group.

Examples of quasi-experimental designs include matched designs and regression discontinuity designs.

Random assignment is an important feature of randomised controlled trials. It means that the allocation of a participant to the treatment or control group is due solely to chance, and not a function of any of their characteristics (either observed or unobserved). If a large enough sample of participants is randomised, the two groups can be expected to be balanced, on average, across all characteristics.
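
A minimal sketch of simple random assignment (hypothetical school identifiers):

```python
import random

# Hypothetical school identifiers.
schools = [f"school_{i:02d}" for i in range(1, 21)]

# Simple random assignment: shuffle the list, then split it in half.
random.shuffle(schools)
half = len(schools) // 2
treatment_group = schools[:half]
control_group = schools[half:]

print(treatment_group)
print(control_group)
```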

The RCT is a type of experimental design where participants are randomly allocated to the treatment and control groups. Random assignment allows the evaluator to assume that there are no prior differences between the two groups that could affect the primary outcome, and any effect size is therefore due to the intervention received by the treatment group.

Random assignment is used to deal with the problem of selection bias, which occurs when the way in which participants are assigned to experimental groups biases the findings of the study. For example, if an evaluator allows schools to volunteer for the treatment group and fills the control group from the pool of schools that did not volunteer, any difference in the primary outcome could be due to pre-existing characteristics and motivation of the schools that volunteered. Schools that volunteered for the treatment group may have more effective teachers or engaged parents, and these features could mean the treatment schools improve at a faster rate than control schools with less effective teachers or less engaged parents.

The RDD is a type of quasi-experimental research design. Participants are assigned to the treatment and control groups on the basis of whether they meet a certain threshold; some fall just below it and others just above. The assumption is that participants just either side of the threshold are very similar, so that which side of the threshold they fall on affects the primary outcome only through receipt of the intervention.

The RDD is best explained with an example. Consider an RDD used to evaluate the impact of a summer school that only accepts pupils who score 60% or higher on an exam. The treatment group is constructed from the pupils who score just above 60% and gain a place on the summer school. The control group is the pupils who score just under 60% and narrowly miss out on a place. The assumption is that pupils just either side of the threshold are very similar, and that crossing the threshold affects their later attainment only through attendance at the summer school. The evaluator therefore has treatment and control groups that are very similar in all respects, which can be used to estimate the impact of the summer school.
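
A rough sketch of how the groups in this example might be constructed (hypothetical scores, pandas assumed; real RDD analyses usually fit a regression either side of the cutoff rather than simply comparing the two groups):

```python
import pandas as pd

# Hypothetical exam scores for applicants to the summer school (entry threshold: 60%).
applicants = pd.DataFrame({
    "pupil": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "exam_score": [40, 57, 58, 59, 60, 61, 62, 85],
})

THRESHOLD = 60
BANDWIDTH = 3  # only compare pupils scoring close to the threshold

near = applicants[(applicants["exam_score"] - THRESHOLD).abs() <= BANDWIDTH]
treatment = near[near["exam_score"] >= THRESHOLD]   # just gained a place
control = near[near["exam_score"] < THRESHOLD]      # just missed out

print(treatment["pupil"].tolist(), control["pupil"].tolist())
```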

EEF reports follow the CONSORT reporting guidelines and include attrition rates and other potential sources of bias. Results are reported on an ‘intention to treat’ basis, where outcomes from all participants, including those who dropped out, are included. The effect size and the confidence interval around it are reported alongside commentary on how the findings fit with the existing evidence.

In addition, the EEF provides a judgement of the security of the findings, based on five factors that could influence how reliable they are.

The number of participants in the study.

A synthesis of the research evidence on a particular topic, which uses strict criteria to exclude studies that do not fit certain methodological requirements. Systematic reviews that provide a quantitative estimate of an effect size are called meta-analyses.

The group of pupils, classes or schools that receive the intervention.