Attrition, also known as dropout, occurs when participants fail to complete a post-test or leave a study after they have been assigned to an experimental group. It can lead to a biased estimate of the effect size because those who drop out are likely to differ from those who stay in. For example, less motivated pupils or schools might be more likely to drop out of a treatment. A technique used to counter this potential for bias is “intention to treat” analysis, in which even those who drop out of the treatment are included in the final analysis.
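
The logic of intention-to-treat analysis can be illustrated with a minimal sketch (invented data; Python used purely for illustration): pupils are analysed in the group they were assigned to, whether or not they completed the intervention.

```python
# Illustrative intention-to-treat sketch with invented data.
# Each record: (assigned_group, completed_intervention, post_test_score or None).
pupils = [
    ("treatment", True, 62), ("treatment", True, 58),
    ("treatment", False, 45),        # stopped the intervention but sat the post-test
    ("treatment", False, None),      # dropped out entirely: no post-test score
    ("control", True, 50), ("control", True, 49), ("control", True, 47),
    ("control", False, None),        # dropped out entirely: no post-test score
]

def assigned_group_mean(group):
    # Intention to treat: pupils are analysed in the group they were ASSIGNED to,
    # whether or not they completed the intervention. Pupils with no post-test
    # score at all cannot contribute without further assumptions (e.g. imputation).
    scores = [score for g, _, score in pupils if g == group and score is not None]
    return sum(scores) / len(scores)

itt_estimate = assigned_group_mean("treatment") - assigned_group_mean("control")
print(f"Intention-to-treat estimate of the effect: {itt_estimate:.1f} points")
```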


Averages

Measurements of impact – even those found in the EEF Toolkit – are based on averages, and caution must always be exercised when interpreting data presented as averages. The evidence of the positive impact of feedback interventions (+8 months) is what is called an ‘indicative effect’: it is an indication of the average impact found across the many studies used to reach the finding about feedback’s effect. In some examples of good-quality research the findings are positive; in others there is no impact; in yet others a negative impact is seen.

When using assessment data in school to measure impact, understanding the spread of the data is key. If, for instance, most children in a school achieve scores that are expected for their age and stage, but a small group achieve much lower than expected, these low values may drag the average down and give the impression of overall underachievement. Good practice when looking at any average is therefore to try to understand how it is composed (what the spread of scores that contribute to it looks like).
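
A short sketch with invented scores shows how a small group of low scores can pull the mean down even when most pupils are at the expected level, and why the median and spread are worth looking at alongside the mean.

```python
from statistics import mean, median, stdev

# Invented standardised scores: most pupils around the expected 100,
# plus a small group scoring much lower.
scores = [100, 101, 99, 102, 100, 98, 103, 100, 70, 68, 72]

print(f"mean   = {mean(scores):.1f}")    # dragged down by the low scorers
print(f"median = {median(scores):.1f}")  # still close to the expected 100
print(f"stdev  = {stdev(scores):.1f}")   # large spread flags the low-scoring group
```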


Bias

A study is biased if its impact estimate systematically differs from the real impact. This deviation can usually be traced to weaknesses in the implementation or design of the evaluation.

For example, bias can be introduced if participants themselves decide whether to join the treatment or control groups. This ability to “self-select” could mean that schools with a particularly proactive head teacher or lots of funding make their way into the treatment group, while schools with less motivated head teachers or less money end up in the control group. When this happens, differences in the outcomes of the two groups may be due to these pre-existing features (e.g. more money or more proactive head teachers) rather than the intervention, and the estimate of the effect size will suffer from bias.
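
As a hedged illustration of how self-selection can bias an estimate, the sketch below simulates schools whose underlying “capacity” (funding, proactive leadership) raises both their chance of opting in and their outcomes, while the intervention itself is assumed to have zero true effect.

```python
import random
from math import exp

random.seed(1)

# Simulate 2,000 schools. Higher "capacity" (funding, proactive leadership)
# raises outcomes AND the chance of opting in to the treatment.
# The intervention itself is given a true effect of zero.
schools = []
for _ in range(2000):
    capacity = random.gauss(0, 1)
    opted_in = random.random() < 1 / (1 + exp(-2 * capacity))  # self-selection
    outcome = 50 + 5 * capacity + random.gauss(0, 2)            # no treatment effect
    schools.append((opted_in, outcome))

def mean(values):
    return sum(values) / len(values)

treated = [outcome for opted, outcome in schools if opted]
comparison = [outcome for opted, outcome in schools if not opted]

# The naive difference in means looks like a positive "effect"
# even though the true effect is zero: that gap is the bias.
print(f"Apparent effect under self-selection: {mean(treated) - mean(comparison):.2f}")
```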

There are many other potential sources of bias, including measurement bias, which is avoided by ‘blinding’ test delivery and marking, and attrition, which is discussed above.


Blinding

Blinding is where information about the assignment of participants to their experimental group (e.g. control or treatment) is concealed from the evaluator, the participants, or other people involved in the study until it is complete.

Blinding can be introduced at various points in an evaluation, and EEF-funded evaluations blind several of these stages – for example, the delivery and marking of tests.

Failure to blind can introduce bias. For example, an exam marker may behave differently if they know that their examinees are receiving an intervention. If they do not like the intervention, they may subconsciously mark the intervention group lower than the control group. Even if a marker does their best to remain fair and objective, their own preconceptions of an intervention can still affect their marking and introduce bias, without them realising.
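
One simple way to blind marking (a sketch, not EEF’s actual procedure) is to strip pupil identifiers and group labels and replace them with anonymous codes before scripts are sent to markers, keeping the key separate until marking is complete.

```python
import random

# Hypothetical scripts awaiting marking, each tagged with the pupil's group.
scripts = [
    {"pupil_id": "P001", "group": "treatment"},
    {"pupil_id": "P002", "group": "control"},
    {"pupil_id": "P003", "group": "treatment"},
]

random.seed(42)
random.shuffle(scripts)  # remove any ordering that might hint at group membership

key = {}       # kept by the evaluator, never shared with markers
blinded = []   # what the markers receive
for i, script in enumerate(scripts, start=1):
    code = f"SCRIPT-{i:03d}"
    key[code] = script
    blinded.append({"code": code})

print(blinded)  # markers see only codes: no pupil IDs, no group labels
```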

Confidence intervals

All of the effect sizes produced by impact evaluations are estimates. In EEF-funded evaluation reports, confidence intervals provide the range of values that has a 95% probability of including the real effect size. The width of the confidence interval indicates the confidence we can place in a finding: the wider the interval, the less confidence we can have.

For example, a trial may estimate an effect size of 0.4 with a confidence interval of 0.35 to 0.46. This means there is a 95% probability that the real effect size lies between 0.35 and 0.46. If the confidence interval were wider – 0.05 to 0.6, for example – we would place less confidence in this estimate.
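
A hedged sketch of where such an interval comes from: under a common approximation, a 95% confidence interval is the estimate plus or minus roughly 1.96 standard errors. The standard error below is an assumed value, chosen so the result is close to the interval in the example above.

```python
# Approximate 95% confidence interval for an estimated effect size.
effect_size = 0.4
standard_error = 0.028  # assumed value for illustration

lower = effect_size - 1.96 * standard_error
upper = effect_size + 1.96 * standard_error
print(f"95% CI: {lower:.2f} to {upper:.2f}")  # close to the interval in the example above
```

A larger study generally has a smaller standard error, which is why bigger trials tend to produce narrower intervals and firmer conclusions.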

Control group

Sometimes called a “comparison group”, this group does not receive the intervention being evaluated and allows the evaluator to estimate what would have happened if the treatment group had not received the intervention. The control group should be as similar to the treatment group as possible before the intervention is applied. This can be achieved through random assignment or, if randomisation is not possible, matching. There are several types of control group.

Where possible, EEF’s evaluators ensure that a permanent control group is in place so that the long-term impact of an intervention can be estimated.


Counterfactual

The outcome for the treatment group if it had not received the intervention is called the counterfactual. If a control group is constructed correctly, it can be used to estimate the counterfactual.
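
A minimal sketch (invented pupils and scores) of how a randomly assigned control group stands in for the counterfactual: the control group’s average outcome estimates what the treatment group would have scored without the intervention, so the difference in means estimates the impact.

```python
import random

random.seed(0)

# Invented list of pupils; random assignment makes the two groups comparable.
pupils = [f"pupil_{i}" for i in range(20)]
random.shuffle(pupils)
treatment_group = pupils[:10]
control_group = pupils[10:]

# Invented post-test scores collected after the intervention.
scores = {p: random.gauss(100, 5) for p in control_group}          # estimates the counterfactual
scores.update({p: random.gauss(104, 5) for p in treatment_group})  # group that received the intervention

def group_mean(group):
    return sum(scores[p] for p in group) / len(group)

# The control mean estimates the counterfactual; the difference estimates the impact.
print(f"Estimated impact: {group_mean(treatment_group) - group_mean(control_group):.1f} points")
```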

Diagnostic assessment

A helpful distinction can be made between using assessment to monitor a pupil’s progress, and using it to diagnose a pupil’s specific capabilities and difficulties. Monitoring can be used to identify pupils who are struggling, or whose progress can be accelerated, and diagnostic assessments can suggest the type of support they need from the teacher to continue to progress. When an assessment suggests that a child is struggling, effective diagnosis of the exact nature of their difficulty should be the first step, and should inform early and targeted intervention.

Effectiveness trial

Effectiveness trials aim to test the intervention under realistic conditions in a large number of schools. A quantitative impact evaluation is used to assess the impact on attainment and a process evaluation is used to identify the challenges for delivery at scale. The cost of the intervention at scale is also calculated.

Opportunity cost

Opportunity cost refers to the benefit given up by choosing one activity over another. For example, if a teacher has two hours of teaching time with her Year 11 class each week, the opportunity cost of spending 15 minutes of each lesson on assessment is the time not spent on learning new material.
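
A small worked calculation of the example above; the two one-hour lessons per week and the 39-week school year are assumptions added purely for illustration.

```python
# Opportunity cost of the assessment time in the example above.
minutes_per_lesson = 15
lessons_per_week = 2   # assumed: two one-hour lessons making up the two hours
weeks_per_year = 39    # assumed length of the school year

hours_per_year = minutes_per_lesson * lessons_per_week * weeks_per_year / 60
print(f"Teaching time given up to assessment: {hours_per_year:.1f} hours per year")
```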


Reliability

In the context of assessment, reliability is understood as the consistency with which an assessment performs its function. For example, a highly reliable maths assessment would produce consistent scores for an individual across multiple sittings: if the same child took the same assessment twice in the same day, the outcomes would be broadly similar.

It is good practice, when using the term reliability, to follow it with the phrase ‘…for the purpose of X’. For instance, teachers and school leaders might talk about ‘the reliability of the Year 6 maths exam for the purpose of assessing pupils’ progress in maths during Year 6’.
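
One common way to quantify reliability is test-retest correlation. The sketch below (invented scores) computes the correlation between two sittings of the same assessment by the same pupils; a value close to 1 suggests high reliability for that purpose.

```python
from statistics import correlation  # Python 3.10+

# Invented scores for the same pupils sitting the same assessment twice.
first_sitting  = [55, 62, 48, 71, 66, 59, 80, 45]
second_sitting = [57, 60, 50, 73, 64, 61, 78, 47]

# Pearson correlation between sittings: close to 1 suggests high test-retest reliability.
print(f"Test-retest correlation: {correlation(first_sitting, second_sitting):.2f}")
```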


Validity

In the context of assessment, validity is the degree to which an assessment measures what it intends to measure. For example, a valid assessment of maths would measure how well an individual has understood, say, fractions or decimals, as opposed to how well they can read the words in a long maths word problem.

It is considered good practice, when using the term validity, to follow it with the phrase ‘…for the purpose of X’. For instance, teachers and school leaders might talk about ‘the validity of the Year 6 maths exam for the purpose of assessing pupils’ progress in maths during Year 6’.