A significant issue – how do we know when results are reliable?

In this blog, the EEF’s head of research, Danielle Mason, explores the challenge of answering the question we’re often asked of our trials, “how reliable is this result?”, and our current thinking on the issue of ‘statistical significance’…

Blogs • 3 minutes • 22 August 2018

Our aim at the EEF is to make educational evidence both accessible and easy-to-understand, so that teachers and senior leaders can use it to support their decision making.

That’s why we created our Teaching and Learning Toolkit, and its Early Years companion, which summarise educational research from around the world. For over 30 educational approaches, the Toolkit presents the average impact, the estimated cost, and the strength of the supporting evidence, all on a single headline page.

The problem with ‘statistical significance’

If you’re interested in how evidence can support decision making in schools, then you may have come across the term ‘statistical significance’ when using education evidence. Many researchers use the idea of statistical significance to try to quantify the statistical uncertainty around impact estimates, such as EEF estimates of the impact of interventions on literacy or numeracy.

Though ‘statistical significance’ is a widely-used term, its use divides the research community (as discussed here). Depending on who you ask, statistical significance is an essential part of impact evaluation, just one aspect of a broader picture, or a meaningless and misleading concept which should be abolished altogether!

Such a lack of consensus among researchers is tricky for those of us in the business of trying to make evidence accessible and easy-to-understand for busy teachers who simply want to know “how reliable is this result?”

We wanted to set out, therefore, our current thinking on the issue of ‘statistical significance’, which we do in this document. [6 Feb 2020 update: please note, the EEF has subsequently issued a new document, ‘Statement on statistical significance and uncertainty of impact estimates for EEF evaluations’, which supersedes and builds on this previous document.]

How we report uncertainty in EEF trials

Understanding the uncertainty around evaluation findings is essential because not all findings from EEF trials are equally secure.

However, statistical uncertainty (which is what ‘statistical significance’ is designed to assess) is not the only factor we need to take into account when deciding how reliable a trial’s evaluation results are.

Other factors, like the size, design and implementation of a project evaluation, will affect the security of the results and therefore the degree of confidence we have that what is being reported is an accurate estimate of the project’s impact.

Early in the EEF’s work, therefore, we decided it was important to capture this range of factors in a single measure for teachers and senior leaders wanting to make decisions based on our evidence.

This is why we present – for all EEF-funded projects – our ‘padlock’ security rating. This six-point scale is designed to take into account the wider range of factors which affect security. Some of these are factors that influence statistical significance. Others (such as sources of bias) are not related to significance, but can be just as important in assessing whether a result is secure.

It is this ‘padlock’ rating which is used consistently when reporting all EEF trials: we do not require evaluators to test for statistical significance.

All EEF trials are independently evaluated, however, so if our independent evaluators provide a test of statistical significance as a way of assessing the level of uncertainty in an EEF trial, this will always be published in the evaluation report.

Conclusion: a 6‑point summary

If you’d like to know more about how the EEF’s ‘padlock’ security rating and statistical significance are related, you can read this more detailed discussion. [6 Feb 2020 update: please note, the EEF has subsequently issued a new document, ‘Statement on statistical significance and uncertainty of impact estimates for EEF evaluations’, which supersedes and builds on this previous document.]

Otherwise, I hope this bullet point summary is as accessible and easy-to-understand as the EEF always aims to be!

Randomised Controlled Trials are a rigorous way to test whether an education intervention can improve attainment.
But even the result of a Randomised Controlled Trial is always an estimate: there will always be uncertainty around the precise size of the educational impact.
We ask all of our independent evaluators to take account of this uncertainty in their reports. Some do this by considering ‘statistical significance’, and in some cases they use this to assess whether the intervention had an impact. The EEF publishes ‘statistical significance’ tests if the evaluator provides them.
However, the EEF does not use ‘statistical significance’ tests alone to assess whether an intervention had an impact.
The EEF’s ‘padlock’ security rating is designed to provide a user-friendly way to think about the overall security of the result from each EEF trial, using a scale from zero to five.
Along with many others in the research sector, we are continuing to think about the best way to present uncertainty, and the role that statistical significance should play in this, if any.
