EEF Blog: How do we make EEF trials as informative as possible?

A new paper has been published by academics at Loughborough University and the University of York, claiming that some randomised controlled trials (RCTs) are ‘uninformative’. Camilla Nevill, our Head of Evaluation, reflects on their findings.

A key aim of the EEF is to produce evidence to inform teachers and senior leaders. Since we were set up in 2011, the EEF has committed over £110 million to evaluations of 190 education programmes, 150 of which are RCTs, involving more than half the schools in England.

In this new paper [1], the authors, Hugues Lortie-Forgues and Matthew Inglis, have reviewed the headline attainment findings from 82 RCTs commissioned by the EEF, alongside 59 from the US-based National Centre for Education Evaluation and Regional Assistance (NCEE). They conclude that 40% of these estimates are ‘uninformative’ [2] – which they define as being small and imprecise (with a large confidence interval, or a Bayes factor of 3 or less), meaning we cannot say conclusively whether the programme does or does not work.
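As an illustration, the frequentist half of that criterion can be sketched as a simple classification rule on an effect estimate and its standard error. The 0.1 "meaningful effect" threshold below is illustrative only, not the authors' exact cut-off, and the Bayes-factor route is omitted:

```python
def classify(effect, se, meaningful=0.1, z=1.96):
    """Loosely sketch the paper's logic: an estimate is 'uninformative'
    when it is small AND imprecise, i.e. its 95% confidence interval is
    too wide to rule in a meaningful effect or rule out no effect.
    The `meaningful` threshold is illustrative, not the authors' value."""
    lo, hi = effect - z * se, effect + z * se
    if lo > 0 or hi < 0:
        # CI excludes zero: the trial detected an effect.
        return "informative: effect detected"
    if -meaningful < lo and hi < meaningful:
        # CI excludes any meaningful effect: a convincing null.
        return "informative: meaningful effect ruled out"
    # CI spans both zero and the meaningful-effect threshold.
    return "uninformative: small and imprecise"

# A small estimate with a wide interval cannot settle the question:
print(classify(0.03, 0.08))   # uninformative: small and imprecise
# A precise estimate is informative whichever way it points:
print(classify(0.25, 0.05))   # informative: effect detected
print(classify(0.01, 0.02))   # informative: meaningful effect ruled out
```

The point of the sketch is that "uninformative" here is a property of precision relative to a chosen threshold, not a verdict on the programme itself.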

Here I discuss their definition of ‘uninformative’, followed by the three explanations they give for the finding: 

  1. the unreliability of prior evidence, 
  2. difficulties in translating research into practice; and 
  3. the design of the trials themselves.

How informative is ‘uninformative’?

Lortie-Forgues and Inglis only record the headline attainment finding in their analysis. The level of statistical noise around a headline finding is, of course, important, and not to be ignored – I will discuss this shortly. But there are many other types of information generated by these RCTs that are also informative for schools.

Some of the EEF’s early trials were guilty of focusing solely on the headline impact estimate. This is why, in 2016, we published best practice guidance that highlights the importance of also collecting high-quality data on implementation, compliance and causal mechanisms in order to understand why and how programmes do, or do not, work.

In 2014, the EEF expanded its remit to include non-attainment outcomes – such as self-control, social skills, motivation and resilience – that are thought to underpin success in school and beyond. EEF RCTs also frequently capture data on mechanisms of change, such as the home learning environment or teaching practices. Nearly nine out of ten EEF RCTs include at least one secondary outcome. Many of these are likely to achieve much larger (or ‘informative’) effects. But the EEF, and others, believe it is essential that the headline is always as comparable and relevant to teachers and policy-makers as possible, however high the bar.

Lortie-Forgues and Inglis’ definition also hinges on their argument that ‘RCTs which convincingly demonstrate that a given intervention does not work are equally valuable’. The EEF has always said that we want to be able to advise schools what not to spend their money on, too. However, EEF trials usually compare programmes to ‘business as usual’, and occasionally to similar programmes – such as an online alternative – because these are the most relevant comparisons for teachers. But in this scenario, finding no effect could mean that both worked equally well. This is why EEF RCTs also provide information on programme costs and comparison group activity to inform schools’ decision-making.

The fact that, under these conditions, programmes which are popular or appear to have potential are not shown to be substantially better tells us something important – both about previous claims made, and that schools should be cautious about expecting large effects from them.

Finally, there is also a strong case against null hypothesis significance testing and the binary use of confidence intervals to make decisions. Statistical uncertainty is only one factor that can affect whether an estimate accurately reflects the true value in the population.

Interpreting RCTs can be complex. A measure of what is ‘informative’ defined by the precision of the estimate misses many other important elements of security that arguably tell us more about the reliability of the evaluation and its conclusions. This is why the EEF created its padlock rating, designed to summarise in a single scale many sources of bias that could threaten the security of a result, such as attrition and imbalance – important information to help time-poor practitioners engage meaningfully with our findings.

So under a more helpful definition, all EEF trials are informative. Nonetheless, the fact that educational effects are often smaller than previously thought, and harder to detect, is an important insight and challenge emerging from the EEF and NCEE’s work. Why is this happening and what is the EEF doing about it?

Explanation 1: Many of the interventions studied are ineffective because the literature upon which they are based is unreliable

The EEF shares this concern with Lortie-Forgues and Inglis. Many previous evaluations are less reliable than we would like, although they might often be considered ‘informative’ under the authors’ definition. Currently, the impact estimates in our Teaching and Learning Toolkit are based on evidence from many meta-analyses, or ‘reviews of reviews’. These often combine studies of varying quality, on different ages, subjects and even countries. This is why we have commissioned the EEF Education Database, a major initiative involving scores of coders coding the estimated 10,000 individual studies within the Toolkit.

This will enable us to remove low-quality studies, likely reducing impact estimates across the board. It will also enable us to generate more accurate estimates. If the EEF is considering funding an RCT of collaborative learning, we will be able to analyse how impact varies by age and subject in prior studies, to better inform sample sizes. The EEF is constantly responding to the growing body of reliable evidence it generates, which is why EEF RCTs have grown progressively in size since we started.

Lortie-Forgues and Inglis recommend more trial pre-registration and data sharing to address this issue, and note the high standards and transparency that both the EEF and NCEE insist upon. All EEF RCTs are conducted by a member of the EEF’s panel of independent evaluators and registered, and we require a pre-specified protocol and analysis plan to be published on our website. To avoid reporting bias, all of our findings are published, whatever the result, and, on completion, data is transferred to the EEF’s data archive.

The EEF can be proud that nearly 100 RCTs linked to longitudinal outcomes data have been made available for research purposes. Durham University, the EEF’s overarching evaluator, re-analyses all the headline findings. This work has informed the EEF’s analysis guidance, which aims to improve comparability and to reduce the variation in estimates, and in their precision, that arises from different evaluators’ analytical choices.

Explanation 2: Many of the interventions studied are ineffective because they have been poorly designed or implemented

The EEF also shares Lortie-Forgues and Inglis’ concern that translating research insights into effective implementation at scale is challenging. It is difficult to balance the need to slowly design, improve and embed implementation with demand from practitioners and policy-makers for quick results.

The education market is flooded with untested products and programmes. The EEF’s role is also to evaluate them, using experimental designs. However, this can mean fewer success stories.

The EEF recognises this issue, which is why we have published best practice guidance on implementation and have established criteria for programme selection. We have always worked closely with delivery organisations to understand their delivery model and theory of change; but it is certainly the case that some early EEF trials, such as Tutoring with Alphie, could have benefited from more development. For this reason, we are increasingly funding more collaborative and dynamic piloting and design work at all stages of programme development and testing.

The EEF is also prepared to stop poorly implemented trials early, in order to avoid money being wasted; to date, two of our 150 trials have been cancelled. In both cases we published reports (and an accompanying blog) setting out the valuable lessons we had learned, which might also be helpful to other funders and researchers.

There is a balance to be struck between implementation design and testing what is in the system. It is thanks to the constructive and collaborative relationship the EEF has with its delivery and evaluation partners that we are able to tackle this.

Explanation 3: Many of the interventions studied are effective, but these trials were not designed so that their effects could be reliably detected

Finally, Lortie-Forgues and Inglis say, ‘appropriately powering a trial can be challenging’. We wholeheartedly agree with this. There are many competing factors to consider when determining sample size, including prior research, educational importance, cost and the capacity of delivery organisations.

For example, there is often a direct tension between the need for high-quality implementation and sufficient power, with the capacity of the delivery organisation sometimes a limiting factor. Lortie-Forgues and Inglis suggest trials could always be of a sufficient size to detect an effect as small as 0.05. But small effects may not be educationally meaningful, nor cost-effective for expensive programmes, and achieving such precision would often mean sacrificing implementation quality.

To power an education RCT to detect an effect of 0.05, under reasonable assumptions (e.g. 80% power, an intra-cluster correlation (ICC) of 0.1 and 50% of variance explained by covariates), would require upwards of 800 schools and be hugely costly.
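That back-of-the-envelope figure can be checked with the standard minimum detectable effect size (MDES) formula for a two-level cluster-randomised design. The pupils-per-school value (25) and the 2.8 power multiplier (80% power, two-sided alpha of 0.05) are assumptions of this sketch, not numbers from the post:

```python
import math

def schools_needed(mdes, icc=0.1, r2=0.5, pupils_per_school=25,
                   power_multiplier=2.8, p_treat=0.5):
    """Approximate number of schools needed in a two-arm,
    school-randomised trial to detect a standardised effect of `mdes`.

    power_multiplier ~= 2.8 corresponds to 80% power at a two-sided
    alpha of 0.05; `r2` is the share of variance explained by
    covariates at both levels; `pupils_per_school` is an assumed
    value for illustration, not a figure from the blog post.
    """
    pq = p_treat * (1 - p_treat)  # allocation term, 0.25 for a 50:50 split
    # Variance of the effect estimate, per school, from the school-level
    # and pupil-level components, each reduced by covariate adjustment.
    var_per_school = (icc * (1 - r2)) / pq \
        + ((1 - icc) * (1 - r2)) / (pq * pupils_per_school)
    # Solve mdes = power_multiplier * sqrt(var_per_school / J) for J.
    return math.ceil(power_multiplier ** 2 * var_per_school / mdes ** 2)

print(schools_needed(0.05))  # over 800 schools for an effect of 0.05
print(schools_needed(0.2))   # far fewer for a larger target effect
```

Under these assumptions the formula reproduces the "upwards of 800 schools" figure, and shows how quickly the requirement falls as the target effect grows.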

Funding fewer, very large trials would limit the number of programmes we could test for schools and provide less useful results. It is unrealistic to expect a single trial to give a definitive and dichotomous answer to a question. Outcomes will vary under different contexts, conditions, and populations. This is why the EEF is more cautious, testing multiple versions of an approach (such as effective feedback, or phonics) at a smaller scale, under 'best possible' conditions and on targeted populations, before taking them to scale. A body of evidence will always be more potent than a single study, and this is why the EEF’s guidance reports are always based on meta-analyses and systematic reviews.

It is also why the EEF data archive, as it grows, will become a hugely powerful resource for exploring and understanding the drivers of children’s outcomes, across multiple studies, populations and contexts.

Conclusion

Despite its alarmist title, Lortie-Forgues and Inglis’ paper is scientifically useful and constructive. The EEF has long recognised and is responding to the challenges it raises. The evidence base we are helping to generate is becoming more reliable every year thanks to the efforts of grantees bringing forward their projects, schools being willing to trial them, and evaluators independently assessing their impact.

The EEF, and evaluation colleagues, also have an important role to play in communicating, in a balanced and accessible way, the benefits of high-quality research to policy-makers and practitioners. Otherwise there is a real risk of undermining its value. Nevertheless, we welcome the debate generated by papers such as these, and share the authors’ passionate pursuit of RCTs that are as informative as possible.



[1] Lortie-Forgues, H., & Inglis, M. (2019). Rigorous Large-Scale Educational RCTs Are Often Uninformative: Should We Be Concerned? Educational Researcher, published online 28 January 2019.
[2] This figure for ‘uninformative’ RCTs was 57% when first published. Since then, the authors have issued a correction, reducing the figure to 40%.