Do EEF trials meet the new ‘gold standard’?

Two respected American academics have rocked education research in the US with the recent publication of a provocative paper which questions the usefulness of established evaluation methods.

High-quality randomised controlled trials (RCTs) have long been considered the ‘gold standard’ for outcome evaluation across a number of fields, from health to education. In Ginsburg and Smith’s March 2016 paper “Do randomised controlled trials meet the ‘Gold Standard’”, the authors have argued convincingly that the gold standard as defined by the What Works Clearinghouse (WWC) may be flawed. If correct, their findings call into question the value of a huge body of supposedly first-rate evaluation across the field of education.

Since the EEF was set up in 2011 we have commissioned 121 evaluations, 100 of which have been designed as RCTs. We have always maintained that RCTs are the best way of providing useful, comparable results that answer questions time-poor practitioners care about. RCTs are the bedrock of our approach, and a paper like this therefore demands our attention and a considered response.

Ginsburg and Smith’s paper reviews 27 maths trials that meet the WWC quality standards and identifies seven threats to the usefulness of these RCTs. I will look at each of the seven threats in turn. Before I do, I should point out that we have always been conscious that just because an evaluation is an RCT, that does not mean the findings are secure. It is for precisely that reason that the EEF developed its padlock security ratings, which are designed to summarise, in a single scale, a number of possible sources of bias that could threaten the security of a finding. Interpreting RCTs is complex, and trying to communicate this complexity in a single scale is difficult and controversial; but EEF believes that it is vital to address the problem if we are to help time-poor practitioners engage meaningfully with our findings.

1) Developer associated – Of the 27 RCTs reviewed by Ginsburg and Smith, 12 had authors who had an association with the curriculum’s developers.

Conflict of interest is a big problem across health and education. In medicine, most drugs trials are funded by drugs companies, leading to accusations of reporting bias and of negative findings being withheld. A defining feature of EEF’s approach is ‘independent’ evaluation; all of our evaluations are conducted by one of our panel of independent experts. Evaluators are appointed through a competitive tendering process and reviewed for any academic conflict of interest.

We also require a pre-specified protocol for every trial to be published on our website and registered on ISRCTN, a primary clinical trial registry. All of our findings are published, whatever the result.

Insisting upon independent evaluation has been challenging. There has been resistance in some parts of the academic community and we have had two evaluations fall through because the developer would not agree to the independent evaluation. But it has also resulted in some highly successful collaborations and EEF can be proud that we, along with our partners, are paving the way in addressing this threat.

2) Intervention not well-implemented – In 23 of the 27 RCTs, implementation fidelity was threatened because the RCT occurred in the first year of implementation. It may take up to three years to implement a substantially different pedagogical approach.

It is difficult to balance the need for interventions to fully embed with the demand from practitioners and policy makers for results as soon as possible. As a result we have been somewhat guilty of this threat in early trials. One of our first RCTs, of Maths Mastery in primary schools, found a small positive effect in the first year. The effect may have been larger if we had looked at children’s outcomes in the second year, once the programme was settled in schools.

EEF works closely with grantees to understand their delivery model and we are getting better at designing evaluations that accurately reflect interventions’ theory of change. Two of our most recently funded trials focus on the second year of implementation. We are also aware of the need to allow sufficient time after randomisation for deliverers to lay the ground for successful implementation.

Implementation is hard, and this is why we will shortly be publishing our best practice guidance on evaluating implementation. Our role is to test interventions in the system that schools are buying. We test these interventions using pragmatic RCT designs, report on their effectiveness and describe how they were implemented. But we can only do so much; it is then up to the system to respond.

3) Unknown comparison – In 15 of 27 RCTs the comparison activity was not identified or outcomes were reported for a combination of activities.

EEF trials usually compare interventions to ‘business as usual’, because this is the most relevant comparison for our audience of practitioners. But what does that mean? With targeted interventions for struggling children, business as usual is likely to be highly active, and if we see no effect it may be that both control and intervention approaches have worked – the question is then which one is cheaper. There is also the risk of ‘contamination’ and ‘compensation rivalry’ that must be addressed.

The EEF padlock rating specifically identifies ‘insufficient description of the intervention’ as a threat to the validity of a finding. However, some early trials have been guilty of insufficiently describing the control.

We will soon be publishing best practice guidance that highlights the importance of in-depth implementation evaluation to accurately account for control group activity and unpack the RCT ‘black box’.

Of course, the Holy Grail is to conduct multi-armed trials, directly comparing different approaches against each other. There are considerable practical challenges – such as identifying directly comparable interventions, and recruiting schools to complex trials and retaining them – but we should not let these outweigh the considerable potential benefits. EEF is looking for opportunities to do this and has funded about 10 multi-armed RCTs to date.

4) Instruction time greater in treatment than control – In eight out of nine relevant RCTs the treatment time differed substantially from that in the control group.

This is closely related to the threat above. Many academics worry that without a matched time control it is impossible to eliminate the Hawthorne Effect or identify interventions’ effectiveness over additional support. There is also the risk that the control group gets more support, apparently muddying the interpretation of a null result.

But EEF trials are pragmatic and designed to answer questions teachers might ask, such as “should I buy this tutoring programme or continue what I was doing anyway?” That is why our comparison is always business as usual as opposed to an artificial ‘matched time’ control. However, it is still important to document what business as usual looks like and how much an intervention costs, compared to business as usual, so practitioners can make a fair assessment of which path to take.

5) Limited grade coverage – In 19 of 20 relevant RCTs an intervention covering two or more years did not have a longitudinal cohort and could not measure cumulative impact.

All EEF effectiveness trials, and some EEF efficacy trials, have a long-term control group built into the design. At the end of the evaluation all data is transferred to EEF’s archive and pupils are tracked through the National Pupil Database. This means that we can see, for example, whether a numeracy programme in Key Stage 1 has had an impact on a child’s maths GCSE results, and beyond.

We have built longitudinal follow-up into our approach from the start. EEF can be proud that more than 100 RCTs linked to longitudinal data in our archive hold the powerful promise of revealing what really makes a difference to children’s outcomes for years to come.

6) Assessment favours content of the treatment – In five of the 27 RCTs the assessment was designed by the developer.

This is a big problem tending to result in greatly inflated effect sizes. The EEF has strict guidance on test selection for the primary outcome in our trials, the main criteria being that it must be approved by the independent evaluator, have broad external validity and be highly correlated with performance in national high stakes tests. We can be confident that our trials avoid this threat.

7) Outdated curricula – 19 of the 27 RCTs were carried out on outdated interventions or curricula.

It is almost impossible to avoid this issue. EEF works closely with its partners, including schools and the Department for Education, to ensure that all of its trials are policy relevant and address issues that schools face today. But the time it takes to conduct a trial means that, by the time the results are published, there is always a risk that the debate has moved on. The challenge is to design trials that produce results quickly without compromising implementation and long-term follow-up.

EEF welcomes Ginsburg and Smith’s paper, just as we welcome any other well-evidenced appraisal of our approach. It is work like this that helps us to continually and critically assess our methods and keep our evaluation at the cutting edge of education research. While the academics’ thought-provoking paper has highlighted seven threats, we are satisfied that our approach fares well. Where they have revealed areas for improvement, we will be taking action, including amending our best practice guidance and reviewing our padlock security ratings.

RCTs are rarely perfect, but we should not let the best become the enemy of the good. We remain convinced that RCTs provide the best way of producing useful results for schools. Improving the education of disadvantaged children is our common goal. We are in this together and look forward to the continuing debate.
