Education Endowment Foundation: Synthetic data for the EEF Archive

Read more about the synthetic data project utilising the EEF data archives

What is synthetic data?

Synthetic data is an artificial copy of a real dataset. It is made to follow the structure and some of the patterns of the original dataset, and it preserves plausible values. For example, someone’s height in a synthetic dataset will never be 100 metres, and ages in a secondary school dataset will range from 11 to 16. Synthetic data should not allow for the identification of individuals present in the original data. In summary, synthetic data reveals very little, if anything, about individuals in the original dataset, but is designed to represent the dataset as a whole well.

Applications of synthetic data in research include gaining insight into the data structure before accessing the original data in the Office for National Statistics Secure Research Service (ONS SRS), preparing and testing cleaning and analysis code and, in some cases, exploring relationships in the data. For the purposes of the EEF archive, synthetic data enables more informed data archive proposals and allows researchers to progress their work whilst applications for access to the original data are pending.

Low-fidelity vs high-fidelity synthetic data

Synthetic data can be created with varying degrees of fidelity to the original dataset, ranging from low to high. In simple terms, fidelity describes how closely the artificial data mimics the real data.

Low-fidelity data maintains data types and plausible values but does not maintain relationships between variables. High-fidelity data also maintains some relationships between variables. Creating higher-fidelity synthetic data, and deciding on the nature and level of relationships to preserve, involves a trade-off: higher fidelity carries a greater risk of accidental identification of real individuals.
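
The distinction can be illustrated with a minimal Python sketch. It is not the method used for the EEF archive; the function names, the made-up pupil records and the choice to draw each column independently (whole numbers uniformly between the observed minimum and maximum, other values from those observed) are all assumptions made for illustration. Because every column is generated on its own, data types and plausible ranges survive, but the relationship between age and score in the original does not.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def synthesise_low_fidelity(rows, seed=0):
    """Hypothetical low-fidelity generator: each column is drawn independently.

    Integer columns are drawn uniformly between the observed minimum and
    maximum; other columns are drawn from the observed values. Types and
    plausible ranges are kept, relationships between columns are not.
    """
    rng = random.Random(seed)
    synthetic = [{} for _ in rows]
    for column in rows[0]:
        values = [row[column] for row in rows]
        is_integer = isinstance(values[0], int)
        for record in synthetic:
            if is_integer:
                record[column] = rng.randint(min(values), max(values))
            else:
                record[column] = rng.choice(values)
    return synthetic


# Made-up "original" data in which score depends strongly on age.
demo_rng = random.Random(1)
original = []
for _ in range(500):
    age = demo_rng.randint(11, 16)
    original.append({
        "age": age,
        "score": 10 * age + demo_rng.randint(-5, 5),
        "sex": demo_rng.choice(["F", "M"]),
    })

synthetic = synthesise_low_fidelity(original)

# The original correlation is strong; the synthetic one should be near zero.
print(pearson([r["age"] for r in original], [r["score"] for r in original]))
print(pearson([r["age"] for r in synthetic], [r["score"] for r in synthetic]))
```

A higher-fidelity generator would model some of the joint structure between columns instead of treating them independently, which is where the identification risk described above comes from. Even this simple sketch reuses observed values for non-numeric columns, so a real tool would need further safeguards for rare values.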

Synthetic data and the EEF archive

To make it easier for researchers to use the EEF archive and to support efficient access applications, the EEF commissioned the Behavioural Insights Team (BIT) to create low-fidelity synthetic versions of the datasets held in the EEF Data Archive. You can find these appended to individual project pages in our data catalogue, under the “Resources” tab. An easy-to-use tool showing how this synthetic data was created is available on GitHub.

In addition, a team at Nesta conducted a study into the feasibility and utility of creating higher-fidelity synthetic data for the EEF archive. Read more about the findings of this study here.

This work was funded by a grant from the Evaluation Task Force (Cabinet Office/HM Treasury) and its Evaluation Accelerator Fund (EAF).