How to use the Attainment Measures Database

The database includes 231 measures that were identified in the first stage of the review, based on recommendations from the advisory panel of experts, EEF studies and communication, hand searching 18 publisher websites and searches of the ERIC database. For all these measures, minimal information is provided in the database.

Of these, 37 measures were subject to full evaluation using selected questions about implementation utility and psychometric properties (reliability, validity and quality of norms) from the European Federation of Psychologists’ Associations test review model (Evers, Hagemeister, et al., 2013). Summary information about implementation utility, or how easily a test can be administered, can be used to filter the database and is then summarised in the additional information for each measure. The evaluation of the psychometric properties of each measure is summarised as a star rating for construct validity, criterion validity and reliability. Guidance on how to interpret this information is provided below.

The most important decision you make when measuring attainment is determining which construct you want to measure. A test is only valid if it measures what you intend to measure.

Psychometric Properties:

Construct validity, criterion validity and reliability:

In the review, the psychometric properties were rated on a 0-4 point scale:

  • Rating of 0: indicates that insufficient information was available for evaluation
  • Rating of 1: indicates that the available information does not adequately support construct validity/criterion validity/reliability
  • Rating of 2: indicates limited support for construct validity/criterion validity/reliability
  • Rating of 3: indicates adequate evidence for construct validity/criterion validity/reliability
  • Rating of 4: indicates good evidence for construct validity/criterion validity/reliability

Further information about how these ratings were determined is provided in the review.

For the purposes of the database, all ratings are transformed into a star rating:

  • One star is assigned to measures that received the highest ratings of 3 or 4.
  • Half a star is assigned to measures that received a rating of 1 or 2.
  • No star is assigned to measures that received a rating of 0. Note that this means the property could not be evaluated; it should not be taken to imply that the measure is unreliable or invalid.
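
The transformation from review rating to star rating can be written out as a small lookup. The Python sketch below is illustrative only: the function name `stars_for_rating` is hypothetical and is not part of the database.

```python
def stars_for_rating(rating: int) -> float:
    """Map a 0-4 review rating to the star rating shown in the database."""
    if rating not in (0, 1, 2, 3, 4):
        raise ValueError("rating must be a whole number from 0 to 4")
    if rating >= 3:
        return 1.0  # adequate or good evidence
    if rating >= 1:
        return 0.5  # weak or limited support
    return 0.0      # could not be evaluated (not the same as "unreliable")


for rating in range(5):
    print(rating, stars_for_rating(rating))
```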

Construct validity:

Construct validity examines the extent to which the test actually measures what it sets out to measure, or instead partially or mainly measures something else.

Construct validity was rated on a 0-4 point scale in response to the question “Does it adequately measure one or more key constructs of literacy, mathematics or science?” This was evaluated based on information reported in all available sources, combined with a subjective assessment of how well the target construct measures general attainment in each of the subjects.

Note that construct validity is influenced not only by the test properties but also by how you choose to apply a test. This rating of construct validity is only relevant if the construct it was judged against aligns with what you want to measure. The review provides further information about how to determine construct validity.

Criterion validity:

Criterion validity considers the extent to which test scores are related to scores on a real-world measure of the construct. As part of the review, criterion validity was assessed using evidence from comparisons against national key stage tests, GCSEs and A-levels.

There are different ways to assess criterion validity. These include predictive validity (how well performance in the test correlates with future outcomes in key stage assessment), concurrent validity (how well test performance correlates with current outcomes on a criterion measure) and post-dictive validity (how well test performance correlates with prior performance on a criterion measure). Correlations between test scores and criterion scores provide a statistical measure of validity and are reported in the database where available. The review provides further information about how to interpret these correlations.
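
As a rough illustration of what such a correlation is, the Python sketch below computes a Pearson correlation between scores on a test and scores on a criterion measure. The scores are invented, the function name is hypothetical, and the statistics reported for individual measures in the database may have been calculated differently.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores.

    Assumes the scores in each list are not all identical.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Invented scores for six pupils (not real data).
test_scores = [95, 102, 110, 88, 120, 105]
criterion_scores = [98, 100, 112, 90, 118, 101]
print(round(pearson_r(test_scores, criterion_scores), 2))
```

The same calculation applies to predictive, concurrent and post-dictive validity; what differs is whether the criterion scores were collected after, at the same time as, or before the test.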

Reliability:

Reliability refers to the extent to which the test scores are likely to be reproducible. Any measurement is subject to random errors caused by inconsistencies in the examinee’s and examiner’s behaviour, as well as test content. Measures of reliability indicate the degree to which the test is free from such measurement error. There is a range of different statistical tests of reliability. Information on how to interpret these different measures of reliability is provided in the review.

Reliability was rated on a 0-4 point scale in response to the question “Is the test performance reliable?” Reliability was rated based on the overall evidence for reliability, supported by a summary describing the nature and strength of the available evidence.
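
To give a sense of what one reliability statistic looks like, the Python sketch below computes Cronbach's alpha, a widely used measure of internal consistency. It is chosen here only as a familiar example: this guidance does not say which statistics are reported for each measure, and the function name and data are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a table of scores.

    item_scores holds one list per pupil, each containing that pupil's
    score on every item. Assumes pupils' total scores are not all equal.
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    # Variance of each item across pupils.
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of pupils' total scores.
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Invented scores for three pupils on two items (not real data).
print(round(cronbach_alpha([[1, 2], [2, 1], [3, 3]]), 2))
```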

Implementation Utility:

Implementation utility refers to how easily a test can be administered. This information is summarised in the database but is not evaluated because it is subjective for each user and their circumstances. Decisions over which test format is best for you depend on a multitude of factors including the purpose for the assessment, availability of resources (including facilities, money and time), the child(ren) being assessed and the tester. To help you to decide which tests are more suited to your needs, the database includes filters and summary information about administration format, response format, assessor requirements and scoring. What follows is a description of how you might use this information.

Administration format:

  • Additional information about what the test measures – this information is to help you identify the suitability of the test for your purposes.
  • Are additional versions available? For example, paper and digital formats, multiple forms to prevent practice effects in repeat testing.
  • Whether subtests can be administered in isolation – in some contexts you may be interested in using only some of the subscales.
  • Administration group size: Can the test be administered individually and/or in small groups or a whole class?
  • Administration duration: How long does it take to administer the test?
  • Description of materials needed to administer the test (for example, computer, internet access, licence, user manual).
  • Does the test require special testing conditions?

Response format:

  • Response mode: Is the test administered electronically, orally, or with paper and pencil?
  • What devices are required if administered electronically? For example, computer, tablet, headsets.
  • Question format: multiple choice, open ended or mixed?
  • Progress through questions: This may be flat or adaptive. Flat progress means that every examinee taking the same test answers the same questions. In adaptive tests, examinees receive different items depending on their responses within the test itself to avoid questions that are too easy or too hard.
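
The difference between flat and adaptive progress can be sketched in a few lines of Python. This is a deliberately simplified illustration: the "one level up after a correct answer, one level down after an incorrect answer" rule is an assumption made for the example, all names are hypothetical, and real adaptive tests typically use more sophisticated item-selection methods.

```python
def flat_sequence(n_questions):
    """Flat progress: every examinee answers the same questions in order."""
    return list(range(n_questions))


def adaptive_sequence(n_levels, answers_correctly, n_questions):
    """Adaptive progress using a simple up/down rule (an assumption).

    Difficulty levels run from 0 (easiest) to n_levels - 1 (hardest).
    answers_correctly(level) returns True or False for this examinee.
    Returns the difficulty level of each question that was presented.
    """
    level = n_levels // 2  # start in the middle of the difficulty range
    presented = []
    for _ in range(n_questions):
        presented.append(level)
        step = 1 if answers_correctly(level) else -1
        # Move to a harder or easier level, staying within the range.
        level = min(max(level + step, 0), n_levels - 1)
    return presented


# Hypothetical examinee who answers correctly up to difficulty level 6.
print(flat_sequence(6))
print(adaptive_sequence(10, lambda level: level <= 6, n_questions=6))
```

In the adaptive case the difficulty settles around the point where the examinee starts to make errors, so little time is spent on questions that are far too easy or far too hard.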

Assessor requirements:

  • Is any prior knowledge, training or professional accreditation required for administration? Note that the requirements for administration and scoring sometimes differ. Where possible this is also highlighted.
  • Is the administration scripted? This means the assessor reads a script when administering the test.

Scoring:

  • What materials are needed to score the test? For example, user manual, teacher guide, supplementary norms.
  • What are the type and range of available scores? This can include raw scores, deciles, z-scores, standard scores, stens, stanines, or T-scores. The review explains the difference between these scores in detail and provides guidance on how to interpret these scores.
  • What score transformations are available for standard scores? This can include age standardised, cohort standardised or grade standardised scores. More information is provided in the review for guidance on how to interpret these scores.
  • What are the age bands used for norming? Age bands used to convert raw scores to standard scores differ between tests. They are indicated in months.
  • What is the scoring procedure? This can range from computer scoring with machine-readable paper forms, direct entry by the test taker, simple manual scoring or complex scoring that may require training or scoring by the publisher.
  • Is automated scoring available? This indicates whether this is available and in what form. For example, machine reading, computerised, online or bureau service (scored by test provider).
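
Several of the score types listed above are different scalings of the same underlying z-score. The Python sketch below shows the conventional linear conversions. It is a simplification: published tests usually provide norm tables rather than formulas, the standard-score scale of mean 100 and standard deviation 15 is a common convention that individual tests may not follow, and the function names and example norms are hypothetical.

```python
from math import floor


def z_score(raw, norm_mean, norm_sd):
    """Distance of a raw score from the norm-group mean, in SD units."""
    return (raw - norm_mean) / norm_sd


def _clip(value, low, high):
    """Keep value within the range low..high."""
    return max(low, min(high, value))


def derived_scores(z):
    """Convert a z-score to other common score scales."""
    return {
        "z": z,
        "T": 50 + 10 * z,          # mean 50, SD 10
        "standard": 100 + 15 * z,  # mean 100, SD 15 (assumed convention)
        # Sten: mean 5.5, SD 2, rounded to a whole number from 1 to 10.
        "sten": _clip(floor(2 * z + 6), 1, 10),
        # Stanine: mean 5, SD 2, rounded to a whole number from 1 to 9.
        "stanine": _clip(floor(2 * z + 5.5), 1, 9),
    }


# Hypothetical norms: mean raw score 30, SD 5; a pupil scores 35.
print(derived_scores(z_score(35, 30, 5)))
```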

Norms:

This part of the evaluation examines whether the standardisation sample represents the general UK school population well. It includes information about potential sources of bias in sampling which could influence the generalisability of normed scores. This information is descriptive and so is included in the additional summary information rather than rated.

Filters:

The database includes several filters which allow you to search for measures that fulfil your individual requirements.

You can filter measures by subject, key stage for which the tests are appropriate, whether they have recent UK norms, the group size for administration, the response mode and question format.

Filters for group size, response mode and question format are only available for measures that were subject to the full evaluation. Note that the filter options are not mutually exclusive. For example, many tests can be administered either individually or in groups, to suit users’ needs.

By ticking individual boxes you can, for example, look for tests that measure attainment in literacy, are available for pupils in KS1 and KS2, can be administered in small groups, and use a paper and pencil response format (see example below).

[Image: example of filter selections in the AMD]
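
The logic of combining filters can also be sketched in Python. Everything in this example is hypothetical: the records, field names and test names are invented, and the rule that a measure must cover every ticked key stage is an assumption about how the filters combine. Each record stores its options as a set because filter options are not mutually exclusive.

```python
# Invented records for illustration; these are not entries from the database.
measures = [
    {"name": "Test A", "subject": "literacy",
     "key_stages": {"KS1", "KS2"},
     "group_sizes": {"individual", "small group"},
     "response_modes": {"paper and pencil"}},
    {"name": "Test B", "subject": "mathematics",
     "key_stages": {"KS2", "KS3"},
     "group_sizes": {"whole class"},
     "response_modes": {"electronic"}},
    {"name": "Test C", "subject": "literacy",
     "key_stages": {"KS2"},
     "group_sizes": {"small group", "whole class"},
     "response_modes": {"paper and pencil", "oral"}},
]


def filter_measures(records, subject, key_stages, group_size, response_mode):
    """Keep records that match the subject and offer every ticked option."""
    return [
        m for m in records
        if m["subject"] == subject
        and key_stages <= m["key_stages"]  # covers all ticked key stages
        and group_size in m["group_sizes"]
        and response_mode in m["response_modes"]
    ]


matches = filter_measures(
    measures, "literacy", {"KS1", "KS2"}, "small group", "paper and pencil"
)
print([m["name"] for m in matches])
```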

To search for a particular measure, you can enter its full name or abbreviation into the search bar at the top of the database.

The review was conducted by an independent team based on inclusion and exclusion criteria determined a priori for each phase and documented in a protocol. The list of measures included in the database is a result of this particular review, but we do not claim that this is an exhaustive list of all tests available to measure attainment in literacy, mathematics and science.