Assessment Blog

7th May 2024

Positive results from RM Assessment’s live-stream experiment: using AI to translate test items during the Cambridge Assessment Network conference

Lisa Holloway

Gwyneth Toolan’s live workshop on April 17th, 2024, examined the differences between AI translation and HI (human intelligence) translation.

Gwyneth set out to prove that:

AI is not yet ready to automatically translate test items without robust user testing and human checks

In a post-pandemic world and an AI-obsessed landscape, she set out to explore how we might make test items multilingual, in order to expand access to a global candidature while retaining test validity. However, Gwyneth believed that different tools would produce different outcomes, which would be a threat to validity, and that human involvement would be needed to ensure cultural sensitivity, avoid offence and certify accuracy.

To find out more about how language translation is used within RM Assessment’s digital products, please visit our previous blog.

Photo: Cambridge University Press and Assessment Network Conference, Matthew Power Photography

The session was full, so some delegates had to be turned away. At the beginning of the hands-on workshop, Gwyneth divided the participants into four groups, each translating English into Portuguese, German, Greek or Spanish. Each group had a human translator, and the rest of the group had access to three free AI tools, as shown in the slide below:

[Slide: Session Divisions. Copyright RM plc]

Firstly, the groups uncovered flaws in some of the free AI tools. There were also some interesting observations about the validity of even attempting to turn English-based assessments into assessments to be taken in other languages, which wasn’t a surprise.

From this, they were able to discuss the different profiles that languages have on the world stage, which make them fundamentally unequal. For example, using AI to translate French is a very different job from using AI to translate Zulu, Igbo or Icelandic. The AI tools also had issues with consistency and context. Everyone in the experiment could see that a human would need to be involved in any form of translation to ensure the process is robustly tested and validated. Again, this wasn’t a surprise.

Learner experience is at the heart of assessment, and the linguists agreed that using the AI tools would make that experience very poor.

A successful failure achieved

At the end of the experiment, everyone acknowledged that Gwyneth and the four groups had proven the hypothesis:

AI is not yet ready to automatically translate test items without robust user testing and human checks

“To successfully fail is a technologist’s dream! What I hoped we would do in my workshop, was successfully fail, but all within a 30-minute session! This is seen as an agile approach where you run rudimentary tests as your starting point in order to validate assumptions really early on.” 

Gwyneth Toolan, Innovation Product Manager, RM Assessment


What recommendations did the workshop participants make?

Participants were keen to look at LLMs (large language models) instead of translation tools, and to treat such tools as assistive technology with scope to produce unique adaptive items. How humans would mark any such unique items is a question that would need a lot more interrogation, of course.

Delegates also suggested that an assistive AI tool could be used to create large banks of items to support the creation of assessments based on a given syllabus in a given construct, for example GCSE Biology. The machine could be fed any syllabus content as a starting point. Humans might then assemble such items into viable assessments, rather than generating all the items manually, and would still ‘quality control’ the entire process. For Gwyneth, this was a surprise, but a pleasant one!

It was also suggested that items should be vetted much earlier in the exam creation journey than they are now, and by candidates, thus employing some agile principles of user testing. Many people implied that assistive tools could be taught a rubric of terms to create a standard assessment profile and remove the threats to validity inherent in disparate terminology use. All brilliant future-proofing ideas to help humans. Gwyneth certainly didn’t come away fearing that AI was going to take over our assessment landscape and render us all unemployed! It was quite the opposite.

Gwyneth was hugely grateful to the wider assessment community for participating in this experiment, to the human translators, Pia, Simao, Filio and Carmen, and to all the staff who supported the set-up and live streaming.

About Gwyneth Toolan, Innovation Product Manager, RM Assessment.


Gwyneth started her career in English teaching, working internationally and then in a variety of UK state schools in diverse roles, including Head of Sixth Form. After leaving teaching, Gwyneth moved into the assessment world at Cambridge International, embarking on the Postgraduate Advanced Certificate in Educational Studies: Educational Assessment at Cambridge and managing syllabuses including Sociology, Psychology and Development Studies. This breadth of educational experience led her to work in innovation and training at RM, where she now leads on product strategy for technical content and on customer training and onboarding. Last year Gwyneth led the Assessment Malpractice service in RM’s innovation department, a service which went on to win an e-AA award for ‘most innovative use of technology in assessment’. Gwyneth is interested in the practical implications of AI, its limitations, and the need for all technological advancement to be underpinned by human intelligence, user testing, ethics and empathy.

© 2020 RM Education Ltd. All rights reserved.