
How generative AI could improve rather than threaten assessment 

NCFE GenAI report article

Since the advent of ChatGPT in November 2022, a major fear has been that the integrity of remote assessment in higher and further education is under threat. Several studies have shown that generative AI (GAI) is capable of passing exams in a variety of subjects.

Funded by the NCFE Assessment Innovation Fund, a team from The Open University (OU) has been investigating the robustness of different types of assessment in the light of GAI. The findings confirm problems with GAI detection and provide pointers towards designing assessment in ways that not only are more robust against GAI but also focus more squarely on the human ‘value-added-ness’ that educators really want to assess.

The findings confirm problems with Generative AI detection

The full research report contains details of the method and findings. In this piece, we share our key takeaways from those findings.

One surprising finding was that the wide variety of assessment types did not prevent GAI from passing nearly all of them. There was one exception (a role-play question supported by a clear marking rubric) where all the GAI scripts were awarded a fail. However, GAI grades declined significantly as the level of study increased, with GAI being less adept at higher-level critical-thinking skills.

Another surprise was that, whilst training on GAI hallmarks (which was highly valued by markers) led to a strong improvement in markers’ ability to detect GAI scripts, the proportion of genuine student scripts incorrectly attributed to GAI (false positives) also increased markedly. The research suggested that the identified ‘hallmarks’ of GAI are more useful for guiding the design of questions and marking guidance than for GAI detection.

GAI hallmarks

While the GAI hallmarks were also present in student scripts, some were more common in the GAI scripts, such as a failure to use the module materials (at the OU, these are behind a paywall and so, for now, less exposed to GAI). Moreover, some hallmarks manifested in different ways between GAI and students. For example, as one marker reported: ‘GAI tends to provide both sides of an argument and then it’s not coming to a conclusion, while students just fail to put forward a convincing argument throughout.’

For providers, the authors recommend that when assessors identify these hallmarks in student work, the focus should be on providing additional support to the learner, regardless of whether the issue stems from AI misuse or poor study skills. This is because the report found that GAI answers and low-performing learners’ answers shared these hallmarks. Although this approach might be seen as controversial, it offers a practical solution to the problem of detection and helps improve learners’ study skills.

Combining the lowest GAI grades with the best detection performance (both in terms of GAI detection and low false positives), the question types that stood out as more robust included:

  • Role-play questions, where students need to critically select and apply what they have learned in the module to realistic scenarios.  
  • Reflection on work practice, provided it requires specific, evidenced examples, in contrast to the superficial answers provided by GAI.

These question types align with what is often referred to as ‘authentic assessment’. The research suggests that remote assessment could be made more robust by designing questions that require application of what has been taught, evidential backing and specific conclusions, all supported by strong marking guidance.

This research builds on work that has previously shown that GAI can pass written assessments. The challenge these new findings throw down to the education sector is two-fold. Firstly, can we increase our use of authentic assessment types, which not only help mitigate the misuse of GAI by learners but also benefit learners as they enter the workforce?

Secondly, is the sector ready to show learners how to use AI in assessment effectively, whilst ensuring that the evidence generated remains a valid and reliable indicator of what a learner knows or can do?

By Jonquil Lowe, Senior Lecturer in Economics and Personal Finance at The Open University

Liz Hardie, Director of SCiLAB and Senior Lecturer in Open Justice at The Open University

Gray Mytton, Assessment Innovation Manager at NCFE

The full report can be accessed at www.ncfe.org.uk/help-shape-the-future-of-learning-and-assessment/aif-pilots/the-open-university/ 

