Cluster C – Exam Evaluation Innovation – AI-Assisted Grading

Cluster C: Exam Evaluation Innovation

AI-based grading compared with human assessment (AI-Assisted Grading)

The integration of artificial intelligence into grading processes in commercial vocational examinations (AI-Assisted Grading) offers considerable potential for increasing scalability, objectivity, and the speed of feedback. A systematic empirical comparison between experienced examiners and an advanced AI system, Claude Opus 4.6, reveals a remarkably high level of agreement in the assessment of open-response formats. For examination administration and quality assurance, the statistical analyses point to three fundamental implications:

  • Error-free formal pass/fail classification: The AI matched the human grade exactly in 77.8% of cases (14 of 18); in the remaining four cases it deviated by at most one grade level. This is particularly relevant for legal certainty: in no case did the AI produce an incorrect pass or fail decision. All 18 candidates were classified in complete agreement with the human assessment (14 passed, 4 failed).
  • Excellent statistical validity: The indicators demonstrate highly linear and consistent grading behaviour. A correlation coefficient of r = 0.973 indicates a very strong linear association between the total scores of human examiners and AI, and an overall ICC of 0.875 indicates good agreement in absolute point scores. The mean absolute error (MAE = 0.40 points) is below a single grading step of 0.5 points.
  • Limits of full automation: AI-generated model solutions reached an accuracy rate of 73.5%, which is insufficient for autonomous full automation. While AI shows strong performance in structured calculation tasks, legal classifications, and clearly defined subject concepts, it remains susceptible to context-related blindness and overgeneralisation in highly context-dependent tasks and in tasks requiring deep reflection or judgement. The most appropriate use case is therefore as an assistive assessment instance, for example in preliminary grading or quality assurance.
Data basis

18 IHK examinations from 5 chamber districts

Industrial clerks · Business processes · Summer 2025

Human–AI comparison:

Experienced examiners ↔ Claude Opus 4.6 (AI tool)

Comparison design
  • 34 sub-tasks per examination
  • 6 task blocks · max. 100 points
  • Assessment basis: AkA model solution provided by the expert committees
Results
Exact grade match:
77.8% (14/18)
Grade deviation (max. one grade level):
22.2% (4/18)
Pass/fail decision correctly assigned:
18/18 (100%)

(14 passed + 4 failed)

In none of the tested cases would the AI have produced a different pass or fail decision.
Central validation model: AI as an assistive assessment instance

Point of departure

  • The potential of AI is particularly evident in open-response formats and performance tasks
  • It promises scalability, objectivity, consistency, and rapid feedback
  • The quality required for high-stakes examinations remains the central question

Guiding questions

  1. Does AI generate comparable point scores?
  2. Do grades and pass/fail decisions remain stable?
  3. For which task types is agreement high or low?

Validation logic

  1. Parallel grading by human examiners and AI
  2. Multi-level comparison (points, grades, pass/fail decisions)
  3. Analysis of deviation patterns
  4. Derivation of use scenarios (pre-grading, second grading, quality assurance)
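
The multi-level comparison in step 2 can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study: the grade key, the pass mark of 50 points, the function names and the sample scores are all assumptions introduced here, and real pass rules may involve further conditions.

```python
# Assumed grade key (lower bound in points -> grade). The source does not
# state the key that was used; this one is an illustrative assumption.
GRADE_KEY = [(92, 1), (81, 2), (67, 3), (50, 4), (30, 5), (0, 6)]
PASS_MARK = 50  # assumption: at least 50 of 100 points are needed to pass


def grade(points):
    """Map a total score (0-100) to a grade using the assumed key."""
    for lower_bound, g in GRADE_KEY:
        if points >= lower_bound:
            return g
    raise ValueError("points must be non-negative")


def compare(human_totals, ai_totals):
    """Compare two gradings at the point, grade and pass/fail level."""
    pairs = list(zip(human_totals, ai_totals))
    n = len(pairs)
    grade_diffs = [abs(grade(h) - grade(a)) for h, a in pairs]
    same_decision = [(h >= PASS_MARK) == (a >= PASS_MARK) for h, a in pairs]
    return {
        "mean_abs_point_diff": sum(abs(h - a) for h, a in pairs) / n,
        "exact_grade_match": sum(d == 0 for d in grade_diffs) / n,
        "max_grade_deviation": max(grade_diffs),
        "pass_fail_agreement": sum(same_decision) / n,
    }


# Hypothetical total scores for five candidates (not data from the study).
human = [85, 72, 48, 91, 66]
ai = [83, 70, 45, 93, 68]
print(compare(human, ai))
```

With these made-up scores, three of five grades match exactly, the largest deviation is one grade level, and every pass/fail decision agrees. This shows how grades can differ near a grade boundary while the pass/fail outcome stays stable.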
AI-generated model solutions

Strengths:

They demonstrate a sound understanding of business-related concepts and are well suited to calculation tasks, legal classifications, and clearly defined subject concepts.

Limitations:

They remain vulnerable in context-rich tasks, complex examination settings, and tasks requiring reflection or judgement, particularly due to the risk of context-related blindness or overgeneralisation.
Statistical indicators

Pearson r = 0.973

Measures the strength of the linear association between the total scores assigned by human examiners and AI.

→ Very strong linear association

Overall ICC = 0.875

Intraclass correlation; examines the extent to which exactly the same point scores were assigned.

→ Good absolute agreement

MAE = 0.40 points

Mean absolute deviation per sub-task in the direct comparison.

→ Less than one grading step (0.5 points)

RMSE = 0.70 points

Root mean square error; gives greater weight to larger deviations.

→ No extreme individual deviations observed
AI-generated model solution: 73.5% correct (25/34)
→ Insufficient for full automation.
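
The four indicators above can be computed as in the following sketch, which assumes two equally long lists of scores (human and AI). The source does not say which ICC variant was used; the two-way, absolute-agreement, single-rater form shown here is an assumption chosen because it penalises systematic offsets between raters. Function names and sample data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Strength of the linear association between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mae(x, y):
    """Mean absolute error between paired scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def rmse(x, y):
    """Root mean square error; larger deviations weigh more."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def icc_absolute(x, y):
    """Two-way, absolute-agreement, single-rater ICC for two raters
    (assumed variant; the source does not specify one)."""
    n, k = len(x), 2
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [(a + b) / k for a, b in zip(x, y)]
    col_means = [sum(x) / n, sum(y) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for v in list(x) + list(y))
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)              # between-subject mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical scores: the second rater is always exactly one point higher.
human = [1, 2, 3, 4, 5]
ai = [2, 3, 4, 5, 6]
print(pearson_r(human, ai), icc_absolute(human, ai), mae(human, ai), rmse(human, ai))
```

In this constructed example the correlation is perfect, yet the ICC falls below 1 because the two raters never assign the same score. This is why the correlation and the ICC are reported side by side.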

When AI Agents Pass Exams

On the validity of digital online tests in the age of autonomous AI systems

Background

Classical online tests are usually based on asynchronous multiple-choice tasks embedded in learning platforms and serve as indicators of competence. With the rise of autonomous AI agent systems that independently navigate learning paths and complete examinations on behalf of their users, this model is increasingly called into question. The authors emphasise that these systems no longer merely respond to prompts like chatbots, but autonomously move through complex examination scenarios and thereby severely undermine the credibility of traditional assessment formats.

Agentic AI Compared with Chatbots

Whereas conventional chatbots primarily react to user input, agentic AI systems act independently: they receive a goal-oriented task, navigate learning platforms on their own, identify the tools required and complete complex examination tasks step by step. For example, they can search online courses, scan texts and automatically log in to external systems in order to use additional data. Even security mechanisms such as CAPTCHAs can be bypassed with the help of current models.

Evidence from Practice

Faster Course Completion

In field trials conducted by the Zukunftslabor Generative KI, autonomous agents completed standardised online courses on the AI Act considerably faster than human participants. While humans require around four hours, specialised agents complete the courses in roughly 90 minutes and achieve top scores of more than 90 percent. These examples indicate that the time-intensive examination mode may become obsolete.

Transferability

Further tests demonstrated transferability: agents mastered certifications in project management (Scrum), complex theoretical examinations for drone pilot licensing and an English language test at the highest C2 level. Even complex integral problems in Moodle learning environments were solved reliably. This underscores the disruptive potential of agentic AI in education.

Dead-Loop Learning: The Process

The article describes an automated process that the authors call “Dead-Loop Learning”. The procedure can be divided into four phases:

1. Creation: AI generates course content and learning paths.
2. Completion: The agent independently completes tasks and examinations.
3. Validation: A testing system checks the solutions and assigns points.
4. Certification: A certificate is issued without any human intervention.

Because the agent both generates learning material and solves and evaluates tasks, a closed examination loop emerges in which human control is barely present. This jeopardises the evidentiary value of online tests as proof of individual competence.

Implications for Teaching and Assessment

New Competence Priorities

With the idea of “new skilling”, reflective capacities, ethical awareness and sovereign interaction with AI move to the foreground. Educators will need to place stronger emphasis on argumentative justification and contextual understanding rather than on the mere retrieval of results.

Necessary Infrastructure

Improved digital infrastructure (for example, learning management systems such as Moodle) and self-hosted or on-premise operation become decisive for maintaining data sovereignty and ensuring a reliable examination environment. External proctoring services can thereby be replaced.

Recommendations

  • Rethink assessment design: Instead of standardised multiple-choice tests, tasks should require reflective argumentation, transfer performance and open-ended solutions that agents cannot independently generate.
  • Expand digital infrastructure: Invest in secure, high-performance learning platforms and local hosting solutions in order to keep data and processes controllable.
  • Promote digital competences: Train educators and learners in the critical use of AI as well as in ethical and legal questions, enabling a conscious interaction between humans and machines.
  • Use hybrid assessment formats: Combine digital tests with in-person examinations in order to ensure personal interaction and authenticity.
Source: Doris Wessels & Miriam Maibaum (2026): “Mit einem Klick: Warum agentische KI-Systeme das Ende klassischer Onlinetests einleiten”, in Künstliche Intelligenz – Forschung & Lehre, May 2026.
Cluster Assignment

Cluster C – Innovation in Examination Evaluation / AI-Assisted Grading.

The contribution fits Cluster C because it fundamentally problematises the robustness of existing digital assessment and examination arrangements. If agentic AI can autonomously complete standardised online examinations, the central question becomes under what conditions assessment results can still be considered valid, fair and meaningful. This issue touches the core of Cluster C: the analysis of AI-related assessment structures, their limits and the requirements for robust and quality-assured examination procedures.