Cluster C – Exam Evaluation Innovation – AI Assisted Grading

AI-based grading compared with human assessment (AI Assisted Grading)

The integration of artificial intelligence into the grading of commercial vocational examinations (AI Assisted Grading) offers considerable potential for greater scalability, objectivity, and faster feedback. A systematic empirical comparison between experienced examiners and an advanced AI system (Claude Opus 4.6) shows a high level of agreement in the assessment of open-response formats. For examination administration and quality assurance, the statistical analyses point to three fundamental implications:

  • Error-free formal pass/fail classification: The AI matched the human grade exactly in 77.8% of cases (14 of 18); the remaining four grades deviated by at most one grade level. This is particularly relevant for legal certainty: in no case did the AI produce a pass or fail decision that differed from the human one. All 18 candidates were classified in complete agreement with the human assessment (14 passed, 4 failed).
  • Excellent statistical validity: The indicators demonstrate highly linear and consistent grading behaviour. A correlation coefficient of r = 0.973 indicates a very strong linear association between the total scores, and an overall ICC of 0.875 indicates good agreement in absolute point scores. The mean absolute error per sub-task (MAE = 0.40 points) is below one grading step (0.5 points).
  • Limits of full automation: AI-generated model solutions reached an accuracy rate of 73.5% (25 of 34 sub-tasks), which is insufficient for full automation. While AI performs strongly in structured calculation tasks, legal classifications, and clearly defined subject concepts, it remains susceptible to context blindness and overgeneralisation in highly context-dependent tasks and in tasks requiring deep reflection or judgement. The most appropriate use case is therefore as an assistive assessment instance, for example in preliminary grading or quality assurance.
Data basis

18 IHK examinations from 5 chamber districts

Industrial clerks · Business processes · Summer 2025

Human–AI comparison:

Experienced examiners ↔ Claude Opus 4.6 (AI tool)

Comparison design
  • 34 sub-tasks per examination
  • 6 task blocks · max. 100 points
  • Assessment basis: AkA model solution provided by the expert committees
Results
Exact grade match:
77.8% (14/18)
Grade deviation (max. one grade level):
22.2% (4/18)
Pass/fail decision correctly assigned:
18/18 (100%)

(14 passed + 4 failed)

In none of the tested cases would the AI have produced a different pass or fail decision.
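
The rates above follow directly from the reported counts. A minimal Python check, using only the figures stated in this section (the variable and function names are illustrative):

```python
# Counts reported in the results above.
total_candidates = 18
exact_grade_matches = 14
one_level_deviations = 4
consistent_pass_fail = 18


def rate(count, total):
    """Share of `count` in `total` as a percentage, rounded to one decimal."""
    return round(100 * count / total, 1)


exact_rate = rate(exact_grade_matches, total_candidates)
deviation_rate = rate(one_level_deviations, total_candidates)
pass_fail_rate = rate(consistent_pass_fail, total_candidates)

# Every candidate is either an exact match or a one-level deviation.
assert exact_grade_matches + one_level_deviations == total_candidates

print(exact_rate, deviation_rate, pass_fail_rate)
```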
Central validation model: AI as an assistive assessment instance

Point of departure

  • The potential of AI is particularly evident in open-response formats and performance tasks
  • It promises scalability, objectivity, consistency, and rapid feedback
  • The quality required for high-stakes examinations remains the central question

Guiding questions

  1. Does AI generate comparable point scores?
  2. Do grades and pass/fail decisions remain stable?
  3. For which task types is agreement high or low?

Validation logic

  1. Parallel grading by human examiners and AI
  2. Multi-level comparison (points, grades, pass/fail decisions)
  3. Analysis of deviation patterns
  4. Derivation of use scenarios (pre-grading, second grading, quality assurance)
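
The multi-level comparison in steps 1 to 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's actual procedure: the grade key and the 50-point pass threshold are assumed (the source does not state them), and the point totals are invented example values.

```python
# ASSUMPTION: point-to-grade key (lower bound in points -> grade).
# The source does not state the key that was used.
GRADE_KEY = [(92, 1), (81, 2), (67, 3), (50, 4), (30, 5), (0, 6)]
# ASSUMPTION: 50 of 100 points are required to pass.
PASS_THRESHOLD = 50


def to_grade(points):
    """Map a total score (0-100) to a grade using the assumed key."""
    for lower_bound, grade in GRADE_KEY:
        if points >= lower_bound:
            return grade
    raise ValueError("points must be non-negative")


def compare(human_totals, ai_totals):
    """Compare each examination on three levels: points, grade, pass/fail."""
    rows = []
    for case_id, (h, a) in enumerate(zip(human_totals, ai_totals), start=1):
        rows.append({
            "case": case_id,
            "point_diff": a - h,
            "grade_diff": to_grade(a) - to_grade(h),
            "pass_fail_agrees": (h >= PASS_THRESHOLD) == (a >= PASS_THRESHOLD),
        })
    return rows


def summarise(rows):
    """Aggregate the per-case comparison and list cases for human review."""
    return {
        "mean_abs_point_diff": sum(abs(r["point_diff"]) for r in rows) / len(rows),
        "exact_grade_matches": sum(r["grade_diff"] == 0 for r in rows),
        "max_grade_deviation": max(abs(r["grade_diff"]) for r in rows),
        "pass_fail_agreements": sum(r["pass_fail_agrees"] for r in rows),
        "flag_for_review": [
            r["case"] for r in rows
            if r["grade_diff"] != 0 or not r["pass_fail_agrees"]
        ],
    }


# Invented example totals (NOT data from the study).
human = [85, 48, 67, 91]
ai = [83, 47, 66, 92]
summary = summarise(compare(human, ai))
print(summary)
```

Flagging every case with a grade or pass/fail difference corresponds to the pre-grading and second-grading scenarios in step 4: the AI result is only a prompt for human review.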
AI-generated model solutions

Strengths:

They demonstrate a sound understanding of business-related concepts and are well suited to calculation tasks, legal classifications, and clearly defined subject concepts.

Limitations:

They remain vulnerable in context-rich tasks, complex examination settings, and tasks requiring reflection or judgement, particularly due to the risk of context-related blindness or overgeneralisation.
Statistical indicators

Pearson r = 0.973

Measures the strength of the linear association between the total scores assigned by human examiners and AI.

→ Very strong linear association

Overall ICC = 0.875

Intraclass correlation; examines the extent to which exactly the same point scores were assigned.

→ Good absolute agreement

MAE = 0.40 points

Mean absolute deviation per sub-task in the direct comparison.

→ Less than one grading step (0.5 points)

RMSE = 0.70 points

Root mean square error; gives greater weight to larger deviations.

→ No extreme individual deviations observed
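
The four indicators can be computed as in the following Python sketch. Two points are assumptions: the source does not name the ICC variant, so a two-way, absolute-agreement, single-rater ICC is used here; and the scores are invented example values, not study data. In the study, r refers to total scores and the MAE to sub-task scores, whereas the sketch applies all four measures to one paired list for brevity.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: strength of the linear association."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mae(x, y):
    """Mean absolute error between paired scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def rmse(x, y):
    """Root mean square error; weights larger deviations more heavily."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def icc_absolute_agreement(x, y):
    """ASSUMED variant: two-way, absolute agreement, single rater, two raters."""
    n, k = len(x), 2
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [(a + b) / 2 for a, b in zip(x, y)]
    col_means = [sum(x) / n, sum(y) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for v in list(x) + list(y))
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + (k / n) * (ms_cols - ms_error)
    )


# Invented example: the AI scores every item exactly one point higher.
human = [2.0, 4.0, 6.0, 8.0]
ai = [3.0, 5.0, 7.0, 9.0]

r = pearson_r(human, ai)
icc = icc_absolute_agreement(human, ai)
print(r, icc, mae(human, ai), rmse(human, ai))
```

In this example r equals 1 (up to floating-point rounding) while the ICC is about 0.93, because a constant offset leaves the linear association intact but reduces absolute agreement. This is why the two indicators are reported separately.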
AI-generated model solution: 73.5% correct (25/34)
→ Insufficient for full automation.