Cluster A – Exam Question Innovation – Smart Tasks

Cluster A: Exam Question Innovation

for commercial vocational examination tasks

The dashboard below presents the empirical findings on the evaluation and innovation of commercial vocational examination tasks in the context of artificial intelligence. Drawing on central explanatory models and a traffic-light classification, it illustrates which task types remain suitable for examination practice.

Data basis & AI analysis process

The empirical basis of the study comprises N=102 examination tasks. In the AI analysis process, a single standardised test prompt was applied across 14 AI systems. The prompt asked each system to evaluate the tasks from a didactic perspective rather than merely solve them. The generated results were then aggregated and analysed.

Examination suitability & traffic-light classification

Category   Tasks   Share
Green      26      25.5%
Yellow     52      51.0%
Red        24      23.5%

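The reported shares follow directly from the three counts and N=102. The short check below recomputes them; the variable names are illustrative and not part of the study.

```python
# Traffic-light counts as reported in the dashboard.
counts = {"Green": 26, "Yellow": 52, "Red": 24}

# The counts should add up to the sample size N = 102.
total = sum(counts.values())

# Share of each category in percent, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {counts[label]} of {total} tasks ({share}%)")
```

The rounded values match the percentages shown in the dashboard, so counts and shares are mutually consistent.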
Central effect model

Empirically significant effects on the overall evaluation of an examination task:

1. Judgement competence

The more human judgement a task requires, the more suitable it is for examination purposes.

2. Openness

The more open-ended a task is, the more likely it is to remain suitable for examinations despite AI.

3. AI susceptibility to error

The more readily AI reaches its limits on a task, the more suitable the task is for examination use.

Non-significant examined effects

Characteristics without empirical significance in the model:

Reflective competence
Problem-solving competence
Numeracy/Data Literacy
Reading competence
Integrative competence
Contextual embeddedness
Need for argumentation
Need for interpretation

Assessment logic of the traffic-light scheme

Green: suitable for examination

High openness, argumentative contextualisation, and strong judgement and reflective competence. AI produces incomplete solutions and is prone to error.

Yellow: suitable for examination with limitations

Partially open-ended. AI provides superficial answers. High reading competence is required, and the AI-generated solution must be corrected or expanded.

Red: unsuitable for examination

Closed, knowledge-based or computational tasks. AI generates consistent and complete solutions. Low numeracy and data literacy demands.

The Augmentation Trap Model

A dynamic analysis of AI productivity and the erosion of expertise

Introduction

The use of artificial intelligence can raise productivity in the short run, but it can also weaken workers' expertise when core cognitive processes are delegated to the tool. Caosun and Aral (2026) develop a dynamic model that captures this tension between immediate productivity gains and skill erosion. Two parameters are central: the skill-neutral productivity effect α, which is independent of the user's expertise, and the knowledge-complementary productivity effect β, which scales with expertise. The delegation intensity u describes how much of the task is handed over to AI.

Production function

y(u, S) = (1 - u)S + (α + βS - γu)u

The production function combines a human contribution that declines as AI delegation increases with a productivity effect of AI use that depends on usage intensity and the worker's skill level. S denotes the worker's current skill, while u denotes the share of the task delegated to AI. The first component (1 - u)S is the remaining human contribution: the more work is delegated, the smaller this contribution becomes. The second component (α + βS - γu)u captures the productivity effect of AI use. Here, α is the skill-independent gain from tasks that the AI can handle largely on its own. β captures the knowledge-complementary gain: cases in which the quality of AI output depends on the worker's judgment. γ imposes diminishing marginal returns to AI use, because the easiest-to-delegate tasks are transferred to the AI first.
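A minimal sketch of the production function in Python may help make the roles of α, β and γ concrete. The function `myopic_delegation` is an assumption added for illustration: it maximises output in a single period for a fixed skill level and is not the dynamic policy u*(S) derived by the authors. The parameter values are likewise invented.

```python
def output(u, S, alpha, beta, gamma):
    """Production function y(u, S) = (1 - u) S + (alpha + beta S - gamma u) u."""
    human_part = (1 - u) * S                      # remaining human contribution
    ai_part = (alpha + beta * S - gamma * u) * u  # productivity effect of AI use
    return human_part + ai_part


def myopic_delegation(S, alpha, beta, gamma):
    """ASSUMPTION: one-period maximiser of y over u in [0, 1], ignoring skill dynamics.

    Setting dy/du = -S + alpha + beta*S - 2*gamma*u to zero gives
    u = (alpha + (beta - 1) * S) / (2 * gamma), which requires gamma > 0.
    """
    u = (alpha + (beta - 1) * S) / (2 * gamma)
    return min(1.0, max(0.0, u))


# Illustrative (hypothetical) parameter values with beta < 1.
alpha, beta, gamma = 0.6, 0.8, 0.5

u_expert = myopic_delegation(1.0, alpha, beta, gamma)
u_novice = myopic_delegation(0.5, alpha, beta, gamma)
y_expert = output(u_expert, 1.0, alpha, beta, gamma)

print(f"expert delegates {u_expert:.2f}, novice delegates {u_novice:.2f}")
print(f"expert output with AI: {y_expert:.2f} (without AI: 1.00)")
```

With these values the less skilled worker delegates more than the expert, which is the pattern the model associates with β < 1.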

The skill dynamic arises because high delegation reduces opportunities for independent practice. The optimal delegation policy u*(S) therefore depends not only on α, β and γ, but also on the skill recovery or forgetting rate κ and the decision-maker's evaluation horizon. When β > 1, complementarity dominates and experts gain more from AI use. When β < 1, AI is more substitutive, giving more experienced workers weaker incentives to delegate.

The two AI productivity channels

Skill-neutral channel α

This component delivers value independently of the user's expertise. Examples include the automated drafting of routine text, standard data preparation or form completion. It can raise output even for novices, but it contributes little to learning or skill formation.

Knowledge-complementary channel β

This component increases with the user's expertise. Experienced users can direct, evaluate and refine the tool more effectively, producing higher-quality results. Examples include complex programming, diagnostic reasoning or research-intensive work.

Skill recovery rate κ

This parameter describes how quickly skills are rebuilt through active practice or lost under sustained delegation. Deliberate practice, mentoring and unassisted work phases can reduce the long-run cost of cognitive offloading.

Key results of the model

Steady-state loss

Even when AI raises short-run output, the long-run steady state can fall below the no-AI benchmark because the worker's expertise erodes over time.
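The steady-state loss can be illustrated with a toy simulation. The source does not state the law of motion for skill, so the update rule below is an assumption made purely for illustration: skill drifts at rate κ toward (1 - u) times a hypothetical maximum skill `S_max`. Delegation is chosen myopically each period, which is also an assumption and not the authors' optimal policy. All parameter values are invented.

```python
def output(u, S, alpha, beta, gamma):
    """Production function y(u, S) = (1 - u) S + (alpha + beta S - gamma u) u."""
    return (1 - u) * S + (alpha + beta * S - gamma * u) * u


def myopic_delegation(S, alpha, beta, gamma):
    """ASSUMPTION: maximise current output only, with u clipped to [0, 1]."""
    u = (alpha + (beta - 1) * S) / (2 * gamma)
    return min(1.0, max(0.0, u))


def simulate(alpha, beta, gamma, kappa, S_max=1.0, periods=500):
    """Return a list of (u, S, y) per period under the assumed skill dynamics."""
    S = S_max  # the worker starts at full skill
    path = []
    for _ in range(periods):
        u = myopic_delegation(S, alpha, beta, gamma)
        path.append((u, S, output(u, S, alpha, beta, gamma)))
        # ASSUMED law of motion (not from the source): skill moves toward
        # (1 - u) * S_max at rate kappa, so more delegation means less practice.
        S += kappa * ((1 - u) * S_max - S)
    return path


path = simulate(alpha=0.6, beta=0.8, gamma=0.5, kappa=0.1)
first_output = path[0][2]
final_u, final_S, final_output = path[-1]

print(f"first period output: {first_output:.3f}")
print(f"long-run skill: {final_S:.3f}, delegation: {final_u:.3f}, output: {final_output:.3f}")
```

Under these assumptions output starts above the no-AI benchmark of 1.0 and settles well below it as skill erodes, which is the qualitative pattern described above. The exact numbers depend entirely on the assumed dynamics.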

Augmentation trap

When decision-makers have short evaluation horizons or ignore the private value of skill, they may choose excessive delegation, shifting the long-run cost of skill atrophy to workers.

Complementarity vs. substitution

When β > 1, AI and human expertise reinforce each other. When β = 1, the AI effect is skill-neutral. When β < 1, AI substitutes for skill and changes who delegates most.
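Why the threshold sits at β = 1 rather than at β = 0 can be read off the production function y(u, S) = (1 - u)S + (α + βS - γu)u. The rearrangement below is elementary algebra and introduces no new assumptions.

```latex
\begin{align}
y(u, S) &= (1 - u)\,S + (\alpha + \beta S - \gamma u)\,u \\
        &= S + \bigl(\alpha + (\beta - 1)\,S - \gamma u\bigr)\,u, \\
\frac{\partial}{\partial S}\bigl[\,y(u, S) - S\,\bigr] &= (\beta - 1)\,u .
\end{align}
```

The term y(u, S) - S is the net gain from AI use relative to working unassisted. Each delegated unit gives up one unit of human contribution S and returns βS, so for u > 0 the gain rises with skill when β > 1, is independent of skill when β = 1, and falls with skill when β < 1.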

Skill divergence

Especially under low β, the workforce can split: experienced workers preserve expertise and continue to benefit, while less experienced workers delegate more and may deskill.

Five regimes of AI deployment

The interaction of α, β and the skill recovery rate κ partitions the parameter space into five regimes with different long-run implications for output and expertise:

Region I: Non-adoption

α and β are too low to justify AI use. The productivity gain does not offset coordination and delegation costs, so no adoption is optimal.

Region II: Augmentation (worse off)

Between the adoption boundary C0 and the long-run break-even boundary B, AI use is attractive in the short run, but the long-run state is worse than the no-AI benchmark.

Region III: Automation (worse off)

AI is productive enough to justify full automation (u = 1), but its raw output α remains below human potential. Expertise collapses and long-run output falls.

Region IV: Augmentation (better off)

With high β, AI complements human judgment. Skill remains valuable, and long-run productivity exceeds the no-AI benchmark.

Region V: Automation (better off)

The skill-neutral AI contribution α is high enough to dominate human potential, making full automation both rational and beneficial for highly standardized tasks.

Design implications

To preserve the long-run benefits of AI and avoid the augmentation trap, the authors point to design and governance choices that raise complementarity and protect skill formation:

Deliberate practice phases: Integrate unassisted work periods and mentoring to maintain and rebuild skill.
Longer evaluation horizons: Assess performance over longer time spans so that short-run output gains do not crowd out expertise development.
Explainability and reasoning: Encourage workers to document their reasoning paths, for example through structured prompts, so that the tool supports rather than replaces thought.
Graduated autonomy: Adapt AI use to the user's expertise, with scaffolding for novices and critical review and judgment-intensive control for experts.
Adaptive difficulty: Increase task complexity dynamically so that AI assistance continues to create learning opportunities.

Source: Caosun, Michael & Aral, Sinan (2026): The Augmentation Trap: AI Productivity and the Cost of Cognitive Offloading, arXiv (Cornell University). DOI: 10.48550/arxiv.2604.03501

Cluster assignment

Best fit: Cluster A (Exam Question Innovation / Smart Tasks). The augmentation trap model provides a theoretical rationale for why examination tasks in an AI-rich environment should not focus solely on the production of a finished commercial work product. Cluster A is concerned with authentic, AI-secure performance tasks in which human solutions remain qualitatively superior because they require openness, contextual understanding, judgment, reflection and critical evaluation of AI outputs. The model connects directly to this logic. Tasks with a high skill-neutral component α are easy to automate and therefore didactically fragile. Tasks with a high knowledge-complementary component β make expertise visible, because solution quality depends on professional interpretation, plausibility checking and justified decision-making.