Prof. Dr. Elke Sumfleth - Research Interests


Competence Measurement




Evaluation of the German National Educational Standards in Science

In response to the findings of international large-scale assessments such as TIMSS and PISA, the Assembly of German Ministers of Education agreed on educational standards as normative guidelines for secondary schools. For the three science subjects biology, chemistry and physics, the national educational standards define four areas of competence: content knowledge, acquirement of knowledge, communication, and evaluation & judgement.

Tests for all areas of competence are currently being developed in the German project ESNaS (Evaluation der Standards in den Naturwissenschaften für die Sekundarstufe I - Evaluation of the National Educational Standards for Natural Sciences at the Lower Secondary Level). Several science educators work on the theoretical framework and assist with the item development and the empirical testing of the items. The science educators involved are:

Jürgen Mayer (University of Kassel, Biology), Maik Walpuski (University of Osnabrück, Chemistry), Elke Sumfleth (University of Duisburg-Essen, Chemistry), Hans Fischer (University of Duisburg-Essen, Physics) and Alexander Kauertz (College of Education Weingarten, Physics). The project is supervised by the Institute of Educational Progress (IQB) in Berlin. The items are tested in a nationwide pilot study in 2010 and used for a nationwide assessment in 2012.

For large-scale assessments, competence has to be operationalised in competence models. In most cases competence models use several dimensions, one of which usually describes the content of the tasks. A certain competence therefore combines a specified content with other dimensions which, for example, describe the difficulty. As a result, competence describes more than just having knowledge in a domain. In contrast to other large-scale assessments such as PISA, which define levels of achievement post hoc, the evaluation of the German national standards uses a model of competence in order to define the difficulty of tasks a priori.

Funded by IQB


Further information:

Maik Walpuski, Osnabrück




Measuring competences in the area of content knowledge

The German educational standards define students' learning outcomes in chemistry for 10th grade (KMK, 2005) without describing compulsory topics. A content analysis of chemistry curricula shows, however, that the curricula of the different states overlap only slightly. In order to deal with this deficiency, two types of standard-based test items were developed. The first type (task type A) provides content information in the item stem, which students have to use to work on the tasks in a multiple-choice/single-select format. The second type (task type B) does not include additional content information. The items and options of both task types are therefore the same, and the only difference is the item stem.

The model fit of the developed test items is investigated with Rasch analysis (Boone & Scantlebury, 2006) on the assumption that item difficulty increases with increasing content complexity and with an increasing level of the cognitive processes demanded. The construction guideline takes both aspects into account for each item.
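The assumption behind the Rasch analysis can be illustrated with a minimal sketch of the dichotomous Rasch model, in which the probability of a correct answer depends only on the difference between person ability and item difficulty. The function name and all numerical values below are invented for illustration and are not taken from the study.

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical item difficulties (in logits), one per level of complexity.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
ability = 0.0  # hypothetical person ability in logits

probabilities = [rasch_probability(ability, b) for b in difficulties]
for level, (b, p) in enumerate(zip(difficulties, probabilities), start=1):
    print(f"level {level}: difficulty {b:+.1f} -> P(correct) = {p:.2f}")
```

If item difficulty rises with complexity as assumed, the solution probability for a given student falls from level to level.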

Furthermore, the variables chemical content knowledge, last grade received in chemistry, cognitive skills, reading comprehension and last grade received in German were assessed with paper-and-pencil tests to validate and discriminate the results.

About 1,300 10th-grade students participated in the assessment. Owing to a multi-matrix design, each of the 160 test items was answered by at least 170 students. Most characteristic values of the Rasch analysis (weighted mean square values and t-values) meet the usual quality criteria (Wilson, 2005). The analysis of item difficulty focuses on the adapted dimension "content complexity" (Kauertz, 2008). First, the two task types were considered separately and the differences between the mean values of the levels of complexity were computed with contrasts. The item difficulty of task type A increases over all five levels of complexity. For task type B it increases too, but the differences between the levels are smaller. Afterwards, the differences of the mean values between the task types were examined on every level of complexity. The differences between the task types decrease with increasing complexity.
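How a multi-matrix design spreads a large item pool over a sample can be sketched as follows. Only the 160 items and the roughly 1,300 students come from the description above. The block layout (16 blocks, three blocks per booklet) and the rotating assignment of students are assumptions made for illustration and do not describe the booklets actually used.

```python
from collections import Counter

def build_booklets(n_items: int, n_blocks: int, blocks_per_booklet: int) -> list[list[int]]:
    """Split items into equal blocks and chain adjacent blocks cyclically into booklets."""
    if n_items % n_blocks != 0:
        raise ValueError("n_items must be divisible by n_blocks")
    block_size = n_items // n_blocks
    blocks = [list(range(i * block_size, (i + 1) * block_size)) for i in range(n_blocks)]
    booklets = []
    for start in range(n_blocks):
        booklet = []
        for offset in range(blocks_per_booklet):
            booklet.extend(blocks[(start + offset) % n_blocks])
        booklets.append(booklet)
    return booklets

# Hypothetical layout: 160 items in 16 blocks of 10, three blocks per booklet.
booklets = build_booklets(n_items=160, n_blocks=16, blocks_per_booklet=3)

# Hand out booklets to students in rotation and count the responses per item.
n_students = 1300
responses_per_item = Counter()
for student in range(n_students):
    for item in booklets[student % len(booklets)]:
        responses_per_item[item] += 1

print(len(booklets), "booklets,", len(booklets[0]), "items each")
print("fewest responses to any item:", min(responses_per_item.values()))
```

Because neighbouring booklets share blocks, all items are linked to each other, which is what allows them to be placed on a common scale even though no student sees every item.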

Further analyses of the item data, e.g. differential item functioning and distractor analysis, were performed to ensure their quality. For the person data, the influence of the accompanying variables on test achievement was measured. The results show that the application of content knowledge correlates strongly with content knowledge and cognitive skills. The correlations with reading comprehension are also high, but weaker.

Funded by a grant of the Research Training Group nwu-essen (DFG)


Further information:

Mathias Ropohl
Collaborative Project with Maik Walpuski, Osnabrück




Development of a "NAW-Test" Measuring Scientific Inquiry

This study was the first step to develop tests in the field of scientific inquiry.

Scientific procedures are tested by a newly developed test called NAW-Test (Naturwissenschaftlicher-ArbeitsWeisen-Test). This multiple-choice test for the 7th grade concerns three important steps of scientific working: idea, experiment and conclusion.

In practice the students first receive a short text with some subject-specific information. This information is provided to make sure that only the procedural knowledge underlying scientific working is tested. It is followed by the description of a scientific problem which has to be solved.

In the first task, for example, the students have to decide whether a given idea regarding a described problem is appropriate or not. Another task is to decide whether an idea is testable or not, regardless of whether the idea is well founded. Additionally, there are items in which students have to identify the correct order of the steps necessary to carry out a certain experiment. Furthermore, the students have to decide whether a given generalized statement can be deduced from the described experiment. The "NAW-Test" shows a reliability of α = .81 and a regularly distributed item difficulty of .24 < p < .89. Although the NAW-Test addresses three different aspects of scientific working, a factor analysis shows that the test measures only one construct. Up to 64 % of the variance is explained by one factor, while the two other factors that were found appear to be negligible. This means that the three steps of scientific working described above can be seen as one homogeneous construct.
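The two test statistics reported here, the reliability α and the item difficulty p, can be computed from a matrix of scored answers as in the following sketch. It assumes that α refers to Cronbach's alpha and that p is the proportion of students who solved an item. The response data are invented and the function names are illustrative.

```python
from statistics import pvariance

def item_difficulties(scores: list[list[int]]) -> list[float]:
    """Proportion of students who solved each item (classical item difficulty p)."""
    n_students = len(scores)
    n_items = len(scores[0])
    return [sum(row[j] for row in scores) / n_students for j in range(n_items)]

def cronbach_alpha(scores: list[list[int]]) -> float:
    """Cronbach's alpha from a students x items matrix of 0/1 scores."""
    n_items = len(scores[0])
    item_variances = [pvariance([row[j] for row in scores]) for j in range(n_items)]
    total_variance = pvariance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)

# Invented answers of six students to four items (1 = correct, 0 = incorrect).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

print("item difficulties:", [round(p, 2) for p in item_difficulties(scores)])
print("Cronbach's alpha:", round(cronbach_alpha(scores), 2))
```

A higher p means an easier item, so a range such as .24 < p < .89 covers both difficult and easy items.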


In the first studies in which the NAW-Test was used, an interesting difference between girls and boys occurred. A comparison of the mean scores of both groups shows that girls achieve higher scores than boys at the beginning of year 7 (first year of chemistry education). This small to medium effect (eta = .152, F(1; 569) = 13.124) is highly significant. After another year of schooling the boys narrow the gap, but the girls still score higher than the boys at the end of the 7th grade.

Funded by a grant of the Research Training Group nwu-essen (DFG)


Further information

Silke Klos




Measuring competences in the area of inquiry

With the start of the school year 2008/2009, a new curriculum in primary science education came into effect in North Rhine-Westphalia. Here, too, one area of competence is the acquirement of knowledge, which describes aspects of inquiry competence that should be mastered by the end of primary school. Therefore, a student achievement test (paper-and-pencil test) was to be developed that would be adequate to measure differences in performance between students from different types of German secondary schools: Hauptschule (comparable to lower-secondary modern school) and Gymnasium (comparable to grammar school). This requires items belonging to different performance or difficulty levels. Accordingly, the ESNaS model of competence (Walpuski, Kampa, Kauertz & Wellnitz, 2008), which defines different performance levels, was used in an adapted version for the item development. The factor of complexity was expected to have the greatest influence on item difficulty. In line with the results of PISA 2003, differences in students' achievement attributable to the type of school were expected as well.

In total, 140 multiple-choice items were developed with regard to the different areas defined by the adapted model of competence. In the main study the item pool was tested with a total sample of 1,134 5th-grade students from grammar school (Gymnasium) and lower secondary modern school (Hauptschule). Because of the number of tasks, a multi-matrix design was used to spread the items over 18 different test booklets of 18 items each, with every item appearing in a minimum of four different test booklets. The overlap of items resulting from this approach allows IRT methods to be used for the task analysis (Bond & Fox, 2001), so that item and person parameters can be compared. The following variables were controlled: reading comprehension by speed-test (Lenhard & Schneider, 2006) and cognitive skills by subscales (Heller & Perleth, 2000).

The Rasch model could be confirmed for the total sample and for the partial sample "Gymnasiasten". Overall, the data indicate that the test is slightly too difficult for the total sample and slightly too easy for the "Gymnasiasten". In contrast, the results for the "Hauptschüler" clearly show that the test is too difficult for this sample. Only 53 of the 140 items are located within the range of these participants' abilities. Most of these items belong to the lowest level of complexity. Nevertheless, the test instrument is adequate to measure differences in students' performance within the partial sample of German Gymnasiasten and between students from both types of school (Hauptschule and Gymnasium). In addition, all factors (complexity, cognitive processes, procedural knowledge) influence the difficulty of an item, with task complexity showing the greatest influence. To answer the second research question (Analogous to PISA, can significant differences between students from Gymnasien and Hauptschulen be detected?), correlations were calculated for the mean person ability across the two different types of school. The difference between the mean person ability of students from Gymnasium (0.469) and the mean person ability of students from Hauptschule (-0.915) is highly significant.
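To give a feeling for the size of this difference, the following sketch converts the two reported means into an odds ratio. It assumes that the person abilities are expressed in logits, as is usual for Rasch analyses, and it compares two hypothetical students located exactly at the two group means. The item difficulties in the loop are invented.

```python
import math

mean_gymnasium = 0.469     # mean person ability reported above (assumed to be logits)
mean_hauptschule = -0.915  # mean person ability reported above (assumed to be logits)

logit_difference = mean_gymnasium - mean_hauptschule
odds_ratio = math.exp(logit_difference)
print(f"difference: {logit_difference:.3f} logits, odds ratio: {odds_ratio:.2f}")

# Under the Rasch model the odds of solving an item are exp(ability - difficulty),
# so the ratio of the two students' odds does not depend on the item.
for difficulty in (-1.0, 0.0, 1.0):  # hypothetical item difficulties
    odds_g = math.exp(mean_gymnasium - difficulty)
    odds_h = math.exp(mean_hauptschule - difficulty)
    print(f"item difficulty {difficulty:+.1f}: odds ratio {odds_g / odds_h:.2f}")
```

Under these assumptions, a student at the Gymnasium mean has roughly four times the odds of solving any given item compared with a student at the Hauptschule mean.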

Funded by a grant of the Research Training Group nwu-essen (DFG)


Further information

Susanne Mannel
Collaborative Project with Maik Walpuski, Osnabrück




How to Measure Experimental Abilities: Hands-on- or Paper-Pencil-Based?

Amongst the core abilities considered essential in the sciences, experimentation is probably the most characteristic one. Teachers are therefore challenged to implement approaches to experimentation in their science courses that stress its processes rather than its mere outputs, especially as science education research has repeatedly pointed out the inadequacy of imitatory experimentation for advancing science understanding (cf. Schauble, Klopfer, & Raghavan, 1991; Hofstein & Lunetta, 2004). Re-orientating experimentation from 'learning of science' towards 'learning to do science' (cf. Hodson, 1996) implies that science educators scaffold the processes for their students (Lunetta, Hofstein, & Clough, 2007). They can refer to a multitude of scaffolding structures from science education research (e.g. Mayer, 2007; Kipnis & Hofstein, 2008), most of which can ultimately be matched with the SDDS approach (Scientific Discovery as Dual Search: Klahr, 2000). Here, experimentation is constituted by three fundamental elements: (1) searching a mental space of hypotheses, (2) searching a mental space of experiments, and (3) concluding from experimental evidence. This has been translated into a paradigm suitable for schooling purposes: finding ideas/hypotheses - planning and conducting experiments - concluding (cf. Klos et al., 2008).

When assessing experimentation, one has to take into account that at least two distinct sets of skills and abilities come into play in the domain: (1) manual skills enabling proficient handling of apparatus, and (2) cognitive abilities guiding structured experimentation. Presumably with regard to the first set, assessments of experimentation other than hands-on have traditionally been called into question (cf. Garden, 1999). The study presented here assumes that the cognitive abilities guiding goal-directed experimentation can be assessed in different modes and aims to investigate this idea in more detail.

10- to 12-year-old students from the entrance stage of secondary schools (N ≈ 200), who have been taught in integrated science classes for about twelve months, worked on (1) three open hands-on experiments, which they (2) documented in structured report sheets, and (3) filled in three paper-pencil tests assessing the aforementioned experimentation paradigm. While doing the experiments in pairs, students were videotaped. The videos will be analysed for adherence to the paradigm (Walpuski & Sumfleth, 2009). The experiments address three different topics to account for possible advantages that single students might have due to special interests. The paper-pencil tests address the same topics (e.g. separation by evaporation) but use different examples (hands-on: desalination of salt water; paper-pencil: determining the sugar content in sweetened tea). In order to control for sequence effects, the administration of the three experiments is rotated partially through a Latin square, and the order of administration of the paper-pencil and hands-on assessments is inverted for half of the sample. As the three measures build on the same theoretical basis, their results are expected to correlate highly with each other.
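The counterbalancing described above can be sketched as follows. The example builds a complete cyclic Latin square for three experiments and crosses it with the two possible orders of the assessment modes. The study itself reports only a partial rotation, so the six groups below are an illustrative assumption, and the experiment labels are placeholders.

```python
from itertools import product

def latin_square(conditions: list[str]) -> list[list[str]]:
    """Cyclic Latin square: each condition occurs once per row and once per position."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]

experiments = ["experiment 1", "experiment 2", "experiment 3"]  # placeholder labels
mode_orders = [("hands-on", "paper-pencil"), ("paper-pencil", "hands-on")]

# Cross the three experiment sequences with the two mode orders: six hypothetical groups.
schedule = list(product(latin_square(experiments), mode_orders))
for group, (sequence, modes) in enumerate(schedule, start=1):
    print(f"group {group}: {' -> '.join(modes)}; experiments: {', '.join(sequence)}")
```

Across the groups every experiment appears equally often in first, second and third position, so an effect of position can be separated from an effect of the experiment itself.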

Results from a pilot study show that the employed paper-pencil tests are satisfactorily reliable. The implemented experiments are suitable for the age group, as determined through face validity. The data collection is completed by January 2010. The analysis of videos and report sheets is ongoing.

Funded by a grant of the Schering Foundation


Further information

Markus Emden

Instructional Materials for Science Courses (Grades 5 and 6)




Measuring competences in the area of evaluation and judgement
(This study is mainly conducted by Maik Walpuski, Osnabrück University)

Following the national educational standards, the evaluation and judgement competence in chemistry is defined as the ability to detect and evaluate chemical topics in different contexts (KMK, 2005a). According to Poschmann, Riebenstahl, & Schmidt-Kallert (1998), an evaluation process consists of a minimum of three elements (object, subject, connection between object and subject). During this process criteria have to be identified, evaluation strategies have to be used, and finally a decision has to be made (Jungermann, Pfister, & Fischer, 2005). Furthermore, the evaluation and judgement competence is defined as the ability to decide between different options of behaviour in complex problem situations in a justified and systematic way (e.g. Bögeholz, Hößle, Langlet, Sander, & Schlüter, 2004). Based on these theoretical assumptions, Eggert & Bögeholz (2006) have developed a competence model with a focus on sustainable development in biology to measure the evaluation and judgement competence (e.g. Eggert & Bögeholz, 2006; Bögeholz et al., 2004). Because of its focus on sustainable development, it cannot be directly adapted for other subjects or topics.

In this project the ESNaS competence model is to be adapted for the evaluation and judgement competence and used for the item development. However, this model does not consider external aspects which may influence the evaluation and judgement competence, e.g.:

  • Subject-related aspects, which include content knowledge and application of content knowledge, related to the topics of the items.
  • Interdisciplinary aspects, which include knowledge and application of evaluation strategies and estimation of data quality.
  • Personal aspects, which include individual attitudes and social desirability.

Therefore the main research questions are:

  • To what extent is students' evaluation and judgement competence influenced by external aspects?
  • Is it possible to identify a chemistry-specific evaluation and judgement competence which can be separated from that in other subjects (e.g. biology)?

The sample includes 700 students from 9th and 10th grade of German upper secondary schools (Gymnasium). Completing the tests and questionnaires takes 180 minutes in total, and data are collected on two days. The instruments used on the first day (90 minutes) are:

  • A newly developed test of evaluation and judgement in chemical/biological contexts to measure the evaluation and judgement competence in biology and in chemistry
  • A test on content knowledge and on application of content knowledge in biology and chemistry
  • A test on social desirability - German version of the balanced inventory of desirable responding (Paulhus, 1998; Musch, Brockhaus, & Bröder, 2002)

The tests of evaluation and judgement in chemical/biological contexts and the tests on content knowledge and on application of content knowledge in both subjects are constructed in a multi-matrix design. As a result, every test item is answered by 140 students.

The following tests are used on the second day (90 minutes):

  • The cognitive ability test (Heller & Perleth, 2000) to control for the influence of intelligence
  • A test on knowledge and on application of evaluation strategies in everyday life contexts
  • A test measuring the students' estimation of data quality
  • A test of individual attitudes towards environmental aspects (Frey, Taskinen, Schütte, Prenzel, Artelt, Baumert, Blum, Hammann, Klieme, & Pekrun, 2009)

Funded by IQB


Further information

Julia Göbel
Collaborative Project with Maik Walpuski, Osnabrück




Measuring competences in the area of communication
(This study is mainly conducted by Maik Walpuski, Osnabrück University)

The project has just started.

Funded by a grant of the Research Training Group nwu-essen (DFG)


Further information

Iwen Kobow
Collaborative Project with Maik Walpuski, Osnabrück