1. Introduction
In 2018, Arnold et al. (2018) described online or digital exam formats as widely used summative assessments in university education. Since then, their use has increased further due to the COVID-19 pandemic (Heidkamp-Kergel & Kergel, 2022). Digital exam formats offer the opportunity of automatic assessment. Automatic assessment is a major advantage for university teachers because it reduces their workload (Arnold et al., 2018), which motivates teachers in engineering design education to develop digital, automatically evaluable exam formats (Breitschuh et al., 2014). In addition to the advantages for teachers, students benefit from faster feedback than in paper-based exams and from the same objective and transparent assessment (Berkemeier et al., 2018).
Nevertheless, teachers and students have major concerns about digital exams. Teachers see limitations in assessing learning outcomes with the existing technical opportunities (Arnold et al., 2018). Students are afraid of getting worse grades due to restricted input options and predetermined evaluation steps. To examine the extent to which these concerns are justified, exam results from paper-based and digital exams need to be compared. While reusing paper-based exam tasks in digital exams is possible in some cases, though often not reasonable, there are tasks in engineering design education that, to date, cannot be used identically online.
Engineering design education in higher education includes technical drawing and dimensioning of machine elements (Kossack, Neumann et al., 2024). Baronio et al. (2019) discuss assessing the competencies required for technical drawing with existing technical solutions, e.g., closed questions such as Multiple-Choice. However, with their approach it is only possible to examine specific elements of tasks at lower taxonomy levels; a direct transformation of paper-based tasks into digital formats is not possible (Violante et al., 2020). Several approaches to new technical solutions for the automatic evaluation of technical drawings or CAD models have been published (e.g., Fastabend et al., 2024; Dillenhöfer, 2023; Hoppe et al., 2021). However, the implementations are still under development and not yet suitable for the automated evaluation of competencies in exams, because their results are not reliable enough without manual correction.
Dimensioning of machine elements includes multi-step calculations and the interpretation of the results in a given application context. To date, the literature rarely addresses how competencies in dimensioning machine elements can be assessed automatically. In mathematics, findings exist about the implementation of multi-step calculations with a plugin for Moodle called STACK (Altieri et al., 2020). In STACK, typical mistakes can be anticipated and assigned specific scores, so that students receive points for their results as in the assessment of paper-based exams. These insights imply that, in terms of multi-step calculations, STACK can be suitable for transforming paper-based exam tasks on dimensioning machine elements into automatically evaluable digital exam tasks. However, the insights do not allow statements on how well the necessary interpretation of the calculation results can be assessed.
This paper focusses on the application potential provided by Moodle with STACK for automated engineering design education assessments. Thus, automated assessments for dimensioning of machine elements are presented while the automatic assessment of technical drawings is omitted. To guide our research, we investigate the following research question:
To what extent do the results of digital and paper-based examination formats differ when the same learning outcomes are assessed with the same tasks about dimensioning machine elements in engineering design education?
To investigate this research question in an initial, experience-based way, Section 2 describes the transformation of existing paper-based exam tasks from an engineering design education course at Ruhr-University Bochum, Germany, into automatically evaluable digital exam tasks. We implemented the digital exam tasks in the learning management system Moodle. For the transformation of the exam tasks, the section describes functions of Moodle in terms of assessment possibilities and general strategies for the transformation of paper-based exams into digital exams. The functions and strategies are used to transform the paper-based exam tasks of the course Engineering Design B (“Konstruktionstechnik B”) into digital exam tasks addressing the same learning outcomes. Section 3 presents the data collection and analysis methods we used as well as the differences in the exam results between the paper-based and the transformed digital exams. It compares the course examination results of the digital exam format from summer term 2024 with those of the paper-based format from summer term 2023. Section 4 discusses the results, their limitations and the recommended actions derived from them.
2. Digital exam tasks for calculations in engineering design education
To initiate the comparison between paper-based and digital exams, we prepared the exam tasks presented in this section. At Ruhr-University Bochum, the learning management system Moodle, which is widely used in engineering design education in Germany (Kossack & Bender, 2024), is available for the implementation of digital exams. Moodle can be extended with a plugin called STACK, which builds on the computer algebra system Maxima (STACK Contributors, 2024).
2.1. Activity Quiz including the STACK Plugin in Moodle
Learning management systems like Moodle offer different types of questions with a reliable automatic evaluation (Moodle Contributors, 2024). The question types are divided into closed and open questions (Middendorf, 2022). In closed questions, learners choose from given answers or assign elements of different lists to each other (Mayer & Hertnagel, 2009). A typical example is the question type Multiple-Choice. In open questions, learners enter an answer freely. In automated assessments, the student answer is compared with teacher answers. Frequent answer types are numbers or single words (Mayer & Hertnagel, 2009). In open questions, it is crucial to foresee or prevent input errors such as typos, so that the content of the input is assessed rather than its spelling.
Moodle offers a format for digital exams called Quiz. This activity includes different types of open and closed questions with automatic evaluation. Here are some question types we used for the implementation described in this paper:
The type Multiple-Choice is available in different versions with various grading schemes. One variant is called Single-Choice and allows only one answer, while Multiple-Choice allows students to select more than one answer. Answers are often graded with full points for choosing all correct answers and partial points for too few or too many chosen answers. Another variation of this question type is All or Nothing. Learners receive 100% of the points for choosing all options correctly and 0% if one or more answers are wrongly selected or left unselected.
Moodle offers several question types for entering numbers. Numerical is for numbers with limited digits. STACK allows numbers or mathematical equations. It can generate variants of a question when provided with intervals for input values instead of discrete values. Furthermore, it allows several inputs and any number of response evaluation trees for multi-step calculations and for identifying subsequent errors. In addition to numerical or algebraic inputs, it is possible to let students enter booleans (true or false) or select an item from a list. Depending on the input type, a comparison value is defined and a permitted deviation is specified. Figure 1 shows one STACK task with its inputs, the validation and the resulting points.
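The difference between these grading variants can be illustrated with a short Python sketch. The function names and the specific partial-credit rule are illustrative assumptions and do not reproduce Moodle's actual implementation.

```python
def grade_all_or_nothing(selected, correct):
    """Full credit only if the selection matches the correct set exactly."""
    return 1.0 if set(selected) == set(correct) else 0.0


def grade_partial(selected, correct):
    """Hypothetical partial-credit rule (an assumption, not Moodle's own):
    each correct pick adds 1/len(correct), each wrong pick subtracts the
    same share, and the score never drops below zero."""
    selected, correct = set(selected), set(correct)
    share = 1.0 / len(correct)
    hits = len(selected & correct)      # correctly chosen options
    misses = len(selected - correct)    # wrongly chosen options
    return max(0.0, (hits - misses) * share)


# Two of four options ("a" and "b") are correct in this made-up question.
correct_options = {"a", "b"}
print(grade_all_or_nothing({"a"}, correct_options))  # too few answers
print(grade_partial({"a"}, correct_options))         # too few answers
```

Under All or Nothing, a student who selects only one of the two correct options receives no points, whereas the assumed partial-credit rule still rewards the correct selection.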

Figure 1. Question type STACK in a Moodle activity quiz
The validation of input 2 is shown in detail. The first node compares the input with the correct answer. If they match, the students receive the full points for this input field. The students also receive the full points if a subsequent mistake is identified. For this check, the system calculates a correct answer based on the preceding inputs. If no subsequent mistake is identified, different anticipated mistakes are checked and graded with partial points. The students receive zero points if none of the predefined solution options is detected.
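The order of these checks can be sketched in a few lines of Python. This is a minimal illustration and not STACK code: the function names, the relative tolerance parameter and the example values are assumptions made for this sketch.

```python
def is_close(value, target, rel_tol=0.05):
    """True if value lies within a relative tolerance of target."""
    return abs(value - target) <= rel_tol * abs(target)


def grade_input(answer, correct, follow_through, anticipated):
    """Walk through the nodes in the order described above.

    answer         -- the student's input for this field
    correct        -- the teacher's model answer
    follow_through -- value recomputed from the student's earlier inputs
    anticipated    -- list of (wrong_value, partial_score) pairs
    """
    if is_close(answer, correct):          # node: correct answer
        return 1.0
    if is_close(answer, follow_through):   # node: subsequent mistake
        return 1.0
    for wrong_value, partial_score in anticipated:
        if is_close(answer, wrong_value):  # nodes: anticipated mistakes
            return partial_score
    return 0.0                             # no predefined option matched


# Made-up two-step task: step 2 doubles the result of step 1.
model_step1, student_step1 = 12.0, 10.0    # the student erred in step 1
score = grade_input(
    answer=20.0,                           # student's step 2 input
    correct=2 * model_step1,
    follow_through=2 * student_step1,
    anticipated=[(model_step1, 0.5)],      # e.g. forgetting to double
)
print(score)
```

In this example the student's second input is wrong in absolute terms but consistent with the earlier wrong input, so the follow-through node awards full points for the second step.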
2.2. Transformation of paper-based exam tasks into automatic evaluable digital exam tasks
Intended Learning Outcomes are key in the development of assessment tasks (Biggs & Tang, 2011). The taxonomy based on Bloom and developed further by Krathwohl is widely used in higher education and categorizes cognitive learning outcomes on six levels: remember (1), understand (2), apply (3), analyze (4), evaluate (5) and create (6) (Krathwohl, 2002). Each level has typical verbs for the wording of the intended learning outcomes, e.g., enumerate on the first level of remembering or describe on the second level of understanding. The literature relates learning outcomes on different levels to suitable assessment forms and typical task and question types.
In general, closed question types are more suitable for assessing learning outcomes on lower taxonomy levels, whereas open question types are needed for assessing learning outcomes on higher taxonomy levels (Stieler, 2011). Different strategies support the transformation of exam tasks and help to break up this general assignment of question types to levels of learning outcomes. One strategy is the application of use cases in Multiple-Choice questions. In general, this question type is used for assessing learning outcomes on lower taxonomy levels according to Bloom, e.g., choosing the correct function of a machine element or choosing relevant influencing factors of the permissible equivalent stresses of a shaft. The application of use cases allows assessing higher taxonomy levels with Multiple-Choice (Mayer & Hertnagel, 2009), e.g., choosing a suitable shaft-hub connection for a described use case.
Another strategy is splitting long, complex exam tasks that assess learning outcomes on high taxonomy levels into several small exam tasks that assess learning outcomes on different taxonomy levels. For the development of the tasks, a learning outcome is located in the taxonomy, and the learning outcomes on other levels that are required for this higher learning outcome are identified. Then at least one task is defined for every identified learning outcome, and typical question types can be used: Multiple-Choice for lower levels and numeric inputs for higher levels (Violante et al., 2020). In addition to extending the opportunities for digitalizing exam tasks, this strategy has advantages for students in the field of engineering design education.
A typical exam task is the two-dimensional representation of a component, including information about dimensions and tolerances, based on a given 3D model. The literature criticizes the dependency between the different steps. If students did not understand the morphology of the component, they have no chance of dimensioning or tolerancing it correctly. Splitting a large task into small, autonomous tasks for each learning outcome reduces the preconditions and focuses on the individual learning outcomes (Baronio et al., 2019).
The transformation of the paper-based exam tasks for Engineering Design B at Ruhr-University Bochum into digital, automatically evaluable tasks in Moodle is based on these strategies to a greater extent for some tasks and to a lesser extent for others. Students in engineering design education at Ruhr-University Bochum are inexperienced with digital exams and do not know the STACK input syntax for mathematical equations. Therefore, the exam tasks require inputs with a low potential for syntax mistakes. In this case, that means numbers only and no algebraic equations. In addition, the task definitions must specify the physical units, e.g., millimeters or newtons, for the required inputs.
2.3. Typical paper-based tasks in engineering design education
A typical exam for the course Engineering Design B at Ruhr-University Bochum could consist of four dimensioning tasks and one technical drawing in a total time of three hours. Students work on the paper-based tasks on-site at the university. The individual exam tasks focus each on one machine element and consist of different task parts.
The first task involves the strength verification of a gear shaft according to the German standard DIN 743. The paper-based task consists of three subtasks. The first subtask asks for the calculation of the approximate shaft diameter. The second subtask requests the nominal, mean, amplitude and maximum stresses for three load types. The third subtask consists of the calculation of the safety, taking the corrected stresses into account. Figure 2 illustrates the task. It provides information about the application context together with the data required for the calculation: loads, material properties, influencing factors and notch factors.

Figure 2. Simplified representation of the paper-based exam task about a strength verification
A paper-based exam task about bearings covers the dimensioning of a locating and a non-locating bearing and consists of two subtasks. The first requires choosing a suitable cylindrical roller bearing as a non-locating bearing for a described context and verifying that the chosen bearing is the smallest possible one. This means that students work with a bearing catalogue and carry out between two and six calculations. The second subtask is similar but concerns the locating bearing. In a task about feather keys, students choose a suitable feather key according to the standards and verify whether the safety for this choice is high enough. Another task about bolts asks for the calculation of the safety of certain bolts in a use context. The calculation includes determining the elasticity of the bolts from the individual elasticities. The tasks show that the answers for the calculation of machine elements include mathematical formulas, values with their measurement units and written text for the chosen bearing or the evaluation. In addition to the dimensioning of four different machine elements, there is one task about designing a shaft with bearings and a hub, including a suitable shaft-hub connection. The task consists of a text describing the use case and the obligatory machine elements and measures for the design. The use context defines the requirements of the design, and the available machine elements restrict the possible designs. The solution format for this task is a technical drawing by hand.
2.4. Digital automatic evaluable tasks for engineering design education
The developed digital tasks aim to assess the same intended learning outcomes as the paper-based tasks. The transformation requires only numerical inputs to prevent input errors. As a result, the inputs predefine the calculation path and need to include units. All anticipated mistakes must be calculated and compared with the input in order to assess them in a similar way as in the paper-based exam, where the equations and the results are checked during assessment.
Depending on the task, especially if a selection is required, the transformation of the paper-based task into a digital version is difficult, because the different alternative solutions need to be assessed with all possible predefined mistakes. Therefore, some tasks are divided into individual, independent tasks for the different intended learning outcomes, e.g., the selection and the calculation are divided into two different tasks. The transformation of the existing paper-based exam tasks therefore varies. Two paper-based exam tasks are divided into several small, independent tasks according to the splitting strategy, because the different selection options are complicated to implement for automatic assessment, e.g., choosing a bearing from a catalogue containing thousands of bearings and calculating that single bearing.
The number of inputs and of different questions in every task varies, because we implemented the two other paper-based exam tasks each as a single task with several inputs and without selection options. The first task involves the strength verification of a gear shaft according to DIN 743. Figure 3 presents the digital exam task, which Figure 2 presents in the paper-based format.

Figure 3. Simplified representation of the digital exam task about a strength verification
The digital task includes 17 numeric inputs expressed in predefined physical units and one selection list. Eighteen decision trees with a total of 113 nodes assess the different inputs. The number of nodes and the mistakes or information checked vary between inputs, with up to nine nodes for one input. The first input, the calculation of the approximate shaft diameter, has eight nodes. The first node checks whether the input is empty. This is technically necessary so that students can receive partial points for the full task if they only filled out other input boxes. The second node compares the input with the solution. If they match, students receive full points. The third and fourth nodes compare the input with anticipated typical unit mistakes. The fifth to eighth nodes compare the input with other typical calculation mistakes. If one of these six nodes is true, students receive 60-80% of the full points. All inputs are checked with a tolerance of 5%. No subsequent mistake is considered for this input, because no earlier input is used in its calculation. The validation of the selection list has four nodes. The first node checks whether an input is given. The second node checks whether the students filled out the preceding inputs that are relevant for evaluating the safety against yielding. This prevents awarding points for guessing. Then the students’ input is compared with the correct answer and, in another node, with a subsequent mistake calculated from the students’ earlier inputs. This task is an example of a transformation from the paper-based into the digital form that required only few adjustments to fit the technical opportunities.
The digital exam task about bearings consists of three separate tasks. In the first, a Multiple-Choice question, students choose which specific bearings are suitable for a described context. Figure 4 illustrates the question and explains the wrong answer options (distractors). The learning outcome addressed by the Multiple-Choice question is on the taxonomy level of evaluate. Students select suitable bearings for a described use case. To answer, they work with a bearing catalogue and must remember and apply relevant criteria for choosing a bearing.
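The node sequence for the first input can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the solution value of 40 mm, the assumed unit mistakes (centimeters or meters instead of millimeters) and the split of the 60-80% partial credit between the mistake types are hypothetical and not taken from the exam.

```python
def is_close(value, target, rel_tol=0.05):
    """True if value lies within the 5% tolerance around target."""
    return abs(value - target) <= rel_tol * abs(target)


def grade_first_input(raw, solution_mm, calc_mistakes_mm):
    """Return (score, matched node) for the shaft diameter input.

    raw is the text typed by the student ('' if the field was left
    empty); it is assumed to be a validated number otherwise.
    """
    # Node 1: an empty field scores zero but does not block other inputs.
    if raw.strip() == "":
        return 0.0, "empty"
    value = float(raw)
    # Node 2: comparison with the solution.
    if is_close(value, solution_mm):
        return 1.0, "correct"
    # Nodes 3-4: assumed unit mistakes (cm or m instead of mm).
    for factor, label in ((0.1, "unit: cm"), (0.001, "unit: m")):
        if is_close(value, solution_mm * factor):
            return 0.8, label   # assumed upper end of the 60-80% range
    # Nodes 5-8: other anticipated calculation mistakes.
    for wrong in calc_mistakes_mm:
        if is_close(value, wrong):
            return 0.6, "calculation mistake"  # assumed lower end
    return 0.0, "no match"


# Hypothetical solution of 40 mm with one anticipated wrong result.
print(grade_first_input("4.0", 40.0, [50.0]))
```

Representing the nodes as an ordered sequence of checks makes the priority explicit: an input is only compared with anticipated mistakes once it has failed the comparison with the solution.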

Figure 4. Multiple-Choice task about choosing a suitable bearing
The other two questions about bearings are of type STACK. The first question has four numeric inputs and concerns the recalculation of a single-row deep groove ball bearing with given loads. The second question consists of the calculation of the dynamic load and the load rating of a specifically described bearing type and has four inputs as well.
The two exam tasks detailed above demonstrate the range of transformations, from as similar as technically possible to divided into subtasks. The other two tasks about machine elements are not detailed here, but the digital tasks are summarized as follows. The task about a feather key consists of five Multiple-Choice questions, which address learning outcomes on the levels of remember and understand concerning the function and the geometry of a feather key as a standard element. These learning outcomes on low taxonomy levels are prerequisites for the calculation. The paper-based exam format assesses them implicitly in the task. In addition to the Multiple-Choice questions, the digital format includes one large STACK question for the calculation with 28 inputs and three selection lists for the interpretation of the results. The task about bolts consists of four subtasks. The resilience of one bolt connection is assessed by a STACK question with six inputs. One Multiple-Choice question and one short-answer question check additional calculations needed for a full safety calculation. The last subtask comprises the safety calculation but already specifies several essential values such as the resilience. It is of type STACK and includes 12 inputs.
The transformation of paper-based exam tasks into automatically evaluable digital exam tasks in this section demonstrates the opportunity to assess learning outcomes on higher taxonomy levels according to Bloom with existing automatically evaluable question types. For the digital assessment tasks, students need to analyze the use cases to decide how to proceed, apply equations and approaches for the calculation, and evaluate their results.
3. Comparison of paper-based and digital exam results
We used the exam tasks for assessments of similar student groups. This section presents our method to collect and evaluate the data and the results of applying this method to our data sets.
3.1. Data collection and analysis
To investigate the research question, we compare exam results from the same course, Engineering Design B at Ruhr-University Bochum. The course is assessed with a final summative exam. This summative exam was paper-based in summer term 2023 and digital in summer term 2024. Both exams were written on-site under supervision, and the same limited number of pages for a formula collection was allowed. In addition, students could use the same bearing catalogue for identifying suitable bearings. Participants of the course are enrolled in the degree programs Mechanical Engineering or Sales Engineering and Product Management.
Students in summer term 2024 who participated in the digital exam could use a digital introduction test in the course in the learning management system about two weeks before the exam to get familiar with the different digital task types. They got to know the expected input formats, e.g., the expected number of decimal digits, and could test how different inputs are evaluated. The introduction test also illustrated the evaluation of subsequent errors before the exam date. In addition, all relevant information is included again in an introduction before the tasks in the digital exam itself.
For the comparison, we only consider data from first-year students, because the year of study has a significant impact on the exam results (Kossack, Kattwinkel et al., 2024). The data sets are analyzed with the statistics software SPSS (IBM Corp., 2021). Because the number of tasks per machine element and the points per task vary, we add up all tasks for each topic, i.e., each machine element, and standardize the points as the percentage of points achieved. To identify differences between the two test groups, group 1 with the results of the paper-based exam and group 2 with the results of the digital exam, we first compare mean and standard deviation. In addition, the data are examined for statistically significant differences with the Mann-Whitney-U-Test. This test is suitable for testing the null hypothesis, which assumes that there is no real difference between groups that differ in one characteristic, in this case the exam format (Rasch, 2021). The rejection range is defined by the significance level, which is usually set at 5% (Hollenberg, 2016; Moosbrugger & Kelava, 2012).
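The two analysis steps, standardizing the points per topic and computing the test statistic, can be sketched in plain Python. The analysis in this paper was carried out with SPSS; the functions below are an illustrative re-implementation under the usual textbook definition of the U statistic, and all names and example scores are assumptions.

```python
def to_percent(points_by_task, max_by_task):
    """Sum all tasks of one topic and express the total in percent."""
    return 100.0 * sum(points_by_task) / sum(max_by_task)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(group1, group2):
    """Return the smaller of the two U statistics of both groups."""
    ranks = average_ranks(list(group1) + list(group2))
    n1, n2 = len(group1), len(group2)
    rank_sum1 = sum(ranks[:n1])
    u1 = rank_sum1 - n1 * (n1 + 1) / 2.0
    u2 = n1 * n2 - u1
    return min(u1, u2)


# Made-up percentage scores for one topic in both exam formats.
paper = [to_percent([3, 4.5], [5, 10]), 62.0, 80.0]
digital = [55.0, 71.0, 90.0]
print(mann_whitney_u(paper, digital))
```

Because the test only uses the ranks of the standardized scores, it does not require the percentages to be normally distributed.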
3.2. Results
Figure 5 shows the scores for the four tasks that were implemented online and evaluated automatically in the most recent exam. In the task on shafts, the students who took the paper-based exam achieved a mean that is 3.7% higher. The standard deviation is the same at 37%. The task on shaft-hub connections, in this case about feather keys, shows a difference in the mean of 7.8%. Here, the students who took the digital exam scored higher. The standard deviation is slightly smaller in the digital format, at 23%, than in the paper-based format, at 25%. In the task about bearings, the group with the paper-based exam reached better results, with a mean that is 8.1% higher. There is hardly any difference in the standard deviation, with 30% for the paper-based exam format and 31% for the digital exam format. The mean values for the task about bolts differ by 0.3%, and the standard deviation is 0.4% higher in the group that took the paper-based format.

Figure 5. Comparison of scored points in tasks about four different machine elements: mean and standard deviation of the two test groups
Table 1 compares the two test groups to analyze whether there are any statistically significant differences between them. The Mann-Whitney-U-Test is suitable because the test groups are independent samples with similar standard deviations (compare Rasch, 2021). All p-values are larger than the significance level of 0.05, i.e., there are no statistically significant differences between the test groups in any of the tasks.
Table 1. Mann-Whitney-U-Test comparing the test groups paper-based exam results (n1) and digital exam results (n2)

The format of the fifth exam task did not change. Students draw their results by hand for a design task comprising a shaft with suitable bearings and shaft-hub connections. The comparison of the two test groups in this task shows that the students in summer term 2024 performed slightly better, with a mean of 54%, than the students in summer term 2023, with a mean of 48%. This difference is not statistically significant, with U(n1=66, n2=63) = 1787.5, z = -1.374 and p = 0.170. Variations of up to 8% of the points are usual between paper-based exams in different years.
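The relation between the reported U, z and p values can be checked with the normal approximation of the Mann-Whitney-U-Test. The following Python sketch is an assumption about how such values are typically derived, not a reproduction of the SPSS procedure; in particular, it omits the correction for tied scores, so small deviations in the last decimal place can arise.

```python
import math


def u_to_z_and_p(u, n1, n2):
    """Normal approximation of the Mann-Whitney U test (no tie correction)."""
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    # Two-sided p-value of a standard normal variable.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided


z, p = u_to_z_and_p(1787.5, 66, 63)
print(round(z, 3), round(p, 3))
```

The resulting values lie close to the reported z and p, which supports reading the reported p-value as two-sided.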
4. Discussion
The results show no significant differences between the results of the paper-based exam format and the digital exam format in the four tasks. The data do not support students’ concerns that digital exam formats lead to worse grades than paper-based formats. Nor do they support teachers’ concerns that digital exam formats, especially with automatically evaluable and closed question types, are easier for the students. From our data, we conclude that digital automated assessment is possible for dimensioning tasks in engineering design education without loss of assessment quality.
The transformation of the paper-based exam tasks into digital, automatically evaluable ones that was needed for this study enables the derivation of recommendations for the design of digital exam tasks for calculations in engineering design education. The strategy of splitting long, complex tasks with many steps into small tasks simplified the implementation, because fewer subsequent mistakes need to be considered. In addition, with the software used, the splitting is technically required for selection tasks.
Regarding the students’ results, the data hardly show any difference between the varying ways of implementation. The task about feather keys has better results in the digital format, which includes closed question types addressing learning outcomes on low taxonomy levels. So there might be a tendency for closed questions assessing learning outcomes on low taxonomy levels to lead to better results. On the other hand, none of the students got full points for the complex Multiple-Choice question shown in Figure 4, which assesses a learning outcome on a higher taxonomy level. In the task about shafts, which was transformed into one large digital task, students achieved worse results than in the paper-based format. They also received fewer points in the digital task about bearings, which is divided into subtasks, than in the paper-based task.
The results in this study show the general potential of digital exam tasks for calculations in engineering design education and can reduce teachers’ and students’ concerns about this assessment format, but the conclusions are experience-based and limited to one data set.
Further comprehensive investigations are required to gain deeper insights into the exam results of digital exam formats and into which kind of transformation is more suitable for engineering design education or leads to better results: one large task, or a split into several small tasks that use closed questions. The following points are limitations of the presented study and could be considered in further work.
The paper-based tasks and the digital tasks must address the same learning outcomes. To what extent they really assess the same competencies could be examined with an extra comparison task, which needs to be completed by both test groups of students and the data of every individual student need to be analyzed and compared with the actual exam results.
The order of the exam tasks should be varied. In the presented data only about 20% of the participants in the exam worked on the tasks about bolts, while 100% worked on the first task about the strength verification of a shaft.
Analyzing not only the overall percentage of points per topic but also the mistakes made in the paper-based and the digital format, and comparing them, could reveal differences. The test group with the paper-based exam format may more often give no answer because they did not know the next step. In the digital exam format, the next step might be more strongly prescribed by the given inputs, but the students may make mistakes with the prescribed units or the correct decimal entries. The test group with the paper-based format may also get partial points for answers that get no points in the digital format.
Different test groups from different study years with different teachers and from different universities would reduce the impact of certain existing teaching and learning activities on the exam results for the individual topics at one university in a certain study year. However, it should be noted that teaching and learning activities always prepare for the assessment task according to the Constructive Alignment and might need adjustment with changing the assessment format.
5. Conclusion
The data in this paper show no significant differences between assessing competencies in the calculation of machine elements in engineering design education with a paper-based exam format and with a digital exam format. The results from one paper-based exam and one digital exam for a course at Ruhr-University Bochum show no statistically significant differences for four different topics. The paper includes the transformation of a paper-based exam in engineering design education addressing the dimensioning of machine elements into a digital format. This involves the analysis of technical opportunities for the implementation and the analysis of existing tasks to assess the intended learning outcomes of the course. The transformation of the individual tasks ranges from nearly no difference between the tasks of the paper-based exam and the digital exam to splitting the paper-based exam task into several digital subtasks. The data show no correlation between the way of implementation and the results. However, the analysis includes only one dataset with two test groups and the same tasks in every test group. For reliable insights, more results from different test groups with varying tasks for the different topics and varying task orders are necessary.
Beyond the finding that there is no difference between the results of paper-based and digital exam formats, the design and the strategies used for the transformation of paper-based exam tasks into digital, automatically evaluable ones help not only with the design of digital exam tasks but also with the development of digital exercises and self-assessments.
The findings demonstrate the suitability of existing automatically evaluable question types for assessing competencies in engineering design education about dimensioning of machine elements. So, the application of these existing question types for assessing competencies about the calculation of machine elements could be expanded in exams for varying learning outcomes and at different universities.