Examinations are a key component of medical studies and are used as a tool for learning and for assessing whether the students possess the requisite knowledge (1). Since the examinations function as a quality control for further progress in medical studies and for graduation with an authorisation to practise medicine, it is necessary to define a limit for what is deemed satisfactory, or limits for different grades. This is called setting the standard of the examinations.
The examination standard can be defined by using relative or absolute methods. Relative standards are based on a welldefined group and the pass grade is based on total performance in this group. The group’s average point score minus one standard deviation is an example of a relative limit for a pass grade. Absolute standards are based on a predefined limit and are independent of the group’s total performance. Absolute methods are suitable for testing whether the students possess satisfactory competence for a specific purpose, such as for study progression or work as a doctor. The two most common absolute standardsetting methods are those created by Angoff and Ebel. Both methods are based on an assessment of the degree of difficulty of each examination question by an expert panel, and the sum of these estimates is defined as the limit for a pass grade in the examination (2, 3). The passgrade limit will hence vary from one test to another, based on their respective degree of difficulty. The use of expert panels is costly in terms of funding, time spent and organisational resources. The same passgrade limit from one year to the next would constitute a simpler absolute standardsetting method, but this approach would not take the degree of difficulty of the test into account.
Practices at the Norwegian University of Science and Technology (NTNU) mainly include an annual, integrated examination, meaning that all subjects taught in the year in question are tested in the same examination. The limit for a pass grade is predefined as at least 65 % correct answers, hereafter referred to as ‘absolute 65 %’. The failure rate for an examination varies from one year to the next, without any indication of major differences between the student cohorts in the admission requirements (4). Since only a single examination is held each year, the stakes are high and the students need to perform well. Rescheduled examinations (resit examinations) are held in August. If a student fails this examination as well, he or she must retake the entire year of study. A failure will thus entail major social and financial consequences for the student, and financial and organisational consequences for the medical faculty. Since examinations represent both a quality control instrument and a learning tool, and because the consequences of a fail grade are considerable, strict requirements must be upheld when it comes to the quality of the examinations. To have proper credibility, the standardsetting method ought to take the degree of difficulty of the examination into account (5).
Both relative and absolute standardsetting methods have their weaknesses. Absolute methods that involve expert panels are resourceintensive and hard to implement. Relative methods may result in unacceptably low limits for a pass grade. If some students fail to prepare for the examination, this will lower the average performance and thus also the limit to a pass grade. Another problem inherent in many relative methods is that someone will invariably fail, and this may mean that even students with sufficient knowledge will fail if the group as a whole performs strongly. Focusing on these weaknesses, CohenSchotanus and van der Vleuten developed a new method in 2010 (5). They claim that the academically strongest students represent one stable factor in the process of setting the standard. These students have read and understood the syllabus and prioritised their studies, but they will also be affected by the degree of difficulty of the examination. By using the academically strongest students as a reference group, they developed what today is known as Cohen’s method, hereafter referred to as ‘original Cohen’ (5). This method has later been revised by others (6).
We have compared the current 65 % absolute limit to a pass grade with two different Cohen methods (‘original Cohen’ and ‘modified Cohen’). We have investigated how these methods affect the failure rate and the standard deviation of this rate. Our hypothesis was that examinations with a high failure rate had a high degree of difficulty, while examinations with zero or very few failures were easier. We therefore assumed that a Cohen method that takes the degree of difficulty into account would result in fewer failures in difficult examinations and possibly more failures in simpler examinations, thus reducing the standard deviation of the failure rate as a whole.
Material and method
The data set
The data set for this study consists of examination results from the medical study programme at the Norwegian University of Science and Technology (NTNU). A written examination is held every other semester for the first four years, with no examination after the 9th and 10th semesters, but after both the 11th and 12th semesters. The examination consists of two parts; one part with 100–120 multiplechoice questions that have 3–5 response alternatives and a freetext/essay component with 3–5 main topics. The multiplechoice component counts for 60 %, while the freetext/essay component counts for 40 % (7). All ordinary written examinations for the years 2010 through 2015 were assessed, and all were included, except for the examination in the 11th semester of 2010 due to missing data. Table 1 lists the subjects that are tested in the different examinations. On average, there were 111 candidates in each examination and 34 examination sets were included, totalling 3 779 examination papers.
Table 1
Study year 
Semester 
Basic/paraclinical subjects 
Clinical subjects 

1st study year 
1st–2nd semester 
Cell biology 
Doctorpatient course in general practice 
2nd study year 
3rd–4th semester 
Structure and function of the nervous system 
Doctorpatient course 
3rd study year 
5th–6th semester 
Pathology 
Otorhinolaryngology 
7th–8th semester 
Pathology 
Emergency medicine 

6th study year 
11th semester 
General practice medicine 

6th study year 
12th semester 
Summary semester 
Estimation of the original and modified Cohen
Estimation of the ‘original Cohen’ starts from the point score of the students in the 95th percentile and defines the limit for a pass grade as 60 % of this score. In addition, it corrects for the possibility that the students may have guessed the correct answer. The formula for the ‘original Cohen’ is (5): Limit for a pass grade = cN + 60 (N* cN), where c is an estimate of the proportion of correct answers that are attributable to guesswork, N is the maximum score and N* is the score of the 95th percentile. We have corrected for guesswork using the same method that CohenSchotanus used in his study (CohenSchotanus, personal communication, 2016). The proportion of correct answers that are attributable to guesswork (cN) is estimated as follows: cN = (0.33 × A) + (0.25 × B) + (0.20 × C), where A, B and C are the proportions of answers that have three, four and five response alternatives respectively. ‘Original Cohen’ was created with a view to examinations that have only multiplechoice questions. Since the examinations at the Norwegian University of Science and Technology (NTNU) include a multiplechoice as well as a freetext/essay component, we correct for guesswork only in the multiplechoice component, while using the 95th percentile of the total point score for the entire set of questions (multiplechoice + freetext/essay component).
The ‘modified Cohen’ is estimated using the following formula (6): limit for a pass grade = K × Px, where K is the factor with which we multiply the score Px of the students in the given percentile. We entered different values for K (0.65, 0.70 and 0.75) and obtained the associated limits for a pass grade and failure rates. The choice of K values in our study is based on the existing 65 % limit for a pass grade and therefore investigates the failure rates around this limit. Taylor found that the point score of the students in the 90th percentile represented a better reference point than the 95th percentile that was used in the ‘original Cohen’ (6). The ‘modified Cohen’ method does not correct for guesswork.
Analyses
The following statistical analyses and estimates were made in Google Sheets 2016: average, median, failure rate, standard deviation (SD) of the failure rate, 90th and 95th percentiles and correction for guesswork.
Ethics
The examination results are available as anonymised data, and no individuals can be identified. An application for permission to undertake this study was thus not deemed necessary.
Results
The current 65 % method resulted in variations of up to 12 % in the failure rate for the same examination in the course of study during the period investigated (Figure 1, Table 2). For example, in the examination in the 5th–6th semester in 2010–11, there were no failures, whereas in 2015, altogether 11 students (12 %) failed (Figure 1a). The failure rate declined during the course of study (Figure 1b) and remained the same for all standardsetting methods.
Figure 1 a) Proportion of medical students who have failed the annual examinations at the Norwegian University of Science and Technology (NTNU) 2010–15 with the current standardsetting method of 65 % correct answers needed for a pass grade. b) Failure rate (%) of four classes monitored over time; class 09 started in 2009, class 10 in 2010 etc. The x axis shows the examinations from which the data have been retrieved, while the y axis shows the failure rate in per cent.
Table 2
Semester 
Examination 
Number of candidates 
Average score 
Median score 
Number of failures (%) 

1st and 2nd 
2010 
110 
78,7 
80 
6 (5.45) 
1st and 2nd 
2011 
117 
76,9 
79 
11 (9.40) 
1st and 2nd 
2012 
108 
77,9 
78 
4 (3.70) 
1st and 2nd 
2013 
117 
78,6 
76 
16 (13.68) 
1st and 2nd 
2014 
115 
74,2 
77 
15 (13.04) 
1st and 2nd 
2015 
114 
76,3 
77 
5 (4.39) 
3rd and 4th 
2010 
121 
74,4 
74 
10 (8.26) 
3rd and 4th 
2011 
113 
77,4 
78 
7 (6.19) 
3rd and 4th 
2012 
118 
73,7 
77 
14 (11.86) 
3rd and 4th 
2013 
114 
77,4 
78 
6 (5.26) 
3rd and 4th 
2014 
109 
75,1 
77 
9 (8.26) 
3rd and 4th 
2015 
114 
76,1 
77 
11 (9.65) 
5th and 6th 
2010 
103 
80,1 
81 
0 (0.00) 
5th and 6th 
2011 
110 
83,3 
84 
0 (0.00) 
5th and 6th 
2012 
107 
77,4 
78 
3 (2.80) 
5th and 6th 
2013 
99 
78,5 
79 
2 (2.02) 
5th and 6th 
2014 
105 
77,4 
79 
9 (8.75) 
5th and 6th 
2015 
92 
73,6 
74 
11 (11.96) 
7th and 8th 
2010 
112 
73,3 
74 
11 (9.82) 
7th and 8th 
2011 
119 
75,9 
77 
9 (7.56) 
7th and 8th 
2012 
111 
79,1 
81 
10 (9.01) 
7th and 8th 
2013 
112 
77,4 
78 
5 (4.46) 
7th and 8th 
2014 
105 
77,7 
80 
5 (4.76) 
7th and 8th 
2015 
111 
77 
77 
3 (2.70) 
11th 
2011 
118 
82 
77 
5 (4.24) 
11th 
2012 
107 
75,1 
76 
6 (5.61) 
11th 
2013 
113 
83,2 
85 
1 (0.88) 
11th 
2014 
108 
80,9 
81 
0 (0.00) 
12th 
2010 
109 
82 
82 
0 (0.00) 
12th 
2011 
118 
80,2 
81 
0 (0.00) 
12th 
2012 
118 
80,6 
81 
1 (0.85) 
12th 
2013 
106 
78,6 
79 
2 (1.89) 
12th 
2014 
115 
78,9 
80 
2 (1.74) 
12th 
2015 
111 
79 
79 
4 (3.60) 
The limit for a pass grade with the ‘original’ and ‘modified Cohen’ methods with K values of 0.65 and 0.70 was lower than with the ‘absolute 65 %’ limit. Use of a ‘modified Cohen’ method with a K value of 0.75 caused the limit for a pass grade to waver around the limit in use today (Figure 2). Using the ‘original Cohen’ the limit for a pass grade amounted to 57–65 % and with the ‘modified Cohen’ to 53–68 %, depending on the K value applied (Table 3).
Figure 2 Comparison of the limits for a pass grade in all examinations of medical studies at the Norwegian University of Science and Technology 2010–15 with use of different standardsetting methods. Each column represents an examination. The Y axis shows the limit for a pass grade, the X axis shows the various methods.
Table 3
Method 
Absolute 65 % 
Original Cohen 
Modified Cohen K = 0.75 
Modified Cohen K = 0.70 
Modified Cohen K = 0.65 

Average limit for a pass grade (%) 
65.0 
62.3 
64.7 
60.6 
56.0 
Standard deviation, limit for a pass grade 
0.0 
1.5 
1.5 
1.4 
1.3 
Range, limit for a pass grade (%) 
65.0 
58.1–64.7 
61.5–67.6 
57.4–63.1 
53.3–58.6 
Average failure rate % (n) 
5.2 (6) 
3.9 (4) 
5.0 (5) 
3.0 (3) 
1.7 (2) 
Standard deviation, failure rate 
4.2 
3.7 
4.4 
3.1 
2.1 
Range, failure rate (%) 
0–13.7 
0–13.7 
0–19.7 
0–10.4 
0–8.5 
With the exception of the examination in the first year of study, the failure rate was lower with both the original and modified Cohen methods when compared to the current method (Figure 3, Table 3). ‘Original Cohen’ produced the same range in the failure rate as the ‘absolute 65 %’ (0–13.7 %), but reduced the average proportion of failures from 5.2 % to 3.9 %. ‘Modified Cohen’ with K values of 0.65 and 0.70 reduced the standard deviation (SD) and the total failure rate. ‘Modified Cohen’ with a K value of 0.70 has an average failure rate of 3.0 % (SD 3.1) and a range of 0–10.4 %, while the ‘absolute 65 %’ has an average failure rate of 5.2 % (SD 4.2) and a range of 0–13.7 %. ‘Modified Cohen’ with a K value of 0.75 produced a wider range in the failure rate than the ‘absolute 65 %’ method, and thus also a higher standard deviation (Table 3). The standard deviation of the failure rate decreased with a lower limit for a pass grade.
Figure 3 Average failure rate in per cent including the standard deviation for each examination with use of the different standardsetting methods at the Norwegian University of Science and Technology (NTNU) 2010–2015. The X axis shows the examinations in the course of study, while the Y axis shows the average failure rate
Discussion
We found that the proportion of medical students who fail their examinations varies from one year to the next in the same examination in the course of the medical studies. We have shown that the standard deviation of the failure rate can be reduced by using Cohen’s methods, but that this comes at the cost of a lower limit for a pass grade.
The study shows that the failure rate declines as the medical studies progress. We have not investigated the causes of this phenomenon, but they are likely to be multiple. The medical studies programme in Trondheim practises spiral learning, meaning that the same topic is studied approximately every other year. For example, cardiac physiology with clinical examples is taught during the first year and cardiology in the third year, with summary of cardiology in the final year. Through this spiral learning the students will deepen their understanding of the subject. Furthermore, from upper secondary school the students have become accustomed to a clearly defined syllabus and frequent testing. The medical studies programme at the Norwegian University of Science and Technology (NTNU), on the other hand, has learning objectives with a number of recommended textbooks and only 1–2 annual examinations which may test anything. The change in study technique will thus be a major challenge to many students, although most of them appear to learn how to cope with this during their studies. In addition, the attrition rate is highest during the first two years of study, when 2–6 students need to retake a year, while only a maximum of one student needs to retake any later years. Of those who quit or have their admission revoked because of repeated examination failures, altogether 73 % (101 of 137 based on figures from 1999–2016) had not completed the second year of study (personal communication, Mona Dalland Stormo and Marte Laugen, Student and Academic Section at the Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology (NTNU). Other factors that can be assumed to contribute include experience of examinations, increasing age and the subjects being perceived as more relevant at later stages of the studies, which may help boost motivation.
The study by CohenSchotanus compared two cohorts at two different medical faculties in the Netherlands (5). One faculty used a referencebased method, and the average score minus one standard deviation was used to define the limit for a pass grade. This limit hence varied between 15 % and 46 %, while the failure rate remained relatively stable at 17 %. The other faculty used a predefined 60 % limit for a pass grade, and the failure rate amounted to 17–97 % (53 % on average). It is conceivable that the students who were subject to a higher limit for a pass grade were more knowledgeable. However, the students in these cohorts performed equally well in the national progress test which is implemented in six of the eight medical faculties in the Netherlands (5). However, students at the faculty with a 60 % predefined limit for a pass grade and a high failure rate spent on average one more year to complete their studies. Considering that these cohorts were equally knowledgeable in the national test, this indicates that predefined, absolute limits for a pass grade are a waste of public resources, and not least the students’ time and resources (5).
Both the original and modified Cohen methods reduced the standard deviation of the failure rate. With the use of these methods, fewer students would have failed. We were surprised to see that the opposite never occurred, i.e. that more students failed in examinations that nobody with an absolute limit for a pass grade failed. We believe that the Cohen method that ought to be chosen is the one that produces the largest reduction in the standard deviation of the failure rate, but the smallest change in the limit for a pass grade. The objective of this is to avoid lowering the difficulty level of the examinations while seeking to reduce the range of the failure rate. In our material, this would have been achieved with a ‘modified Cohen’ method and a K value of 0.70. This sets the limit for a pass grade at 70 % of the point score of the students in the 90th percentile.
It is difficult to assess the quality of a standardsetting method, since it is hard to tell where the ‘true’ limit to a pass grade should be for each individual examination. Cohen’s methods have the advantage of being predictable for the students, since they will know that they never need a higher percentage of correct answers than the stated K value (provided that those in the 90th or 95th percentile score perfectly). They will also know that the method corrects for the degree of difficulty of the test and that the test is not subject to the discretionary judgement of an examination committee. Let us assume that the faculty decides to use the Cohen method with a K = 0.70. Those who achieve 70 % of the scores of the students in the 90th percentile will pass, those who score lower will fail.
Another advantage of Cohen’s methods when compared to other relative methods is that they do not produce a fixed failure rate. We feel that it would be problematic to introduce a standardsetting method that lowers the existing 65 % absolute limit for a pass grade. Although the failure rate varies, the number of medical students who fail annually is nevertheless small compared to other study programmes (5, 6).
Absolute methods that involve expert panels are probably the solution most likely to produce the ‘true’ limit for a pass grade on a medical examination 2, 3). This solution is used in many places, including the United States Medical Licensing Examination (USMLE) (8). In practice, however, it would be hard to implement for all examinations in each medical faculty. Based on our findings, we believe that a 65 % absolute limit for a pass grade can be defended for as long as the failure rate remains as low as today. A standardsetting method needs to have credibility. If the variation in the failure range from one year to the next becomes excessive when testing a homogenous group of students assessed according to the admission criteria for medical studies, the examination loses its credibility (4). Cohen’s methods should be used in medical schools with an extremely high failure rate, or where there are major variations in the failure rate for the same examination in the course of study. We believe that the methods could be suitable at the Norwegian University of Science and Technology (NTNU) if the failure rate for examinations deviates considerably from what is common now.
Comments