Testing for Language Teachers - Cambridge University Press

Additional exercises for Appendix 1: The statistical analysis of test data

1. Comment on the following two frequency tables, which are for the same test and which summarise the scores of two different sets of students...

Score	Frequency
1	0
2	0
3	3
4	1
5	6
6	9
7	14
8	16
9	18
10	20
11	15
12	12
13	14
14	10
15	6
16	3
17	0
18	0
19	0
20	0

Score	Frequency
1	0
2	0
3	0
4	0
5	1
6	2
7	0
8	5
9	7
10	6
11	9
12	12
13	14
14	19
15	27
16	40
17	32
18	26
19	18
20	12

2. Calculate the mean, mode and range for each of the two sets of data above.

3. Below are the facility values of ten items on a test. Which is the most difficult item? Which is the easiest? In what circumstances might you wish to include items as difficult and as easy as these two?

Item	Facility
A	0.67
B	0.92
C	0.67
D	0.56
E	0.54
F	0.67
G	0.42
H	0.23
I	0.19
J	0.06

4. Here are the discrimination indices for the same items. Which item discriminates best, and which discriminates least well? Comment on the discrimination indices in relation to the facility values.

Item	Discrimination
A	0.33
B	0.18
C	0.46
D	0.37
E	0.26
F	0.01
G	0.44
H	0.18
I	0.23
J	0.01

5. Look again at the frequency tables above. What effect would the difference in ability between the two groups have on facility values?

In the present case, the two groups have responded to the same items. What if they had responded to two different sets of items? How would you be able to compare the difficulty of the items in one set with the difficulty of the items in the other set?

Answers

Question 1:
In the first frequency table, the great majority of test takers (128 out of 147) scored between 6 and 14. Nobody scored between 17 and 20.

In the second table, the great majority of test takers (209 out of 230) scored between 11 and 20, with only three people scoring less than 8.

The group whose scores are represented in the second table is clearly more able than the other group.

It is important to note that from these results alone we cannot be sure whether or not the test is appropriate for either or both of the groups. We would need to know the purpose of the test.
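
The counts quoted in this answer can be checked directly against the two frequency tables. The following is a minimal Python sketch of that check; the names `freq_1`, `freq_2` and `count_between` are illustrative and are not part of the original exercise.

```python
# Frequencies for scores 1..20, copied from the two tables in Question 1.
freq_1 = [0, 0, 3, 1, 6, 9, 14, 16, 18, 20, 15, 12, 14, 10, 6, 3, 0, 0, 0, 0]
freq_2 = [0, 0, 0, 0, 1, 2, 0, 5, 7, 6, 9, 12, 14, 19, 27, 40, 32, 26, 18, 12]


def count_between(freqs, low, high):
    """Number of test takers with low <= score <= high (scores start at 1)."""
    return sum(freqs[low - 1:high])


# First table: group size, scores 6-14, scores 17-20
print(sum(freq_1), count_between(freq_1, 6, 14), count_between(freq_1, 17, 20))
# prints: 147 128 0

# Second table: group size, scores 11-20, scores below 8
print(sum(freq_2), count_between(freq_2, 11, 20), count_between(freq_2, 1, 7))
# prints: 230 209 3
```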

Question 2:
In the first table, the mean is 9.89 (1454 divided by 147); the mode is 10 (scored by 20 people); the range is 13 (16 minus 3). You may like to know that some people would reduce this range by one (they would say it is 12) because the number 3 extends to 3.5 and the number 16 begins at 15.5, if we assume the scale on which people score is really continuous.

In the second table, the mean is 15.26 (3510 divided by 230); the mode is 16; the range is 15 (or 14).

The higher mean and mode reflect the superior ability of the second group.
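
For readers who want to reproduce the arithmetic, here is a short Python sketch that derives the mean, mode and range from a frequency table. The function name `summarise` and the list layout are assumptions made for illustration. The range is computed as highest observed score minus lowest observed score, which gives the larger of the two range values mentioned above.

```python
# Frequencies for scores 1..20, copied from the two tables in Question 1.
freq_1 = [0, 0, 3, 1, 6, 9, 14, 16, 18, 20, 15, 12, 14, 10, 6, 3, 0, 0, 0, 0]
freq_2 = [0, 0, 0, 0, 1, 2, 0, 5, 7, 6, 9, 12, 14, 19, 27, 40, 32, 26, 18, 12]


def summarise(freqs):
    """Return (n, sum of scores, mean, mode, range) for scores 1..len(freqs)."""
    scores = range(1, len(freqs) + 1)
    n = sum(freqs)                                      # number of test takers
    total = sum(s * f for s, f in zip(scores, freqs))   # sum of all scores
    mode = max(scores, key=lambda s: freqs[s - 1])      # most frequent score
    observed = [s for s, f in zip(scores, freqs) if f > 0]
    score_range = max(observed) - min(observed)         # highest minus lowest
    return n, total, round(total / n, 2), mode, score_range


print(summarise(freq_1))  # (147, 1454, 9.89, 10, 13)
print(summarise(freq_2))  # (230, 3510, 15.26, 16, 15)
```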

Question 3:
The most difficult item is J; the easiest is B. We might want to have items as difficult as Item J in a test in order to discriminate between more able people. We might want items as easy as Item B in order to get test takers off to a confident start, or to discriminate between people of lower ability. Remember that 'easy' items or 'difficult' items are only easy or difficult in relation to the ability of the people who respond to them.
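
Since a facility value is the proportion of test takers who answer an item correctly, the hardest item is the one with the lowest value and the easiest is the one with the highest. A minimal Python sketch of that lookup, using the values from the table in Question 3 (the variable names are illustrative):

```python
# Facility values copied from the table in Question 3.
facility = {"A": 0.67, "B": 0.92, "C": 0.67, "D": 0.56, "E": 0.54,
            "F": 0.67, "G": 0.42, "H": 0.23, "I": 0.19, "J": 0.06}

hardest = min(facility, key=facility.get)  # lowest facility = most difficult
easiest = max(facility, key=facility.get)  # highest facility = easiest

print(hardest, facility[hardest])  # J 0.06
print(easiest, facility[easiest])  # B 0.92

# Items ordered from easiest to most difficult for this group of test takers.
print(sorted(facility, key=facility.get, reverse=True))
```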

Question 4:
Item C discriminates best. Items F and J discriminate least well. Item J may not discriminate very well because it is so difficult (Facility = 0.06), but we may still want to keep it in the test for the reason given above. Item F, on the other hand, is of only moderate difficulty and the most likely reason for its lack of discrimination between stronger and weaker test takers is that it is faulty in some way.
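
Setting the two tables side by side makes this comparison easier to see. The Python sketch below is illustrative only: it finds the best and weakest discriminators and prints each item's facility value next to its discrimination index. It applies no cut-off for "acceptable" discrimination, because the exercise does not give one.

```python
# Values copied from the tables in Questions 3 and 4.
facility = {"A": 0.67, "B": 0.92, "C": 0.67, "D": 0.56, "E": 0.54,
            "F": 0.67, "G": 0.42, "H": 0.23, "I": 0.19, "J": 0.06}
discrimination = {"A": 0.33, "B": 0.18, "C": 0.46, "D": 0.37, "E": 0.26,
                  "F": 0.01, "G": 0.44, "H": 0.18, "I": 0.23, "J": 0.01}

best = max(discrimination, key=discrimination.get)
lowest = min(discrimination.values())
# More than one item can share the lowest index, so collect all of them.
weakest = [item for item, d in discrimination.items() if d == lowest]

print(best)     # C
print(weakest)  # ['F', 'J']

# Facility and discrimination side by side, easiest item first.
for item in sorted(facility, key=facility.get, reverse=True):
    print(f"{item}\t{facility[item]:.2f}\t{discrimination[item]:.2f}")
```

In the printed listing, Items A, C and F share a facility value of 0.67 but have very different discrimination indices, which is why Item F stands out as probably faulty rather than simply too hard or too easy.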

Question 5:
Facility values calculated on the performance of the first group would be lower than those calculated on the performance of the second group. If each group had responded to a completely different set of items, it would not be possible to make sensible comparisons between the facility values of one set of items and those of the other set. In order to make such a comparison (in order, for example, to construct a single test using items from the two sets), the two sets should include items common to both of them (known as anchor items). IRT analysis would then allow all of the items to be put on the same difficulty scale.
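
To give a rough idea of how anchor items make this possible, the Python sketch below applies mean-mean linking, one simple linking method, which is not necessarily the one an IRT package would use. It assumes that each item set has already been calibrated separately under a Rasch-type model, so that every item has a difficulty estimate in logits. All item names and difficulty values in the sketch are hypothetical and invented purely for illustration; they are not derived from the tables above.

```python
# HYPOTHETICAL difficulty estimates (in logits) from two separate calibrations.
# "anchor1" and "anchor2" are the items that appear in both sets.
set_x = {"x1": -1.0, "x2": 0.5, "anchor1": -0.5, "anchor2": 1.0}   # first group
set_y = {"y1": -0.8, "y2": 0.4, "anchor1": -1.3, "anchor2": 0.2}   # second group

anchors = sorted(set(set_x) & set(set_y))

# Mean-mean linking: the anchor items should have the same average difficulty
# on both scales, so the difference between the two averages is the shift.
mean_x = sum(set_x[a] for a in anchors) / len(anchors)
mean_y = sum(set_y[a] for a in anchors) / len(anchors)
shift = mean_x - mean_y

# Express every item from set Y on the scale of set X.
set_y_on_x_scale = {item: d + shift for item, d in set_y.items()}

for item, d in sorted(set_y_on_x_scale.items()):
    print(f"{item}\t{d:.2f}")
```

After the shift, the difficulties of the non-anchor items in the two sets can be compared directly, which is what is needed to assemble a single test from both sets.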