Sample report: Introduction

Why Did We Include the Sample Report?

The sample report that follows is meant to serve several purposes. First, it may serve as a model for reports that you may end up writing on the development of your own CRTs. We hope it will serve you well in that regard. It will no doubt need a good deal of adaptation and modification so that it will suit your purposes and fit your situation. However, at least the skeleton of what we include here should prove useful.

Second, reading through this sample report should also help you to review many of the concepts covered in this book. Reading it may also serve as a kind of Criterion-referenced achievement test for you, giving you feedback on what you did or did not understand in the book. Hopefully, you will find that the report is amazingly clear to you and you will realize that you have learned a great deal from reading this book. If not, reading this report may serve more as a diagnostic test: those areas that are clear to you, you have learned; those areas that you do not understand may require some review.

Introduction

Criterion-referenced Testing in Two Academic Reading Courses
James Dean Brown
University of Hawaii at Manoa

During the time that the author of this report was director of the English Language Institute (ELI) at the University of Hawaii, the curriculum was extensively revised in a number of steps, including analysis of needs, development of goals and objectives, creation of Criterion-referenced tests and materials, improvements in teaching practices, and regularly conducted formative evaluation procedures (as explained in Brown, 1995a). At that time, the ELI offered seven service courses in academic listening, reading, and writing for students who were fully matriculated into the university. This paper reports on the Criterion-referenced test development portion of the curriculum development process for two of the courses, ELI 72 and ELI 82 (Intermediate and Advanced Academic Reading, respectively).

Each of the two ELI reading courses has two forms of a Criterion-referenced test designed to measure the particular objectives of the course in question. These two forms are administered at the beginning of each course for diagnostic purposes and at the end of the course for achievement purposes in a counterbalanced manner (so that no student takes the same test twice). This testing project is reasonably large in scale, including four different tests administered at the beginning and end of instruction to hundreds of students every year. While the objectives and resulting tests are different in organization and form for the two courses, the processes involved in developing, piloting, revising, administering, and scoring the tests are quite similar. This report describes the initial item development, piloting, and revision processes in general terms. Then, the report describes and explains the results of the administrations of these CRTs during Fall semester 1989. The report will provide descriptive and item statistics (including the difference index, item phi, the B-index, and an item agreement index) for each form of the reading tests, as well as dependability estimates [phi and phi(lambda)], and evidence for the content and construct validity of the tests.
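The counterbalanced administration described above can be sketched in a few lines of Python. This is only an illustration: the report does not say how students were divided between the two forms, so the alternating assignment, the form labels "A" and "B", and the function name are all assumptions.

```python
from typing import Dict, Sequence


def counterbalance(student_ids: Sequence[str]) -> Dict[str, Dict[str, str]]:
    """Assign each student one form at pretest and the other form at posttest.

    Assumption: students are simply alternated between the two orders
    (A then B, B then A); the report does not state the actual procedure.
    """
    plan = {}
    for position, student in enumerate(student_ids):
        if position % 2 == 0:
            pre_form, post_form = "A", "B"
        else:
            pre_form, post_form = "B", "A"
        plan[student] = {"pretest": pre_form, "posttest": post_form}
    return plan


# Hypothetical student identifiers, for illustration only.
plan = counterbalance(["s1", "s2", "s3", "s4"])
for student, forms in plan.items():
    print(student, forms["pretest"], forms["posttest"])
```

Under this scheme each form is used about equally often at each administration, and no student sees the same form at both the beginning and the end of the course.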

The report will also discuss the problems encountered in developing such a Criterion-referenced testing program as well as the curriculum benefits to be derived from such a CRT development process.

All foreign students admitted to the University of Hawaii at Manoa (UHM) must report to the English Language Institute (ELI) for clearance before they are allowed to register for classes. The purpose of this clearance is not to punish the students, as some of them seem to think, but rather to decide if they need to take any further ESL training while they are taking their courses. They may be exempted from ESL courses altogether or may be required to complete between one and six three-unit courses during the first year or two of their time at our university.

Figure 1: ELI Courses

TOEFL     Receptive skills          Productive skills
          Listening    Reading      Speaking    Writing (Grads.)    Writing (U.Grads)
Exempt
600       ELI 80       ELI 82       ELI 81      ELI 83              ESL 100
          ELI 70       ELI 72                   ELI 73              ELI 73
500

To meet the needs described above, the ELI offers eight courses in academic listening, reading, speaking, and writing (see Figure 1). Notice that a TOEFL range of between 500 and 600 is indicated down the left side of the figure and that the courses are clearly organized into four skill areas and two levels.

Curriculum Development

Between 1986 and 1991, the curriculum for the courses shown in Figure 1 was considerably revised. The curriculum was systematically renovated through (a) a thorough needs analysis, (b) improvement or development of objectives, (c) revision of the placement and classroom tests, (d) materials development, (e) enhancement of teacher support, and (f) cyclically organized program evaluation procedures. Figure 2 shows how these elements of our curriculum development were related. Note that testing is exactly at the center of the model and that program evaluation is depicted as constantly interacting with all the other components of the development process. [For a book-length elaboration of this model, see Brown, 1995a.]

Figure 2: Systematic Approach to Curriculum Development in the ELI (adapted from Brown 1989b)

Curriculum Development chart

This report centers on the testing component of that curriculum, in particular on the development and implementation of Criterion-referenced tests. For the sake of illustration, we will further narrow the purpose to examining the development and use of Criterion-referenced tests for our two reading courses. Notice in the model in Figure 2 that the arrows connect testing to all the other elements in the curriculum, either directly or through other components. These connections express the belief that tests, particularly Criterion-referenced tests, interact back and forth with the course objectives and needs analysis as well as with materials development, teaching, and program evaluation. Such interactions among the curriculum elements, with Criterion-referenced tests at the center, are felt to be essential to the revision and improvement of the entire curriculum.

What are Criterion-referenced tests?

Richards, Platt, and Weber (1985) define a Criterion-referenced test as:
a test which measures a student's performance according to a particular standard or criterion which has been agreed upon. The student must reach this level of performance to pass the test, and a student's score is therefore interpreted with reference to the criterion score, rather than to the scores of other students.

That definition is very different from their definition of a norm-referenced test (NRT), which they say is:

a test which is designed to measure how the performance of a particular student or group of students compares with the performance of another student or group of students whose scores are given as the norm. A student's score is therefore interpreted with reference to the scores of other students or groups of students, rather than to an agreed criterion score.

Taken together, these definitions point to the most important difference between norm-referenced and Criterion-referenced tests: each student's score on a CRT is compared to a particular criterion level or standard (for instance, if the passing score on a test is 70 percent, a student answering 73 percent of the items correctly would pass); in contrast, on an NRT, each student's score is compared to the performances of all the other students in whatever group is designated as the norm (for instance, if a student's score is at the 86th percentile, that score is better than those of 86 percent of the other students, but worse than those of 14 percent, without reference to the actual score, or percent, of items correctly answered).

The key to grasping the difference between CRTs and NRTs is found in the distinction between the words percentage and percentile. The purpose of a CRT is to measure the amount of material that the students know or have learned, so it makes sense to score the tests and report the results to the students in the form of percentages, that is, the percentages of questions students answered correctly. These percentage scores can then be directly related to the material taught in the class and related to a previously established criterion level for passing the test.

In contrast, the purpose of an NRT is to measure how each student's score is related to the scores of all the other students who took the test, that is, the focus is on each student's position in the distribution of scores. This type of score is usually easiest for students to understand if it is expressed as a percentile score because percentile scores clearly reveal the proportion of students above and below any particular student of interest.
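The percentage/percentile contrast can be made concrete with a short Python sketch. The function names, the 70 percent cut-point default, and the norm-group scores below are illustrative assumptions, not data from this report. Percentile rank is computed here as the percent of the norm group scoring strictly below the student, which is one of several conventions in use.

```python
from typing import Sequence


def percentage_score(items_correct: int, total_items: int) -> float:
    """CRT-style score: percent of the items answered correctly."""
    return 100.0 * items_correct / total_items


def passes(items_correct: int, total_items: int, cut_point: float = 70.0) -> bool:
    """Compare the percentage score to a previously established criterion level."""
    return percentage_score(items_correct, total_items) >= cut_point


def percentile_rank(score: float, norm_group_scores: Sequence[float]) -> float:
    """NRT-style score: percent of the norm group scoring strictly below `score`."""
    below = sum(1 for other in norm_group_scores if other < score)
    return 100.0 * below / len(norm_group_scores)


# A student answers 73 of 100 items correctly (the example used in the text).
print(percentage_score(73, 100))  # interpreted against the criterion
print(passes(73, 100))            # True with a 70 percent cut-point

# Hypothetical norm group, invented for illustration only.
norm_group = [40, 45, 50, 55, 60, 62, 65, 70, 80, 90]
print(percentile_rank(73, norm_group))  # interpreted against other students
```

The same raw score of 73 yields a fixed percentage regardless of who else took the test, while its percentile rank would change if the norm group changed.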

In sum, CRTs are most commonly used to measure the amount of course material each student knows or has learned, while NRTs are used to measure the relationship of each student's score to the scores of all the other students. However, while the percentage/percentile distinction is crucial, CRTs and NRTs also differ in at least five other ways: (a) the kinds of things that they are used to measure, (b) the testing purposes involved, (c) the distributions of scores that will result, (d) the testing formats, and (e) the degree to which students know what content to expect (for more information on these differences, see Brown 1989a, 1990a, 1990b, 1995b, 1996).

In the last 20 years, the importance of the distinction between norm-referenced and Criterion-referenced testing has increased considerably in language testing circles (for examples, see Cartier 1968; Cziko 1982 & 1983; Hudson and Lynch 1984; Delamere 1985; Henning 1987; Bachman 1989 & 1990; Brown 1984a, 1989a, 1989b, 1990a, 1990b, 1995, & 1996). In the educational measurement literature, Criterion-referenced testing has been around even longer (beginning with Glaser 1963). For instance, even a cursory examination of almost any recent volume of the Journal of Educational Measurement or Applied Psychological Measurement will show that it contains at least one issue related to Criterion-referenced testing. More important for the ELI, the distinction between NRTs and CRTs is becoming increasingly useful at UHM for developing, analyzing, and revising the various types of tests that we need for admissions (the TOEFL NRT), placement (the ELIPT NRT), diagnosis (classroom CRTs), and achievement decisions (classroom CRTs).

This report describes part of the Criterion-referenced side of our curriculum. The following research questions are posed here to help organize the description of the results of the Criterion-referenced tests in our reading courses:

1. What are the descriptive characteristics of the Criterion-referenced tests when used in the two reading courses? How do they differ across levels of the reading courses?

2. What item statistics are most useful for revising the Criterion-referenced reading tests in this context? How do the NRT, CRT, and IRT (Item Response Theory) approaches compare in their usefulness for analyzing and improving CRTs?

3. To what degree are these Criterion-referenced reading tests consistent in what they measure? How do NRT reliability and CRT dependability approaches compare in usefulness for such CRTs? How do they differ?

4. To what degree are these Criterion-referenced reading tests valid? What strategies are most useful for investigating the validity of CRTs in such a practical testing situation?
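As a rough illustration of the CRT item statistics raised in the second research question, the sketch below computes two of them in Python using their commonly cited definitions: the difference index as item facility at posttest minus item facility at pretest, and the B-index as item facility among students who passed the test minus item facility among those who failed. The function names, the 0/1 response coding, and the sample data are assumptions for illustration, not results from this report.

```python
from typing import Sequence


def item_facility(responses: Sequence[int]) -> float:
    """Proportion of students answering an item correctly (responses coded 0/1)."""
    return sum(responses) / len(responses)


def difference_index(pretest: Sequence[int], posttest: Sequence[int]) -> float:
    """Item facility at posttest minus item facility at pretest."""
    return item_facility(posttest) - item_facility(pretest)


def b_index(item_responses: Sequence[int],
            total_scores: Sequence[float],
            cut_score: float) -> float:
    """Item facility for passers minus item facility for failers.

    A student counts as a passer when the total score is at or above cut_score.
    """
    passers = [r for r, t in zip(item_responses, total_scores) if t >= cut_score]
    failers = [r for r, t in zip(item_responses, total_scores) if t < cut_score]
    if not passers or not failers:
        raise ValueError("B-index needs at least one passer and one failer")
    return item_facility(passers) - item_facility(failers)


# Invented responses to a single item, for illustration only.
pre = [0, 0, 1, 0]
post = [1, 1, 1, 0]
print(difference_index(pre, post))

item = [1, 1, 1, 0, 0, 1]
totals = [9, 8, 7, 4, 5, 6]
print(b_index(item, totals, cut_score=7))
```

Both statistics range from -1 to +1; larger positive values suggest that an item is sensitive to instruction (difference index) or consistent with the pass/fail decision (B-index).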
