Sample report: Introduction

Why Did We Include the Sample Report?

The sample report that follows is meant to serve several
purposes. First, it may serve as a model for reports that you may end up writing
on the development of your own CRTs. We hope it will serve you well in that regard.
It will no doubt need a good deal of adaptation and modification so that it will
suit your purposes and fit your situation. However, at least the skeleton of what
we include here should prove useful. Second, reading through this sample
report should also help you to review many of the concepts covered in this book.
Reading it may also serve as a kind of Criterion-referenced achievement test for
you, giving you feedback on what you did or did not understand in the book. Hopefully,
you will find that the report is amazingly clear to you and you will realize that
you have learned a great deal from reading this book. If not, reading this report
may serve more as a diagnostic test: those areas that are clear to you, you have
learned; those areas you do not understand may require some review.

Introduction

Criterion-referenced Testing in Two Academic Reading Courses
James Dean Brown
University of Hawaii at Manoa

During the time that the author of this report was director
of the English Language Institute (ELI) at the University of Hawaii, the curriculum
was extensively revised in a number of steps, including analysis of needs, development
of goals and objectives, creation of Criterion-referenced tests and materials,
improvements in teaching practices, and regularly conducted formative evaluation
procedures (as explained in Brown, 1995a). At that time, the ELI offered seven
service courses in academic listening, reading, and writing for students who were
fully matriculated into the university. This paper reports on the Criterion-referenced
test development portion of the curriculum development process for two of the
courses, ELI 72 and ELI 82 (Intermediate and Advanced Academic Reading, respectively). Each
of the two ELI reading courses has two forms of a Criterion-referenced test designed
to measure the particular objectives of the course in question. These two forms
are administered at the beginning of each course for diagnostic purposes and at
the end of the course for achievement purposes in a counterbalanced manner (so
that no student takes the same test twice). This testing project is reasonably
large in scale, including four different tests administered at the beginning and
end of instruction for hundreds of students every year. While the objectives and
resulting tests are different in organization and form for the two courses, the
processes involved in developing, piloting, revising, administering, and scoring
the tests are quite similar. This report describes the initial item development,
piloting, and revision processes in general terms. Then, the report describes
and explains the results of the administrations of these CRTs during Fall semester
1989. The report will provide descriptive and item statistics (including the difference
index, item phi, the B-index, and an item agreement index) for each
form of the reading tests, as well as dependability estimates [phi and phi(lambda)],
and evidence for the content and construct validity of the tests. The report
will also discuss the problems encountered in developing such a Criterion-referenced
testing program as well as the curriculum benefits to be derived from such a CRT
development process.

All foreign students admitted to the University of
Hawaii at Manoa (UHM) must report to the English Language Institute (ELI) for
clearance before they are allowed to register for classes. The purpose of this
clearance is not to punish the students, as some of them seem to think, but rather
to decide if they need to take any further ESL training while they are taking
their courses. They may be exempted from ESL courses altogether or complete between
one and six three-unit courses during the first year or two of their time at our
university.

Figure 1: ELI Courses

| TOEFL | Listening (receptive) | Reading (receptive) | Speaking (productive) | Writing, Grads. (productive) | Writing, U.Grads (productive) |
| Exempt | | | | | |
| 600 | ELI 80 | ELI 82 | ELI 81 | ELI 83 | ESL 100 |
| | ELI 70 | ELI 72 | | ELI 73 | ELI 73 |
| 500 | | | | | |

To meet the needs described above, the ELI offers eight courses in
academic listening, reading, speaking and writing (see Figure 1). Notice that
a TOEFL range of between 500 and 600 is indicated down the left side of the figure
and that the courses are clearly organized into four skill areas and two levels.

Curriculum Development

Between 1986 and 1991, the curriculum for
the courses shown in Figure 1 was considerably revised. The curriculum was systematically
renovated through (a) a thorough needs analysis, (b) improvement or development
of objectives, (c) revision of the placement and classroom tests, (d) materials
development, (e) enhancement of teacher support, and (f) cyclically organized
program evaluation procedures. Figure 2 shows how these elements of our curriculum
development were related. Note that testing is exactly at the center of the model
and that program evaluation is depicted as constantly interacting with all the
other components of the development process. [For a book-length elaboration of
this model, see Brown, 1995a.]

Figure 2: Systematic Approach to Curriculum Development in the ELI (adapted from Brown 1989b)

This report centers on the testing component of that curriculum, in particular on the
development and implementation of Criterion-referenced tests. For the sake of
illustration, we will further narrow the purpose to examining the development
and use of Criterion-referenced tests for our two reading courses. Notice in the
model in Figure 2 that the arrows connect testing to all the other elements in
the curriculum, either directly or through other components. These connections
express the belief that tests, particularly Criterion-referenced tests, interact
back and forth with the course objectives and needs analysis as well as with materials
development, teaching, and program evaluation. Such interactions among the curriculum
elements, with Criterion-referenced tests at the center, are felt to be essential
to the revision and improvement of the entire curriculum.

What are Criterion-referenced tests?

Richards, Platt, and Weber (1985) define a Criterion-referenced test as:

    a test which measures a student's performance according to a particular
    standard or criterion which has been agreed upon. The student must reach this
    level of performance to pass the test, and a student's score is therefore interpreted
    with reference to the criterion score, rather than to the scores of other students.

That definition is very different from their definition for a norm-referenced
test (NRT), which they say is:

    a test which is designed to measure
    how the performance of a particular student or group of students compares with
    the performance of another student or group of students whose scores are given
    as the norm. A student's score is therefore interpreted with reference to the
    scores of other students or groups of students, rather than to an agreed criterion
    score.

These in combination point to the most important difference
between norm-referenced and Criterion-referenced tests: each student's
score on a CRT is compared to a particular criterion level or standard (for instance,
if the passing score on a test is 70 percent, a student answering 73% correctly
would pass); in contrast, on an NRT, each student's score is compared to the performances
of all the other students in whatever group is designated as the norm (for instance,
if a student's score is at the 86th percentile, that score is better than
86% of the other students, but worse than 14%, without reference to the actual
score, or percent, of items correctly answered). The key to grasping the
difference between CRTs and NRTs is found in the distinction between the words
percentage and percentile. The purpose of a CRT is to measure the amount of material
that the students know or have learned, so it makes sense to score the tests and report the results
to the students in the form of percentages, that is, the percentages of questions
students answered correctly. These percentage scores can then be directly related
to the material taught in the class and related to a previously established criterion
level for passing the test. In contrast, the purpose of an NRT is to measure
how each student's score is related to the scores of all the other students who
took the test, that is, the focus is on each student's position in the distribution
of scores. This type of score is usually easiest for students to understand
if it is expressed as a percentile score because percentile scores clearly reveal
the proportion above and below any particular student of interest. In sum,
CRTs are most commonly used to measure the amount of course material each student
knows or has learned, while NRTs are used to measure the relationship of each
student's score to the scores of all the other students. However, while the percentage/percentile
distinction is crucial, other differences between CRTs and NRTs do arise in at
least five other ways: (a) the kinds of things that they are used to measure,
(b) the testing purposes involved, (c) the distributions of scores that will result,
(d) the testing formats, and (e) the degree to which students know what content
to expect (for more information on these differences, see Brown 1989a, 1990a,
1990b, 1995b, 1996). In the last 20 years, the importance of the distinction
between norm-referenced and Criterion-referenced testing has increased considerably
in language testing circles (for examples, see Cartier 1968; Cziko 1982 &
1983; Hudson and Lynch 1984; Delamere 1985; Henning 1987; Bachman 1989 & 1990;
Brown 1984a, 1989a, 1989b, 1990a, 1990b, 1995, & 1996). In the educational
measurement literature, Criterion-referenced testing has been around even longer
(beginning with Glaser 1963). For instance, even a cursory examination of almost
any recent volume of the Journal of Educational Measurement or Applied
Psychological Measurement will show that it contains at least one issue related
to Criterion-referenced testing. More importantly for the ELI, the distinction
between NRTs and CRTs is becoming increasingly useful at UHM for developing, analyzing,
and revising the various types of tests that we need for admissions (the TOEFL
NRT), placement (the ELIPT NRT), diagnosis (classroom CRTs), and achievement decisions
(classroom CRTs). This report describes part of the Criterion-referenced
side of our curriculum. The following research questions are posed here to help
organize the description of the results of the Criterion-referenced tests in our
reading courses:

1. What are the descriptive characteristics of the Criterion-referenced tests when used in the two reading courses? How do they differ across levels of the reading courses?
2. What item statistics are most useful for revising the Criterion-referenced reading tests in this context? How do the NRT, CRT, and IRT (Item Response Theory) approaches compare in their usefulness for analyzing and improving CRTs?
3. To what degree are these Criterion-referenced reading tests consistent in what they measure? How do NRT reliability and CRT dependability approaches compare in usefulness for such CRTs? How do they differ?
4. To what degree are these Criterion-referenced reading tests valid? What strategies are most useful for investigating the validity of CRTs in such a practical testing situation?
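
The percentage/percentile distinction discussed earlier, and two of the item statistics named in this report (the difference index and the B-index), can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function names and all of the data are hypothetical and do not come from the ELI tests. The difference index is taken here in its usual sense (item facility after instruction minus item facility before instruction), and the B-index as item facility among students who passed the test minus item facility among those who failed; the report itself does not spell out these formulas at this point.

```python
from typing import Sequence


def percent_correct(n_correct: int, n_items: int) -> float:
    """CRT-style score: percentage of items answered correctly."""
    return 100.0 * n_correct / n_items


def percentile_rank(score: float, norm_scores: Sequence[float]) -> float:
    """NRT-style score: percentage of the norm group scoring below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)


def item_facility(responses: Sequence[int]) -> float:
    """Proportion of examinees answering one item correctly (1 = right, 0 = wrong)."""
    return sum(responses) / len(responses)


def difference_index(pre: Sequence[int], post: Sequence[int]) -> float:
    """Item facility at the end of instruction minus item facility at the start."""
    return item_facility(post) - item_facility(pre)


def b_index(passers: Sequence[int], failers: Sequence[int]) -> float:
    """Item facility for students who passed the test minus that for those who failed."""
    return item_facility(passers) - item_facility(failers)


if __name__ == "__main__":
    # Hypothetical student: 22 of 30 items correct, judged against a 70% criterion.
    pct = percent_correct(22, 30)
    print(f"percentage score: {pct:.1f}, passes 70% criterion: {pct >= 70}")

    # The same raw score interpreted against a hypothetical norm group.
    norm_group = [10, 12, 15, 18, 20, 21, 23, 25, 27, 29]
    print(f"percentile rank: {percentile_rank(22, norm_group):.0f}")

    # One hypothetical item answered by eight students before and after the course.
    pre = [0, 0, 1, 0, 1, 0, 0, 0]
    post = [1, 1, 1, 0, 1, 1, 1, 0]
    print(f"difference index: {difference_index(pre, post):.2f}")

    # The same item split by whether students passed or failed the whole test.
    print(f"B-index: {b_index([1, 1, 1, 0], [0, 1, 0, 0]):.2f}")
```

In this sketch the same raw score yields a percentage (interpreted against the criterion) and a percentile (interpreted against the other students), which is the contrast the report draws between CRTs and NRTs. An item with a high difference index or B-index is one that separates students who have and have not mastered the material.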