Sample report: Method

Participants

The
students in this study included 64 enrolled in the three sections of the ELI 72
reading course at UHM during the Fall semester and 152 enrolled in all the sections
of the ELI 82 reading course during the Spring and Fall semesters. Students were
randomly assigned to Forms A or B within sections during the pretests and then
assigned to the opposite form for the posttests. Exactly the same sets of tests
(described in the next section) were used in both semesters.

Procedures and Materials

The objectives and resulting criterion-referenced tests differed in organization and form for the two courses, but the process involved in developing the tests was the same and followed these steps:

1. In each case, the process began with a thorough needs analysis for each course.
2. Tentative sets of objectives were established.
3. The items were written to match the objectives by the teachers in each level, working together as part of their overall curriculum development responsibilities.
4. The items were piloted in the students' classrooms by administering them during the second week of class and again during final examination week.
5. The tests were revised by selecting those items that worked well and eliminating or modifying the other items. In some cases, the curriculum was modified as well, based on what we learned from the criterion-referenced tests.

Recall
that the CRTs written by the teachers differed in organization and form in the
intermediate and advanced reading courses. All decisions about test methods and
content were the responsibility of the teachers; the most common method used in
the reading courses was the multiple-choice format. For instance, a typical multiple-choice
item might be the following "inference" item, which was used in the
directions on the lower-level reading course test:

Out of the darkness of the cold, wintry night came the clatter of a toppled garbage can lid. Startled, Peter dropped his book and ran to the back door.

Ex. 1 What was Peter doing before he heard the noise?

A. singing    C. washing
B. reading    D. sleeping

Of
course, the passages in the tests themselves were much longer and much more academic
in content. Unfortunately, because of practical constraints (particularly
the need to turn the scoring around quickly so the tests would be maximally useful),
we have tended to favor test formats that can be scored by machine. However, as
we learn more about our criterion-referenced tests and how they relate to the
objectives of each course, we will no doubt gain confidence and begin experimenting
with more imaginative test types. For example, we have recently begun to experiment
with task-based subtests that students must do in their own time at the library.
The plan is to assign such library tasks during the first and last weeks of classes
and score them in conjunction with the students' in-class diagnostic tests
and final achievement tests. Their answers on these tasks will probably be scored
for accuracy and completeness.

Analyses

An assortment of different testing statistics was used to analyze our reading tests. We adopted statistics liberally from educational and psychological measurement, including the literatures on (a) classical test theory (traditionally used for norm-referenced testing), (b) criterion-referenced testing, (c) generalizability theory (G-theory), and (d) item response theory (IRT). All analyses were done on an IBM computer using a
spreadsheet program called QuattroPro (Borland, 1989) and a test analysis
program called TESTAT (SYSTAT, 1987). These programs were neither expensive
nor difficult to use, so clearly, the technology required is well within the reach
of most language programs in terms of the resources and abilities needed to acquire
and use them. Four sets of analyses were used: descriptive statistics, item statistics, consistency estimates, and validity strategies, as follows:

1. Descriptive statistics included the mean, standard deviation, range, number of items, and number of subjects.
2. Item statistics included traditional NRT statistics (like item facility and item discrimination), CRT statistics (including the difference index, item phi (φ), the B-index, and item agreement), and IRT statistics (including item difficulty and discrimination estimates).
3. Consistency estimates included NRT methods (Cronbach alpha, split-half adjusted, and Guttman estimates) and CRT methods (including both the phi domain-score dependability index and the phi(lambda) squared-error loss agreement coefficient). Two other consistency indicators were also used: the NRT standard error of measurement and the analogous CRT confidence intervals.
4. Validity was considered in terms of content validity and construct validity. The latter strategy involved both the intervention and differential groups perspectives.
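As an illustration of three of the item statistics named in the list above (item facility, the difference index, and the B-index), the following Python sketch computes them from dichotomously scored responses. It is not the procedure used in the study, which relied on QuattroPro and TESTAT; the function names, the cut score, and the tiny data set are hypothetical, and the formulas follow the usual textbook definitions (difference index = posttest facility minus pretest facility; B-index = facility among passers minus facility among failers).

```python
def item_facility(responses, item):
    """Proportion of examinees who answered the given item correctly.

    `responses` is a list of rows, one per examinee, scored 1 (right) or 0 (wrong).
    """
    return sum(row[item] for row in responses) / len(responses)


def difference_index(pretest, posttest, item):
    """Posttest item facility minus pretest item facility."""
    return item_facility(posttest, item) - item_facility(pretest, item)


def b_index(responses, item, cut_score):
    """Item facility for examinees who passed the test minus that for those who failed.

    Passing is assumed here to mean a total score at or above `cut_score`.
    """
    passers = [row for row in responses if sum(row) >= cut_score]
    failers = [row for row in responses if sum(row) < cut_score]
    if not passers or not failers:
        raise ValueError("B-index needs at least one passer and one failer")
    return item_facility(passers, item) - item_facility(failers, item)


# Hypothetical scores for four students on a three-item test.
pretest = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 0]]
posttest = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 0]]

print(item_facility(posttest, 0))              # 0.75
print(difference_index(pretest, posttest, 0))  # 0.5
print(b_index(posttest, 0, cut_score=2))       # 1.0
```

An item with a high difference index or B-index is one that students who have received instruction (or who have mastered the objectives) tend to answer correctly, which is the behavior sought when selecting items during revision.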
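The NRT consistency estimates mentioned above can be sketched in the same way. The Python example below computes Cronbach alpha, a split-half estimate adjusted with the Spearman-Brown formula, and the standard error of measurement. It is a minimal sketch under stated assumptions rather than a reproduction of the TESTAT output: the function names and data are hypothetical, the split is assumed to be odd-numbered versus even-numbered items, and population (not sample) variances are used throughout.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance


def cronbach_alpha(responses):
    """Cronbach alpha from rows of item scores (one row per examinee)."""
    k = len(responses[0])
    item_variances = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    mx, my = mean(x), mean(y)
    covariance_sum = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return covariance_sum / sqrt(ss_x * ss_y)


def split_half_adjusted(responses):
    """Odd-even split-half correlation, adjusted to full test length
    with the Spearman-Brown formula."""
    odd_half = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r = pearson_r(odd_half, even_half)
    return 2 * r / (1 + r)


def standard_error_of_measurement(responses, reliability):
    """Standard deviation of total scores times sqrt(1 - reliability)."""
    totals = [sum(row) for row in responses]
    return pstdev(totals) * sqrt(1 - reliability)


# Hypothetical scores for five students on a four-item test.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(scores)
print(round(alpha, 2))                                         # 0.8
print(round(split_half_adjusted(scores), 2))                   # 0.88
print(round(standard_error_of_measurement(scores, alpha), 2))  # 0.63
```

The CRT dependability indexes (phi and phi(lambda)) rest on generalizability theory variance components and are not covered by this sketch.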