ELT: Criterion-referenced Language Testing

Sample report: Method

Participants

The students in this study included 64 enrolled in the three sections of the ELI 72 reading course at UHM during the Fall semester and 152 enrolled in all the sections of the ELI 82 reading course during the Spring and Fall semesters. Students were randomly assigned to Forms A or B within sections during the pretests and then assigned to the opposite form for the posttests. Exactly the same sets of tests (described in the next section) were used in both semesters.
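The counterbalanced assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `assign_forms`, the student IDs, and the even split of each section between the two forms are hypothetical and are not taken from the study, which says only that assignment was random within sections.

```python
import random

def assign_forms(student_ids, seed=None):
    """Randomly assign each student in one section to Form A or B for the
    pretest, then to the opposite form for the posttest.

    Assumption: the section is split as evenly as possible between forms.
    """
    rng = random.Random(seed)      # seeded generator so a run can be repeated
    shuffled = list(student_ids)   # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    assignment = {}
    for position, student in enumerate(shuffled):
        pretest_form = "A" if position < half else "B"
        posttest_form = "B" if pretest_form == "A" else "A"
        assignment[student] = (pretest_form, posttest_form)
    return assignment

# Hypothetical section of six students
section = ["s01", "s02", "s03", "s04", "s05", "s06"]
forms = assign_forms(section, seed=1)
```

Because each student sees both forms, one at each administration, no student answers the same items twice, while every item is still answered at both the pretest and the posttest.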

Procedures and Materials

The objectives and resulting criterion-referenced tests differed in organization and form for the two courses, but the process involved in developing the tests was the same and followed these steps:

1. In each case, the process began with a thorough needs analysis for each course.
2. The tentative sets of objectives were established.
3. The items were written to match the objectives by the teachers in each level working together as part of their overall curriculum development responsibilities.
4. The items were piloted in the students' classrooms by administering them during the second week of class and again during final examination week.
5. The tests were revised by selecting those items which worked well and eliminating or modifying the other items. In some cases, the curriculum was modified as well, based on what we learned from the criterion-referenced tests.
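One common way to judge whether an item "worked well" in a pretest-posttest pilot like the one in steps 4 and 5 is the difference index: item facility (the proportion answering correctly) at the posttest minus item facility at the pretest. The sketch below is a minimal illustration, not the procedure the teachers actually followed; the function names and the response data are hypothetical.

```python
def item_facility(responses):
    """Proportion of examinees answering each item correctly.

    responses: list of rows, one per examinee, each a list of 0/1 item scores.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_examinees
            for i in range(n_items)]

def difference_index(pretest, posttest):
    """Posttest item facility minus pretest item facility, item by item."""
    pre_if = item_facility(pretest)
    post_if = item_facility(posttest)
    return [post - pre for pre, post in zip(pre_if, post_if)]

# Hypothetical 0/1 scores for four examinees on three items
pretest = [[0, 1, 0], [0, 1, 1], [1, 1, 0], [0, 1, 0]]
posttest = [[1, 1, 0], [1, 1, 0], [1, 1, 1], [0, 1, 0]]

di = difference_index(pretest, posttest)
for number, value in enumerate(di, start=1):
    print(f"Item {number}: difference index = {value:+.2f}")
```

An item with a large positive difference index is one that many more students answered correctly after instruction. An index near zero suggests the item was already easy, remained too hard, or did not match what was taught.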

Recall that the CRTs written by the teachers differed in organization and form in the intermediate and advanced reading courses. Although all decisions about test methods and content were the responsibility of the teachers, the most common method used in the reading courses was the multiple-choice format. For instance, a typical multiple-choice item might be the following "inference" item, which was used in the directions on the lower-level reading course test:

Out of the darkness of the cold, wintry night came the clatter of a toppled garbage can lid. Startled, Peter dropped his book and ran to the back door.
Ex.1 What was Peter doing before he heard the noise?

A. singing
B. reading
C. washing
D. sleeping

Of course, the passages in the tests themselves were much longer and much more academic in content.

Unfortunately, because of practical constraints (particularly the need to turn the scoring around quickly so the tests would be maximally useful), we have tended to favor test formats that can be scored by machine. However, as we learn more about our criterion-referenced tests and how they relate to the objectives of each course, we will no doubt gain confidence and begin experimenting with more imaginative test types. For example, we have recently begun to experiment with task-based subtests that students must do on their own time at the library. The plan is to assign such library tasks during the first and last weeks of classes and score them in conjunction with the students’ in-class diagnostic tests and final achievement tests. Their answers on these tasks will probably be scored for accuracy and completeness.

Analyses

A variety of testing statistics was used to analyze our reading tests. We adopted statistics liberally from the educational and psychological measurement literatures on (a) classical test theory (traditionally used for norm-referenced testing), (b) criterion-referenced testing, (c) generalizability theory (G-theory), and (d) item response theory (IRT). All analyses were done on an IBM computer using a spreadsheet program called QuattroPro (Borland, 1989) and a test analysis program called TESTAT (SYSTAT, 1987). These programs were neither expensive nor difficult to use, so the technology required is well within the reach of most language programs, both in the resources needed to acquire it and in the abilities needed to use it. Four sets of analyses were used: descriptive statistics, item statistics, consistency estimates, and validity strategies, as follows:

1. Descriptive statistics included the mean, standard deviation, range, number of items, and number of subjects.

2. Item statistics included traditional NRT statistics (like item facility and item discrimination), CRT statistics (including the difference index, item phi (φ), the B-index, and item agreement), and IRT statistics (including item difficulty and discrimination estimates).

3. Consistency estimates included NRT methods (Cronbach alpha, split-half adjusted, and Guttman estimates) and CRT methods (including both the phi domain-score dependability index and the phi(lambda) squared-error loss agreement coefficient). Two other consistency indicators were also used: the NRT standard error of measurement and the analogous CRT confidence intervals.

4. Validity was considered in terms of content validity and construct validity. The latter strategy involved both the intervention and differential groups perspectives.
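Several of the statistics listed above are simple enough to compute by hand or in a few lines of code. The sketch below illustrates three of them using standard textbook formulas: Cronbach alpha, the NRT standard error of measurement, and the B-index (item facility among examinees who passed the test minus item facility among those who failed). It is not a reconstruction of what QuattroPro or TESTAT did in this study. The function names, the score matrix, and the cut score of 3 are hypothetical, and population variances are used throughout as an assumption.

```python
import math
import statistics

def cronbach_alpha(scores):
    """Cronbach alpha for a matrix of item scores (rows are examinees)."""
    k = len(scores[0])
    item_variances = [statistics.pvariance([row[i] for row in scores])
                      for i in range(k)]
    total_variance = statistics.pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

def standard_error_of_measurement(scores, reliability):
    """SEM = standard deviation of total scores * sqrt(1 - reliability)."""
    sd = statistics.pstdev([sum(row) for row in scores])
    return sd * math.sqrt(1 - reliability)

def b_index(scores, cut_score):
    """Item facility for passers minus item facility for failers.

    Assumes at least one examinee falls on each side of the cut score.
    """
    passers = [row for row in scores if sum(row) >= cut_score]
    failers = [row for row in scores if sum(row) < cut_score]
    k = len(scores[0])
    return [sum(row[i] for row in passers) / len(passers)
            - sum(row[i] for row in failers) / len(failers)
            for i in range(k)]

# Hypothetical 0/1 scores: five examinees, four items
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(scores)
sem = standard_error_of_measurement(scores, alpha)
b_values = b_index(scores, cut_score=3)  # hypothetical cut score

print(f"alpha = {alpha:.3f}, SEM = {sem:.3f}")
print("B-index per item:", [round(b, 3) for b in b_values])
```

The B-index needs only a single administration plus a cut score, which makes it useful when pretest data are not available. The difference index, by contrast, requires both a pretest and a posttest.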