ELT: Criterion-referenced Language Testing

Sample report: Discussion

To summarize, the results of this project indicate that our CRTs for the ELI reading courses are functioning reasonably well. The two forms appear to be working about equally well at both levels of our reading courses. Unfortunately, they are acting more like norm-referenced tests than criterion-referenced ones: the means appear to be fairly well centered and the scores are well dispersed around the means. In general, from a criterion-referenced point of view, all four tests (i.e., both forms in both courses) should be too difficult for the students when they take the test at the beginning of the course and relatively easy at the end of the course. If the tests were difficult for the students at the beginning, i.e., if they performed relatively poorly, that would indicate that they do indeed need to study the objectives being tested. In contrast, if the tests were relatively easy for the students at the end of the course, that would indicate that they had learned a fair amount of the material. Some of the pattern just described did appear in our results in the form of gains on each and every test, but a stronger, clearer pattern would be more compelling in terms of defending both the tests and the curriculum they were designed to assess. In our defense, other programs do not appear to have this problem largely because they never address it through systematic testing as we have. Also, the item selection processes and revisions that we have set in motion are designed to improve this situation so that (a) our tests will better reflect whatever learning is going on in the courses and (b) the tests will better help us make fair decisions about exemptions (at the beginning of the course) and about whether students pass or fail our courses (at the end of the course).

To those ends, all of the various item statistics are turning out to be very useful. The NRT item statistics are telling us how our tests are functioning in terms with which we have long been familiar. In addition, the NRT statistics may turn out to be useful for converting items that do not prove useful in the criterion-referenced tests into items for our NRT placement tests. In like manner, we expect the IRT analyses to be useful in setting up item banks and in improving our pass/fail decisions. Our four tests also appear to be at least moderately reliable from the NRT perspective, especially in light of the restrictions of range that are involved in such testing. From a criterion-referenced point of view, the tests appear to be moderately consistent in terms of domain score dependability (as shown by the phi coefficients). However, the dependability of these tests for decision making, as estimated by phi(lambda), depends heavily on which decision is involved. The phi(lambda) estimates for the pretest exemption decisions (with the lambda decision level at .90) are all excellent, ranging from .925 to .932. The phi(lambda) estimates for the posttest achievement pass/fail decisions (with the lambda decision level at .60) are lower, ranging from .686 to .892. Further analysis of these results must be considered when we are making the actual pass/fail decisions, now and with revised versions of the tests. Furthermore, we must use the confidence interval statistics and obtain additional information about students who fall close to our cut-points, especially those students who fall within one CI above or below the cut-point. Clearly, for pass/fail decisions, the CRT dependability approaches and the CIs are much more useful than the analogous NRT reliability estimates reported in the same table.
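
To make the role of the decision level concrete, the following is a minimal Python sketch of one commonly cited short-cut estimate of phi(lambda), usually attributed to Brennan. It is an illustration only: this discussion does not state which computation produced the estimates above, and the function names (phi_lambda, near_cut), the five scores, the 30-item test length, and the CI width of .08 are all hypothetical. The second function flags students whose scores fall within one CI of the cut-point, as recommended above.

```python
from statistics import mean, pvariance


def phi_lambda(proportion_scores, n_items, cut):
    """Short-cut phi(lambda) estimate from proportion-correct scores.

    proportion_scores: each student's score as a proportion (0.0 to 1.0)
    n_items: number of items on the test
    cut: the lambda decision level, e.g. 0.90 or 0.60
    """
    m = mean(proportion_scores)
    s2 = pvariance(proportion_scores)  # variance with n (not n - 1) in the denominator
    return 1 - (1 / (n_items - 1)) * ((m * (1 - m) - s2) / ((m - cut) ** 2 + s2))


def near_cut(proportion_scores, cut, ci):
    """Return the positions of students whose score lies within one CI of the cut-point."""
    return [i for i, p in enumerate(proportion_scores) if abs(p - cut) <= ci]


# Hypothetical data: five students on a 30-item test (not taken from the report).
scores = [0.5, 0.6, 0.7, 0.8, 0.9]
exemption = phi_lambda(scores, 30, 0.90)   # cut-point far from the mean
pass_fail = phi_lambda(scores, 30, 0.60)   # cut-point closer to the mean
flagged = near_cut(scores, 0.60, 0.08)     # students needing additional information
```

In this formula the squared distance between the mean and the cut-point appears in the denominator, so the estimate rises as the cut-point moves away from the mean. That property is consistent with the pattern reported above, where the .90 exemption level yields higher estimates than the .60 pass/fail level, though it may not be the only factor involved.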

We must also keep the validity issue alive and not rest on our laurels. Yes, at the moment, we can say with some pride that all of the test items in this project were thoroughly scrutinized for content validity by the appropriate ELI teachers. We can also say that all four tests in this project showed some sensitivity to instruction. We should nonetheless learn from these experiences and revise the tests so they will be even more sensitive to instruction. We can do this not only by selecting those items that had high CRT item analysis statistics for future versions of the tests, but also by carefully examining the curriculum to ensure that (a) objectives the students already know when they arrive are no longer included in the curriculum (and tests) and (b) the lessons in the reading courses are effectively addressing all of the objectives that are being tested.
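
As an illustration of how item-level sensitivity to instruction might be checked, the sketch below computes a difference index for each item: the proportion of students answering correctly on the posttest minus the same proportion on the pretest. This is an assumption made for illustration, since the discussion above does not name the CRT item statistics that were used. The function names and the small response matrices are hypothetical.

```python
def item_facility(responses):
    """Proportion of students answering each item correctly.

    responses: one list per student, holding 1 (correct) or 0 (incorrect) per item
    """
    n_students = len(responses)
    return [sum(item) / n_students for item in zip(*responses)]


def difference_index(pretest, posttest):
    """Posttest item facility minus pretest item facility, item by item."""
    pre = item_facility(pretest)
    post = item_facility(posttest)
    return [after - before for before, after in zip(pre, post)]


# Hypothetical responses from four students on three items (not taken from the report).
pretest = [[0, 1, 0], [0, 1, 1], [1, 1, 0], [0, 1, 0]]
posttest = [[1, 1, 0], [1, 1, 1], [1, 1, 0], [0, 1, 0]]
di = difference_index(pretest, posttest)
```

In this made-up example the first item shows a clear gain, the second is answered correctly by everyone on both occasions (an objective the students may already know on arrival), and the third shows no gain at a low level (an objective the lessons may not be addressing). These are the two situations that points (a) and (b) above ask us to look for in the curriculum.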