Sample report: Results

Descriptive Statistics

Table 1 presents the descriptive statistics for this project. The statistics are labeled by headings across the top of the table: the number of students taking each test (N), the number of items on each test (k), the mean (M), the standard deviation (SD), the minimum score (Min.), the maximum score (Max.), and the range. The reading course tests are labeled in the first column. The rows are ordered first according to course (ELI 72 or ELI 82), then according to whether the test was a pretest or posttest, then according to form (A or B).

Table 1: Descriptive Statistics

Reading course
Test        N     k     M       SD     Min.   Max.   Range
ELI 72
  PreA      35    46    31.11   5.18   15     39     25
  PreB      29    46    30.90   5.47   16     41     26
  PostA     26    46    34.73   4.86   20     42     23
  PostB     35    46    33.57   3.81   28     41     14
ELI 82
  PreA      87    34    21.05   3.95   13     31     19
  PreB      65    34    21.26   3.92   10     30     21
  PostA     63    34    23.44   3.94   14     31     18
  PostB     67    34    23.12   3.90   14     31     18
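
The Range column in Table 1 is consistent with Max. minus Min. plus one (for example, 39 - 15 + 1 = 25 for ELI 72 PreA). The following is a minimal sketch of how such a summary row could be computed. The function name and the scores are hypothetical, and the use of the population standard deviation is an assumption, since the report does not say which form was used.

```python
import statistics

def describe(scores, k):
    """Build a Table 1 style summary row for one test form.

    scores -- raw total scores, one per student
    k      -- number of items on the test
    """
    lo, hi = min(scores), max(scores)
    return {
        "N": len(scores),
        "k": k,
        "M": round(statistics.mean(scores), 2),
        # Assumption: population SD; swap in statistics.stdev for the sample SD.
        "SD": round(statistics.pstdev(scores), 2),
        "Min": lo,
        "Max": hi,
        # Inclusive range, matching the convention the table appears to use.
        "Range": hi - lo + 1,
    }

# Hypothetical scores for illustration only, not the project's data.
summary = describe([15, 28, 31, 33, 39], k=46)
```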

Mean Item Statistics

Table 2 presents the mean item statistics for this project. Notice that the labels across the top show that NRT, IRT, and CRT estimates are all included. Naturally, examining the individual item statistics was much more important for purposes of revising the tests, but mean item statistics are the only practical means of providing readers with an overview of the current item quality of the tests in this project. The mean statistics for the NRT items included the classical item facility (IF) and item discrimination (ID) indices.

Table 2: Mean Item Statistics

Reading          NRT            IRT                                   CRT
course
Test          IF     ID      P      Diff.    Disc.   Used         DI     Item phi   B      A
ELI 72
  PreA        .68    .23     .67    -1.46    .38     45 (35)*     .08    .07        .18    .34
  PreB        .67    .25     .67    -1.31    .44     46 (29)*     .06    .09        .23    .36
  PostA       .76    .20     .74    -2.01    .42     44 (26)*     .08    .18        .33    .76
  PostB       .73    .19     .71    -1.96    .38     43 (35)*     .06    .00        .00    .73
ELI 82
  PreA        .62    .29     .62    -0.99    .36     34 (87)      .07    .07        .30    .39
  PreB        .63    .28     .63    -1.14    .37     34 (65)      .05    .07        .26    .39
  PostA       .69    .25     .69    -1.54    .40     34 (63)      .07    .20        .21    .68
  PostB       .68    .25     .68    -1.45    .38     34 (67)      .05    .19        .17    .64

* Either an item or person (or both) was deleted because 0% or 100%.
** Cut-points were set at .90 for pretest decisions and .60 for posttests.

The mean statistics for the NRT items indicate that our tests look statistically very much like norm-referenced tests (say, for placement purposes). In fact, if we set about selecting items for a revised test on the basis of these NRT item statistics, the tests would eventually become relatively powerful NRTs. Instead, we have chosen to use the two other types of item analysis (IRT and CRT) for revising our tests for criterion-referenced purposes, and we will continue to do so.

Because of the relatively small sample sizes involved here, the IRT item estimates are based on a one-parameter model. Our purpose in using IRT was to include these item difficulty estimates in our selection processes. Note that the mean IRT difficulty estimates (Diff.) are negative in all cases, which indicates that the items were on average relatively easy for the students; fortunately this was more true on the posttests than on the pretests. Note also that the IRT discrimination estimates reported in Table 2 are the slopes (in a one-parameter IRT analysis, these are kept constant across all items). Caution must be used in thinking about these IRT results because the sample sizes are really too small to be appropriate even for the one-parameter model. We would have been much more comfortable if we had had at least 100 students for each form of each test.
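
To make the reading of negative difficulty estimates concrete, here is a minimal sketch of a one-parameter logistic item response function. It assumes the common logistic form with a single slope shared by all items; the function name and the slope value of 1.0 are placeholders, and the scaling used by the project's actual IRT software is not stated in the report.

```python
import math

def p_correct(theta, b, a=1.0):
    """Probability of a correct answer under a one-parameter logistic model.

    theta -- student ability
    b     -- item difficulty
    a     -- common slope, held constant across items (placeholder value)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A student whose ability equals the item's difficulty has an even chance.
p_mid = p_correct(0.0, b=0.0)

# With a negative difficulty (here the mean Diff. for ELI 72 PreA),
# a student at theta = 0 is more likely than not to answer correctly.
p_easy = p_correct(0.0, b=-1.46)

# The more negative the difficulty, the easier the item (ELI 72 PostA mean).
p_easier = p_correct(0.0, b=-2.01)
```

Under these assumptions, the more negative mean difficulties on the posttests correspond to higher probabilities of success for a student of the same ability.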

Our primary reason for using the IRT analyses at all was so that we would be able to use the individual student ability estimates for examining appropriate cut-points for pass/fail decisions. The mean ability estimates were not given here because in all cases they were zero. We would also eventually like to be able to set up an item bank for each of these courses—a task for which IRT is particularly well-suited.

The CRT item statistics shown in Table 2 include the difference index (DI), item phi (φ), the B-index (B), and the agreement index (A) (see Shannon & Cliver, 1987; Berk, 1984b). The DI for each item is obtained by subtracting the item’s facility on the pretest from its facility on the posttest (that is, DI = IFpost - IFpre). Item phi is an estimate of the degree to which the students’ answers (right or wrong) are related to whether they passed or failed the test. The B-index is the difference between the proportion of correct answers on the item among students who passed the test and the proportion among students who failed. The agreement statistic is "the proportion of consistent item-test outcomes," that is, the proportion of students who either answered the item correctly and passed the test or missed the item and failed the test. Hence the item agreement statistic is similar to the agreement coefficient used to investigate the overall consistency, or dependability, of tests when they are used for making decisions (see Cohen 1960; Subkoviak 1980, 1988).
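
The four CRT item statistics described above can be sketched for a single item as follows. This is an illustration under stated assumptions rather than the project's own procedure: the function names and the response data are hypothetical, and item phi is taken to be the ordinary phi coefficient between the right/wrong answer and the pass/fail outcome.

```python
import math

def difference_index(pre_item, post_item):
    """DI = IFpost - IFpre for one item (lists of 0/1 answers)."""
    return sum(post_item) / len(post_item) - sum(pre_item) / len(pre_item)

def cut_point_item_stats(item, passed):
    """Item phi, B-index, and agreement for one item at one cut-point.

    item   -- 0/1 per student: answered the item correctly?
    passed -- 0/1 per student: total score at or above the cut-point?
    """
    n = len(item)
    n_pass = sum(passed)
    p_item = sum(item) / n                                  # item facility
    p_test = n_pass / n                                     # proportion passing
    p_both = sum(i * t for i, t in zip(item, passed)) / n   # correct and passed

    # Phi is undefined when everyone (or no one) passes or answers correctly;
    # this sketch simply returns None in that case.
    if n_pass in (0, n) or p_item in (0.0, 1.0):
        return None

    if_pass = sum(i for i, t in zip(item, passed) if t) / n_pass
    if_fail = sum(i for i, t in zip(item, passed) if not t) / (n - n_pass)
    b_index = if_pass - if_fail
    # Consistent outcomes: (correct and passed) plus (missed and failed).
    agreement = p_both + (1 - p_item - p_test + p_both)
    phi = (p_both - p_item * p_test) / math.sqrt(
        p_item * (1 - p_item) * p_test * (1 - p_test))
    return {"phi": phi, "B": b_index, "A": agreement}

# Hypothetical answers from eight students on one item.
item = [1, 1, 1, 0, 1, 0, 0, 0]
passed = [1, 1, 1, 1, 0, 0, 0, 0]
stats = cut_point_item_stats(item, passed)

# Hypothetical pretest and posttest answers on the same item.
di = difference_index(pre_item=[0, 1, 0, 0], post_item=[1, 1, 0, 1])
```

Recomputing `passed` for each candidate cut-point gives a separate set of phi, B, and A values for each decision level.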

Remember that these CRT item statistics were used for individual item decisions, rather than in the form of the mean item statistics shown in Table 2. Note also that we calculated each of the cut-point related item statistics for .50, .60, .70, .80, and .90 decisions. This strategy ended up being very useful in thinking about item selection in terms of the types of decisions we make with our tests, and the relative suitability of various cut-points for our particular decisions. We make two types of decisions based on these tests. Among other uses, the pretest results are used to identify those students who were initially misplaced in order to move them up a level or exempt them. In contrast, the posttest administrations are used mostly to judge whether students pass or fail the course in question. The decision level was tentatively set at .90 for pretest exemption decisions, and at about .60 for posttest pass or fail decisions. The statistics shown in Table 2 are therefore based on .90 for pretests and .60 for posttests. Ultimately, in the test revision process, we would like to keep those items which are strong for both types of decisions (pretest exemption and posttest pass/fail decisions).

Consistency Estimates

Both NRT reliability statistics and CRT dependability estimates are presented in Table 3. The NRT reliability estimates included here are the Cronbach alpha coefficient, the split-half estimate (adjusted by the Spearman-Brown prophecy formula), and the Guttman coefficient. Notice that these NRT coefficients are generally fairly low, ranging from .529 to .822. However, note also that the ranges of talent in courses such as these will generally be severely restricted by previous NRT selection procedures for admissions and placement. As Brown (1984b), Ebel (1979), and others have demonstrated, even an otherwise well-developed test will be unreliable if the range of talent is depressed in the testing population. Given that framework, the reliability estimates produced by these tests appear a bit more respectable, even from an NRT perspective. Another approach to reliability estimation is the standard error of measurement (SEM), which provides an estimate of the dispersion of errors around a particular score or cut-point. The SEM is presented just to the right of the NRT reliability estimates just discussed. In this case, the SEM is based on the odd-even, or split-half (adjusted), coefficients. Note that the NRT perspective on reliability presented in this paragraph is largely irrelevant to this project because the tests in question are CRTs. These statistics are presented primarily for purposes of comparison and reference.
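
As a worked illustration of two of the quantities mentioned above, the sketch below applies the standard Spearman-Brown adjustment for a split-half correlation and the standard SEM formula, SEM = SD x sqrt(1 - reliability). The function names are hypothetical; the input values for the SEM example are the ELI 72 PreA standard deviation and adjusted odd-even coefficient reported in this section.

```python
import math

def spearman_brown(r_half):
    """Adjust a correlation between two half-tests to full-test length."""
    return 2 * r_half / (1 + r_half)

def sem(sd, reliability):
    """Standard error of measurement in raw-score units."""
    return sd * math.sqrt(1 - reliability)

# A half-test correlation of .50 corresponds to a full-length estimate of 2/3.
full_length = spearman_brown(0.50)

# ELI 72 PreA: SD = 5.18, adjusted odd-even coefficient = .814.
sem_pre_a = sem(5.18, 0.814)
```

With these inputs the SEM comes out very close to the 2.234 reported for ELI 72 PreA, which supports reading the SEM column as being in raw-score points.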

Table 3: Reliability and Dependability

Reading          NRT                                     CRT
course
Test          Alpha    Odd-even   Guttman   SEM      phi     phi(lambda)*   CI
ELI 72
  PreA        .704     .814       .809      2.234    .674    .928           .068
  PreB        .750     .822       .816      1.879    .713    .932           .068
  PostA       .713     .785       .784      2.253    .691    .892           .062
  PostB       .573     .661       .660      2.218    .497    .823           .065
ELI 82
  PreA        .575     .651       .651      2.334    .541    .927           .082
  PreB        .586     .722       .722      2.068    .546    .925           .082
  PostA       .617     .531       .529      2.695    .584    .719           .078
  PostB       .587     .650       .649      2.305    .562    .686           .079

* Cut-points were set at .90 for pretest decisions and .60 for posttests

From the CRT point of view, the phi and phi(lambda) coefficients are much more interesting. The phi coefficients provide domain score estimates of the dependability of our CRTs. The phi(lambda) coefficients provide decision consistency estimates based on the squared-error loss agreement approach (see Berk 1980c, 1984). Both the phi and phi(lambda) used here are based on the short-cut formulas presented in Brown (1990a). Like the CRT item statistics, the phi(lambda) estimates for the pretests are based on .90 cut-points, while those for the posttests are based on .60 cut-points.
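
The report does not print the short-cut formula itself. As a hedged sketch, the block below assumes the form in which it is commonly cited: phi(lambda) = 1 - [1/(k - 1)] x [(Mp(1 - Mp) - Sp^2) / ((Mp - lambda)^2 + Sp^2)], where Mp and Sp are the mean and standard deviation of the proportion scores, k is the number of items, and lambda is the cut-point. The function name is hypothetical, and the inputs are the ELI 72 PreA values from Table 1.

```python
def phi_lambda(mean_raw, sd_raw, k, cut):
    """Short-cut phi(lambda) estimate from test-level statistics.

    Assumption: this is the commonly cited short-cut form, not a formula
    quoted from the report.
    """
    m_p = mean_raw / k          # mean of the proportion scores
    s_p = sd_raw / k            # SD of the proportion scores
    error = m_p * (1 - m_p) - s_p ** 2
    return 1 - (1 / (k - 1)) * error / ((m_p - cut) ** 2 + s_p ** 2)

# ELI 72 PreA: M = 31.11, SD = 5.18, k = 46, pretest cut-point of .90.
pretest_estimate = phi_lambda(31.11, 5.18, 46, cut=0.90)

# The same test evaluated at a cut-point close to the mean proportion score.
near_mean_estimate = phi_lambda(31.11, 5.18, 46, cut=0.70)
```

Under this assumption the pretest estimate lands close to the .928 shown in Table 3; small differences are to be expected from rounding of the inputs and from the choice of standard deviation formula. The comparison also shows why phi(lambda) depends on the cut-point: decisions are more consistent when the cut-point lies far from the bulk of the scores.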

Another statistic called the confidence interval (CI) is shown in the last column on the right side of Table 3. The CI is analogous to the SEM statistic and is the appropriate statistic for analyzing CRTs. The CIs for the tests in this project ranged from .062 to .082. The CI is best interpreted as the proportion of error that would be accounted for with 68 percent confidence around an individual’s proportion score. For example, the CI in the upper-right corner of Table 3 would indicate that a person receiving a proportion score of .70 (or 70 percent) would score within plus or minus one CI, or a band from .632 (.70 - .068 = .632) to .768 (.70 + .068 = .768), 68 percent of the time. In percent score terms, this would simply be a band between 63.2 percent and 76.8 percent. Note that the confidence interval is derived from a generalizability theory statistic called the absolute error variance component (see Bolus, Hinofotis & Bailey 1982; Brennan 1980, 1984; Brown 1984c; Brown & Bailey 1984).
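
The band arithmetic in the example above can be written as a small helper. The function name is hypothetical, and clipping the band to the 0 to 1 range is an added assumption for scores near the extremes.

```python
def score_band(proportion_score, ci):
    """Return the band of plus or minus one CI around a proportion score."""
    lower = max(0.0, proportion_score - ci)   # proportion scores cannot go below 0
    upper = min(1.0, proportion_score + ci)   # or above 1
    return round(lower, 3), round(upper, 3)

# Proportion score of .70 with the ELI 72 PreA CI of .068.
band = score_band(0.70, 0.068)
```

The same helper can be applied around a cut-point rather than an individual score, to see which students fall within one CI of a pass/fail boundary.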

Validity

At least two validity strategies are practical and appropriate in studying CRTs: the content and construct approaches. This section of the paper will address both of these approaches. Content validity will be discussed in terms of how items were developed and construct validity will be examined in terms of differential groups and intervention studies that were conducted.

Content validity approaches all require, in one form or another, the systematic examination of the degree to which a set of test items approximates whatever content or abilities the test was originally designed to assess. At UHM, the items for the CRTs are always written to closely match the objectives of the courses. Since those objectives are the domain that is being tested, every effort is being made at the item writing level to match the items with the content and skills being taught in the courses. Hence, content validity is an integral part of the item development process. Since the course teachers do the actual item writing with feedback from the ELI director and lead teachers, the items are not only designed to match the objectives, but also to match those objectives as they are actually taught in our classrooms. Once we are fairly comfortable with the tests in terms of dependability and validity, we should turn to outside "experts" in order to get independent judgments of how well the items match our objectives.

Construct validity approaches all require experimental demonstration of the degree to which a test is measuring the psychological construct it claims to be measuring. Such experimental demonstrations may take many shapes, but for CRT development, intervention and differential groups studies are probably the most appropriate and practical.

An intervention study involves administering a pretest, then teaching the students whatever construct is involved and testing them again after the instruction. If the test is truly measuring the construct, the students’ scores should be significantly higher on the posttest than they were on the pretest. This whole procedure, called an intervention study, can provide one argument for the construct validity of a test.

In the present project, differences were found between pretest and posttest means for both forms of the tests in each course; these results indicate that some effect on the scores existed due to instruction. These differences, or gains, ranged from five to eight percent as indicated by the average difference indices (DI) reported in Table 2. The actual gains experienced by the students who took our courses were considerably higher for two reasons:

1. The results of this project include all students who took the pretest and posttest administrations. Those students who scored high on the pretest and were exempted from the courses did not take the posttest. Because the exempted students by definition scored high on the pretest, but do not figure into the posttest results, they would have the effect of diminishing the observed differences. In future analyses, such exempted students will be deleted from the pretest analyses, that is, only students who actually received instruction will be included in the analysis.

2. The tests being analyzed here have not yet been extensively revised to select those items which are most sensitive to instruction. When those items with the largest difference indices are selected and the tests are revised, we expect much larger gains to be found by the tests. This does not mean that the students will be learning more, but rather that the tests will be assessing the courses more sensitively.
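
The effect described in point 1 can be illustrated with a small numeric sketch. The student labels and scores are entirely hypothetical; the point is only that a high-scoring pretest student who is exempted, and so never takes the posttest, pulls the pretest mean up and shrinks the observed gain.

```python
from statistics import mean

# Hypothetical raw scores; student "s4" scored high and was exempted.
pre = {"s1": 20, "s2": 22, "s3": 24, "s4": 32}
post = {"s1": 24, "s2": 25, "s3": 27}

# Gain as observed when all pretest takers are included.
gain_all = mean(list(post.values())) - mean(list(pre.values()))

# Gain when only students who actually received instruction are compared.
instructed = [s for s in pre if s in post]
gain_instructed = (mean([post[s] for s in instructed])
                   - mean([pre[s] for s in instructed]))
```

In this made-up case the gain for instructed students is several times the gain computed over everyone, which is the direction of bias the first point describes.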

The reader should note that we cannot take full credit for the observed gains, nor can we attribute them solely to the effects of our courses. The reality is that students were simultaneously being exposed to English from many other sources on a daily basis. In addition, many students were simultaneously taking other ELI courses which could have influenced their learning of English. Nevertheless, we can say that the observed differences between pretests and posttests reflect gains due to the total English language experience that students had during that semester at UHM.

For the above reasons, we felt it would be premature to perform statistical analyses of the current differences (using, say, t-tests or F ratios), especially before addressing the issues described in 1. and 2. above. Nevertheless, the intervention study approach to the construct validity of our CRTs is important to us. At the test level, we are examining gains in terms of the test means. At the item level, we are selecting the items that will remain on future revised versions of the tests based on the difference index (a kind of item by item intervention study). Thus we consider the construct validity approach particularly important for ensuring that our CRTs are solidly related to the learning that is going on in our courses.

A differential groups study involves administering a test to two groups of students: one group that possesses the construct in question (sometimes called masters), and another group that lacks the construct (also known as nonmasters) (see Brown 1984c for an example of this approach). In this project, we have conducted differential groups studies in two ways:

1. First, we have compared the scores of students who passed the courses (masters) with scores of students who failed (nonmasters). Quite naturally, we found large differences between these two groups because passing or failing was partly determined by the test itself.

2. Second, we have examined the individual and mean item phi (φ), B-index, and agreement (A) estimates. These statistics indicated a fairly strong relationship between the accuracy of the students’ answers on individual items and whether or not they passed the course (i.e., whether they were masters or nonmasters).

Differential groups studies are especially important in thinking about the degree to which pass/fail decisions are valid and fair. Hence, they are integrally related to consequential validity.

We have found one aspect of CRT validity to be particularly satisfying: the fact that content, intervention and differential groups strategies are built directly into the item development and item analysis processes. Thus item selection and test revision are integrally related to analyzing and improving test validity. As always with issues of validity, the goal is to marshal evidence from a variety of sources so that, collectively, they can be used to investigate and support the validity of the test in question.
