Sample report: Results

Descriptive Statistics

Table 1 presents the descriptive statistics for this project. The different statistics
are labeled by headings across the top of the table including: the number of students
on each test (N), the number of items on each (k), the mean (M),
the standard deviation (SD), minimum score (Min.), maximum score (Max.), and the
range. The reading course tests are labeled in the first column. The rows are
ordered first according to course (ELI 72 or ELI 82), then according to whether
the test was a pretest or posttest, then according to form (A or B).

Table 1: Descriptive Statistics

| Reading course / Test | N | k | M | SD | Min. | Max. | Range |
| ELI 72 | | | | | | | |
| PreA | 35 | 46 | 31.11 | 5.18 | 15 | 39 | 25 |
| PreB | 29 | 46 | 30.90 | 5.47 | 16 | 41 | 26 |
| PostA | 26 | 46 | 34.73 | 4.86 | 20 | 42 | 23 |
| PostB | 35 | 46 | 33.57 | 3.81 | 28 | 41 | 14 |
| ELI 82 | | | | | | | |
| PreA | 87 | 34 | 21.05 | 3.95 | 13 | 31 | 19 |
| PreB | 65 | 34 | 21.26 | 3.92 | 10 | 30 | 21 |
| PostA | 63 | 34 | 23.44 | 3.94 | 14 | 31 | 18 |
| PostB | 67 | 34 | 23.12 | 3.90 | 14 | 31 | 18 |
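
As a rough illustration of how the columns in Table 1 relate to raw scores, the following sketch computes the same statistics for one test form. The function name `describe` and the sample scores are hypothetical, and the use of the population formula for the SD is an assumption, since the report does not say which formula was used. The range is computed inclusively (Max. - Min. + 1), which is consistent with the values in Table 1 (for example, 39 - 15 + 1 = 25).

```python
from statistics import mean, pstdev


def describe(scores, k):
    """Summarize the raw scores for one test form with k items."""
    lowest, highest = min(scores), max(scores)
    return {
        "N": len(scores),
        "k": k,
        "M": mean(scores),
        "SD": pstdev(scores),  # assumption: population SD
        "Min": lowest,
        "Max": highest,
        "Range": highest - lowest + 1,  # inclusive range, as in Table 1
    }


# Hypothetical raw scores on a 46-item test (not data from this project).
example = describe([15, 28, 31, 33, 39], k=46)
print(example)
```
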

Mean Item Statistics

Table 2 presents the mean item statistics
for this project. Notice that the labels across the top first show that NRT, IRT,
and CRT estimates are all included. Naturally, examining the individual item statistics
was much more important for purposes of revising the tests, but mean item statistics
are the only practical means to provide readers with an overview of current item
quality of the tests in this project. The mean statistics for the NRT items included
classical test item facility and item discrimination indices.

Table 2: Mean Item Statistics

| Reading Course | NRT | | IRT | | | | CRT | | | |
| Test | IF | ID | P | Diff. | Disc. | Used | DI | Item Ø | B | A |
| ELI 72 | | | | | | | | | | |
| PreA | .68 | .23 | .67 | -1.46 | .38 | 45(35)* | .08 | .07 | .18 | .34 |
| PreB | .67 | .25 | .67 | -1.31 | .44 | 46(29)* | .06 | .09 | .23 | .36 |
| PostA | .76 | .20 | .74 | -2.01 | .42 | 44(26)* | .08 | .18 | .33 | .76 |
| PostB | .73 | .19 | .71 | -1.96 | .38 | 43(35)* | .06 | .00 | .00 | .73 |
| ELI 82 | | | | | | | | | | |
| PreA | .62 | .29 | .62 | -0.99 | .36 | 34(87) | .07 | .07 | .30 | .39 |
| PreB | .63 | .28 | .63 | -1.14 | .37 | 34(65) | .05 | .07 | .26 | .39 |
| PostA | .69 | .25 | .69 | -1.54 | .40 | 34(63) | .07 | .20 | .21 | .68 |
| PostB | .68 | .25 | .68 | -1.45 | .38 | 34(67) | .05 | .19 | .17 | .64 |

* Either an item or person (or both) was deleted because 0% or 100%.
** Cut-points were set at .90 for pretest decisions and .60 for posttests.

The
mean statistics for the NRT items indicate that our tests look statistically very
much like norm-referenced tests (say for placement purposes). In fact, if we set
about to select items for a revised test on the basis of these NRT item statistics,
the tests would eventually become relatively powerful NRTs. Instead, we have chosen
to use the two other types of item analysis (IRT and CRT) for revising our tests for criterion-referenced
purposes, and we will continue to do so.

Because of the relatively
small sample sizes involved here, the IRT item estimates are based on a one-parameter
model. Our purpose in using IRT was to include these item difficulty estimates
in our selection processes. Note that the mean IRT difficulty estimates (Diff.)
are negative in all cases, which indicates that the items were on average relatively
easy for the students; fortunately this was more true on the posttests than on
the pretests. Note also that the IRT discrimination estimates reported in Table
2 are the slopes (in a one-parameter IRT analysis, these are kept constant across
all items). Caution must be used in thinking about these IRT results because the
sample sizes are really too small to be appropriate even for the one-parameter
model. We would have been much more comfortable if we had had at least 100 students
for each form of each test. Our primary reason for using the IRT analyses
at all was so that we would be able to use the individual student ability estimates
for examining appropriate cut-points for pass/fail decisions. The mean ability
estimates were not given here because in all cases they were zero. We would also
eventually like to be able to set up an item bank for each of these courses, a
task for which IRT is particularly well suited.

The CRT item statistics
shown in Table 2 include the difference index (DI), item Ø, the
B-index (B), and the agreement index (A) (see Shannon &
Cliver, 1987; Berk, 1984b). The DI for each item is obtained by subtracting
its item facility on the pretest from its item facility on the
posttest (that is, DI = IFpost - IFpre). Item Ø is an estimate of the degree
to which the students' answers (right or wrong) are related to whether they
passed or failed the test. The B-index is the difference between the proportion
of correct answers on each item among students who passed the test and the proportion among students who failed.
The agreement statistic is "the proportion of consistent item-test outcomes"
with regard to the particular students who correctly answered the item and passed
the test, as compared to those who missed the item and failed the test. Hence
the item agreement statistic is similar to the agreement coefficient used to investigate
the overall consistency, or dependability, of tests when they are used for making
decisions (see Cohen 1960; Subkoviak 1980, 1988). Remember that these CRT item
statistics were used for individual item decisions, rather than as the mean item
statistics shown in Table 2. Note also that we calculated each of the cut-point
related item statistics for .50, .60, .70, .80, and .90 decisions. This strategy
ended up being very useful in thinking about item selection in terms of the types
of decisions we make with our tests, and the relative suitability of various cut-points
for our particular decisions. We make two types of decisions based on these tests.
Among other uses, the pretest results are used to identify those students who
were initially misplaced in order to move them up a level or exempt them. In contrast,
the posttest administrations are used mostly to judge whether students pass or
fail the course in question. The decision level was tentatively set at .90 for
pretest exemption decisions, and at about .60 for posttest pass or fail decisions.
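
The CRT item statistics described above can be sketched for a single item as follows. This is an illustration under stated assumptions, not the procedure actually used in this project: the function names, the data, and the rule that a student passes when the proportion score is at or above the cut-point are all hypothetical, and item Ø is computed here as the ordinary phi correlation between the item outcome and the pass/fail outcome.

```python
from math import sqrt


def difference_index(pre_item, post_item):
    """DI = IFpost - IFpre for one item (lists of 0/1 answers)."""
    return sum(post_item) / len(post_item) - sum(pre_item) / len(pre_item)


def cut_point_item_stats(item, totals, k, cut):
    """Return (item phi, B-index, agreement A) for one item at one cut-point.

    item   -- 0/1 answers to the item, one per student
    totals -- total raw test scores for the same students
    k      -- number of items on the test
    cut    -- cut-point as a proportion score (e.g. .60 or .90)
    """
    n = len(item)
    # Assumption: a student passes when the proportion score reaches the cut.
    passed = [t / k >= cut for t in totals]
    n_pass = sum(passed)
    n_fail = n - n_pass
    if n_pass == 0 or n_fail == 0:
        raise ValueError("everyone passed or everyone failed at this cut-point")

    right_pass = sum(1 for x, p in zip(item, passed) if x == 1 and p)
    right_fail = sum(1 for x, p in zip(item, passed) if x == 1 and not p)
    wrong_fail = n_fail - right_fail

    # B-index: item facility for passers minus item facility for failers.
    b_index = right_pass / n_pass - right_fail / n_fail
    # Agreement: proportion of consistent item-test outcomes.
    agreement = (right_pass + wrong_fail) / n
    # Item phi: correlation between item outcome and test outcome.
    p_item = sum(item) / n
    p_pass = n_pass / n
    denom = sqrt(p_item * (1 - p_item) * p_pass * (1 - p_pass))
    phi = (right_pass / n - p_item * p_pass) / denom if denom else 0.0
    return phi, b_index, agreement


# Hypothetical data: eight students on a 10-item test, cut-point .60.
item = [1, 1, 1, 0, 1, 0, 0, 0]
totals = [9, 8, 7, 7, 5, 4, 3, 2]
phi, b_index, agreement = cut_point_item_stats(item, totals, k=10, cut=0.60)
di = difference_index([0, 1, 0, 0], [1, 1, 0, 1])
print(phi, b_index, agreement, di)
```

Calling the function once per cut-point (for example at .50, .60, .70, .80, and .90) would show how the same item behaves for different kinds of decisions.
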
The statistics shown in Table 2 are therefore based on .90 for pretests and .60
for posttests. Ultimately, in the test revision process, we would like to keep
those items which are strong for both types of decisions (pretest exemption and
posttest pass/fail decisions).

Consistency Estimates

Both NRT
reliability statistics and CRT dependability estimates are presented in Table
3. The NRT reliability estimates included here are the Cronbach alpha coefficient,
the split-half method (adjusted by the Spearman-Brown prophecy formula) estimate,
and the Guttman coefficient. Notice that these NRT coefficients are generally
fairly low, ranging from .529 to .822. However, note also that the ranges of talent
in courses such as these will generally be severely restricted by previous NRT
selection procedures for admissions and placement. As Brown (1984b), Ebel (1979),
and others have demonstrated, even an otherwise well-developed test will be unreliable
if the range of talent is depressed in the testing population. Given that framework,
the reliability estimates produced by these tests appear a bit more respectable,
even from an NRT perspective. Another approach to reliability estimation is the
standard error of measurement (SEM), which provides an estimate of the dispersion
of errors around a particular score or cut point. The SEM is presented just to
the right of the NRT reliability estimates just discussed. In this case, the SEM
is based on the odd-even, or split-half (adjusted), coefficients. Note that the
NRT perspective on reliability presented in this paragraph is largely irrelevant
to this project because the tests in question are CRTs. These statistics are presented
primarily for purposes of comparison and reference.

Table 3: Reliability and Dependability

| Reading Course | NRT | | | | CRT | | |
| Test | Alpha | Odd-even | Guttman | SEM | phi | phi(lambda)* | CI |
| ELI 72 | | | | | | | |
| PreA | .704 | .814 | .809 | 2.234 | .674 | .928 | .068 |
| PreB | .750 | .822 | .816 | 1.879 | .713 | .932 | .068 |
| PostA | .713 | .785 | .784 | 2.253 | .691 | .892 | .062 |
| PostB | .573 | .661 | .660 | 2.218 | .497 | .823 | .065 |
| ELI 82 | | | | | | | |
| PreA | .575 | .651 | .651 | 2.334 | .541 | .927 | .082 |
| PreB | .586 | .722 | .722 | 2.068 | .546 | .925 | .082 |
| PostA | .617 | .531 | .529 | 2.695 | .584 | .719 | .078 |
| PostB | .587 | .650 | .649 | 2.305 | .562 | .686 | .079 |

* Cut-points were set at .90 for pretest decisions and .60 for posttests.

From
the CRT point of view, the phi and phi(lambda) coefficients are much more interesting.
The phi coefficients provide domain score estimates of the dependability of our
CRTs. The phi(lambda) coefficients provide decision consistency estimates based
on the squared-error loss agreement approach (see Berk 1980c, 1984). Both the
phi and phi(lambda) used here are based on the short-cut formulas presented in
Brown (1990a). Like the CRT item statistics, the phi(lambda) estimates for the
pretests are based on .90 cut-points, while those for the posttests are based
on .60 cut-points. Another statistic called the confidence interval (CI)
is shown in the last column on the right side of Table 3. The CI is analogous
to the SEM statistic and is the appropriate statistic for analyzing CRTs. The
CIs for the tests in this project ranged from .062 to .082. The CI is best interpreted
as the proportion of error that would be accounted for with 68 percent confidence
around an individual's proportion score. For example, the CI in the upper-right
corner of Table 3 would indicate that a person receiving a proportion score of
.70 (or 70 percent) would score within plus or minus one CI, or a band from .632
(.70 - .068 = .632) to .768 (.70 + .068 = .768) 68 percent of the time. In percent
score terms, this would simply be a band between 63.2 percent and 76.8 percent.
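
The band arithmetic in this example, and the analogous NRT band built from the SEM, can be sketched as below. The SEM formula used here (the SD times the square root of one minus the reliability) is the conventional one and is an assumption, since the report does not state its formula; it does reproduce the ELI 72 PreA value in Table 3 from the SD in Table 1 and the odd-even coefficient. The function names are hypothetical.

```python
from math import sqrt


def sem(sd, reliability):
    """Conventional standard error of measurement (assumed formula)."""
    return sd * sqrt(1 - reliability)


def band(score, error, n_errors=1):
    """Band of plus or minus n_errors error units around a score."""
    return (score - n_errors * error, score + n_errors * error)


# ELI 72 PreA: SD = 5.18 (Table 1), odd-even reliability = .814 (Table 3).
pre_a_sem = sem(5.18, 0.814)
print(round(pre_a_sem, 3))  # 2.234

# CI band around a proportion score of .70 with CI = .068 (Table 3).
low, high = band(0.70, 0.068)
print(round(low, 3), round(high, 3))  # 0.632 0.768
```
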
Note that the confidence interval is derived from a generalizability theory statistic
called the absolute error variance component (see Bolus, Hinofotis & Bailey
1982; Brennan 1980, 1984; Brown 1984c; Brown & Bailey 1984).

Validity

At least two validity strategies are practical and appropriate in studying
CRTs: the content and construct approaches. This section of the paper will address
both of these approaches. Content validity will be discussed in terms of how items
were developed and construct validity will be examined in terms of differential
groups and intervention studies that were conducted.

Content validity approaches all require, in one form or another, the systematic examination of
the degree to which a set of test items approximates whatever content or abilities
the test was originally designed to assess. At UHM, the items for the CRTs are
always written to closely match the objectives of the courses. Since those objectives
are the domain that is being tested, every effort is being made at the item writing
level to match the items with the content and skills being taught in the courses.
Hence, content validity is an integral part of the item development process. Since
the course teachers do the actual item writing with feedback from the ELI director
and lead teachers, the items are not only designed to match the objectives, but
also to match those objectives as they are actually taught in our classrooms.
Once we are fairly comfortable with the tests in terms of dependability and validity,
we should turn to outside "experts" in order to get independent judgments
of how well the items match our objectives.

Construct validity approaches
all require experimental demonstration of the degree to which a test is measuring
the psychological construct it claims to be measuring. Such experimental demonstrations
may take many shapes, but for CRT development, intervention and differential groups
studies are probably the most appropriate and practical.

An intervention
study involves administering a pretest, then teaching the students whatever
construct is involved and testing them again after the instruction. If the test
is truly measuring the construct, the students' scores should be significantly
higher on the posttest than they were on the pretest. This whole procedure, called
an intervention study, can provide one argument for the construct validity of
a test. In the present project, differences were found between pretest and
posttest means for both forms of the tests in each course; these results indicate
that some effect on the scores existed due to instruction. These differences,
or gains, ranged from five to eight percent as indicated by the average difference
indices (DI) reported in Table 2. The actual gains experienced by the students
who took our courses were considerably higher for two reasons:

1. The results of this project include all students who took the pretest and posttest administrations. Those students who scored high on the pretest and were exempted from the courses did not take the posttest. Because the exempted students by definition scored high on the pretest, but do not figure into the posttest results, they would have the effect of diminishing the observed differences. In future analyses, such exempted students will be deleted from the pretest analyses, that is, only students who actually received instruction will be included in the analysis.
2. The tests being analyzed here have not yet been extensively revised to select those items which are most sensitive to instruction. When those items with the largest difference indices are selected and the tests are revised, we expect much larger gains to be found by the tests. This does not mean that the students will be learning more, but rather that the tests will be assessing the courses more sensitively.

The reader
should note that we cannot take full credit for the observed gains, nor can we
attribute them solely to the effects of our courses. The reality is that students
were simultaneously being exposed to English from many other sources on a daily
basis. In addition, many students were simultaneously taking other ELI courses
which could have influenced their learning of English. Nevertheless, we can say
that the observed differences between pretests and posttests reflect gains due
to the total English language experience that students had during that semester
at UHM. For the above reasons, we felt it would be premature to perform
statistical analyses of the current differences (using say t-tests or F
ratios), especially before addressing the issues described in 1. and 2. above.
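
As a purely descriptive check (not a significance test), the gains of five to eight percent cited above can be recovered from the means in Table 1: when every item is included, the mean DI equals the difference between the posttest and pretest means divided by the number of items. The sketch below uses the Table 1 values; the variable and function names are hypothetical.

```python
# (pretest mean, posttest mean, number of items k), taken from Table 1.
TABLE_1_MEANS = {
    "ELI 72 Form A": (31.11, 34.73, 46),
    "ELI 72 Form B": (30.90, 33.57, 46),
    "ELI 82 Form A": (21.05, 23.44, 34),
    "ELI 82 Form B": (21.26, 23.12, 34),
}


def mean_gain(pre_mean, post_mean, k):
    """Gain as a proportion of the k items: (Mpost - Mpre) / k."""
    return (post_mean - pre_mean) / k


gains = {
    form: round(mean_gain(*values), 2)
    for form, values in TABLE_1_MEANS.items()
}
print(gains)
```

The rounded results (.08, .06, .07, and .05) agree with the mean DI values reported in Table 2.
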
Nevertheless, the intervention study approach to the construct validity of our
CRTs is important to us. At the test level, we are examining gains in terms of
the test means. At the item level, we are selecting the items that will remain
on future revised versions of the tests based on the difference index (a kind
of item by item intervention study). Thus we consider the construct validity approach
particularly important for ensuring that our CRTs are solidly related to the learning
that is going on in our courses.

A differential groups study involves
administering a test to two groups of students: one group that possesses the construct
in question (sometimes called masters), and another group that lacks the
construct (also known as nonmasters) (see Brown 1984c for an example of
this approach). In this project, we have conducted differential groups studies
in two ways:

1. First, we have compared the scores of students who passed the courses (masters) with scores of students who failed (nonmasters). Quite naturally, we found large differences between these two groups because passing or failing was partly determined by the test itself.
2. Second, we have examined the individual and mean item Ø, B-index, and A estimates. These statistics indicated a fairly strong relationship between the accuracy of the students' answers on individual items and whether or not they passed the course (i.e., whether they were masters or nonmasters).

Differential
groups studies are especially important in thinking about the degree to which
pass/fail decisions are valid and fair. Hence, they are integrally related to
consequential validity.

We have found one aspect of CRT validity to be
particularly satisfying: the fact that content, intervention and differential
groups strategies are built directly into the item development and item analysis
processes. Thus item selection and test revision are integrally related to analyzing
and improving test validity. As always with issues of validity, the goal is to
marshal evidence from a variety of sources so that, collectively, they can be
used to investigate and support the validity of the test in question.