A Comparison of Three Item Selection Methods in Criterion-Referenced Tests
This study compared three methods of selecting the best discriminating test items and the resultant test reliability of mastery/nonmastery classifications. These three methods were (a) the agreement approach, (b) the phi coefficient approach, and (c) the random selection approach. Test responses from 1,836 students on a 50-item physical science test were used, from which 90 distinct data sets were generated for analysis. These 90 data sets contained 10 replications of the combination of three different sample sizes (75, 150, and 300) and three different numbers of test items (15, 25, and 35). The results of this study indicated that the agreement approach was an appropriate method for selecting criterion-referenced test items at the classroom level, while the phi coefficient approach was an appropriate method at the district and/or state levels. The random selection method did not share these item-selection characteristics and produced the lowest reliabilities of the three approaches.
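The two non-random selection statistics named in the abstract can be sketched as follows. This is a minimal illustration, not the study's code; the function names and toy data are hypothetical. The agreement approach scores an item by how often a correct response coincides with mastery status, while the phi approach correlates the dichotomous item score with the dichotomous mastery classification.

```python
import numpy as np

def agreement_index(item, mastery):
    """Proportion of examinees whose item response (0/1) matches
    their mastery classification (0/1)."""
    item = np.asarray(item)
    mastery = np.asarray(mastery)
    return float(np.mean(item == mastery))

def phi_coefficient(item, mastery):
    """Phi coefficient between a dichotomous item and mastery status,
    which equals the Pearson correlation computed on 0/1 data."""
    item = np.asarray(item, dtype=float)
    mastery = np.asarray(mastery, dtype=float)
    return float(np.corrcoef(item, mastery)[0, 1])

# Toy responses for one item across eight examinees (illustrative only).
item = np.array([1, 1, 0, 0, 1, 0, 1, 1])
mastery = np.array([1, 1, 0, 1, 1, 0, 0, 1])

print(agreement_index(item, mastery))              # 0.75
print(round(phi_coefficient(item, mastery), 3))    # 0.467
```

Under either statistic, items are ranked and the highest-scoring ones retained; the abstract's finding is about which statistic is appropriate at which administrative level, not about the ranking mechanics themselves.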
Cross-Cultural Validity of the Test of Non-Verbal Intelligence
The purpose of this study was to investigate the extent to which a non-verbal test of intelligence, the Test of Non-Verbal Intelligence (TONI), may be used for assessing intellectual abilities of children in India. This investigation is considered important since current instruments used in India were developed several years ago and do not adequately reflect present standards of performance. Further, current instruments do not demonstrate adequate validity, as procedures for development and cultural transport frequently did not adhere to recommended guidelines for such practice. Data were collected from 91 normally achieving and 18 mentally retarded Indian children, currently enrolled in elementary schools. Data from an American comparison group were procured from the authors of the TONI. Subjects were matched on age, grade, and area of residence, and were drawn from comparable socioeconomic backgrounds. A literature review of the theoretical framework supporting cross-cultural measurement of intellectual ability, a summary of major instruments developed for cross-cultural use, non-verbal measures of intellectual ability in India, and issues in cross-cultural research are discussed, together with recommended methodology for test transport. Major findings are: (a) the factor scales derived from the Indian and American normally achieving groups indicate significant differences; (b) items 1, 3, 5, 8, 10, and 22 are biased against the Indian group, though overall item characteristic curves are not significantly different; (c) mean raw scores on the TONI are significantly different between second and third grade Indian subjects; and (d) mean TONI Quotients are significantly different between normally achieving and mentally retarded Indian subjects. It is evident that deletion of biased items and rescaling would be necessary for the TONI to be valid in the Indian context.
However, because it does discriminate between subjects at different levels of ability, adaptation for use in India is justified. It may prove to be a more …
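The item-bias finding above rests on comparing item behavior across groups at matched ability levels. The study itself worked from item characteristic curves; as a simpler stand-in, one common screen is the Mantel-Haenszel common odds ratio, computed across strata of total score. The sketch below is a hypothetical illustration of that idea, not the study's actual procedure, and the toy data are invented.

```python
import numpy as np

def mantel_haenszel_or(resp, group, total):
    """Mantel-Haenszel common odds ratio for one item.
    resp:  0/1 item responses; group: 0/1 group membership;
    total: total test score, used to stratify by ability.
    Values far from 1.0 flag potential differential item functioning."""
    resp, group, total = map(np.asarray, (resp, group, total))
    num = den = 0.0
    for s in np.unique(total):
        m = total == s
        r, g = resp[m], group[m]
        a = np.sum(r[g == 1] == 1)   # focal group, correct
        b = np.sum(r[g == 1] == 0)   # focal group, incorrect
        c = np.sum(r[g == 0] == 1)   # reference group, correct
        d = np.sum(r[g == 0] == 0)   # reference group, incorrect
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    return num / den if den else float("inf")

# Toy data: one score stratum; the focal group (1) answers the item
# correctly far less often than the reference group (0) at equal ability.
resp  = np.array([1, 0, 0, 0, 1, 1, 1, 0])
group = np.array([1, 1, 1, 1, 0, 0, 0, 0])
total = np.array([2, 2, 2, 2, 2, 2, 2, 2])

print(round(mantel_haenszel_or(resp, group, total), 3))  # 0.111
```

An odds ratio well below 1 for the focal group, as here, is the pattern the abstract describes for the six items found biased against the Indian group.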
The Characteristics and Properties of the Threshold and Squared-Error Criterion-Referenced Agreement Indices
Educators who use criterion-referenced measurement to ascertain the current level of performance of an examinee in order that the examinee may be classified as either a master or a nonmaster need to know the accuracy and consistency of their decisions regarding assignment of mastery states. This study examined the sampling distribution characteristics of two reliability indices that use the squared-error agreement function: Livingston's k^2(X,Tx) and Brennan and Kane's M(C). The sampling distribution characteristics of five indices that use the threshold agreement function were also examined: Subkoviak's Pc, Huynh's p and k, and Swaminathan's p and k. These seven methods of calculating reliability were also compared under varying conditions of sample size, test length, and criterion or cutoff score. Computer-generated data provided randomly parallel test forms for N = 2000 cases. From this, 1000 samples were drawn, with replacement, and each of the seven reliability indices was calculated. Descriptive statistics were collected for each sample set and examined for distribution characteristics. In addition, the mean value for each index was compared to the population parameter value of consistent mastery/nonmastery classifications. The results indicated that the sampling distribution characteristics of all seven reliability indices approach normal characteristics with increased sample size. The results also indicated that Huynh's p was the most accurate estimate of the population parameter, with the smallest degree of negative bias. Swaminathan's p was the next best estimate of the population parameter, but it has the disadvantage of requiring two test administrations, while Huynh's p index only requires one administration.
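Of the squared-error agreement indices named above, Livingston's k^2(X,Tx) has a compact closed form: it augments a conventional reliability coefficient by the squared distance between the mean score and the cutoff. A minimal sketch, assuming a precomputed reliability estimate (the abstract does not specify one):

```python
import numpy as np

def livingston_k2(scores, reliability, cutoff):
    """Livingston's squared-error agreement index:
        k^2 = (rho * s^2 + (mean - C)^2) / (s^2 + (mean - C)^2)
    where rho is a conventional reliability estimate, s^2 the observed
    score variance, and C the criterion (cutoff) score."""
    x = np.asarray(scores, dtype=float)
    s2 = x.var()                     # population variance, per the classic formula
    dev2 = (x.mean() - cutoff) ** 2
    return (reliability * s2 + dev2) / (s2 + dev2)

# Toy scores (illustrative only) with an assumed reliability of 0.80.
scores = [10, 12, 14, 16, 18]
print(round(livingston_k2(scores, reliability=0.80, cutoff=20), 4))  # 0.9636
```

Note that k^2 is never smaller than the conventional reliability and grows as the cutoff moves away from the mean, which is why it suits mastery/nonmastery decisions rather than norm-referenced ranking.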
Comparison of Methods for Computation and Cumulation of Effect Sizes in Meta-Analysis
This study examined the statistical consequences of employing various methods of computing and cumulating effect sizes in meta-analysis. Six methods of computing effect size, and three techniques for combining study outcomes, were compared. Effect size metrics were calculated with one-group and pooled standardizing denominators, corrected for bias and for unreliability of measurement, and weighted by sample size and by sample variance. Cumulating techniques employed as units of analysis the effect size, the study, and an average study effect. In order to determine whether outcomes might vary with the size of the meta-analysis, mean effect sizes were also compared for two smaller subsets of studies. An existing meta-analysis of 60 studies examining the effectiveness of computer-based instruction was used as a database for this investigation. Recomputation of the original study data under the six different effect size formulas showed no significant difference among the metrics. Maintaining the independence of the data by using only one effect size per study, whether a single or averaged effect, produced a higher mean effect size than averaging all effect sizes together, although the difference did not reach statistical significance. The sampling distribution of effect size means approached that of the population of 60 studies for subsets consisting of 40 studies, but not for subsets of 20 studies. Results of this study indicated that the researcher may choose any of the methods for effect size calculation or cumulation without fear of biasing the outcome of the meta-analysis. If weighted effect sizes are to be used, care must be taken to avoid giving undue influence to studies that have large sample sizes but are not necessarily the most meaningful, theoretically representative, or elegantly designed. It is important for the researcher to locate all relevant studies on the topic under investigation, since selective or even random …
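Two of the computational choices the abstract compares, the standardizing denominator (pooled versus one-group) and the small-sample bias correction, can be sketched in a single function. This is an illustrative reconstruction under standard formulas (pooled SD and Hedges' approximate correction), not the study's own code.

```python
import math

def effect_size(m_t, m_c, sd_t, sd_c, n_t, n_c,
                denominator="pooled", unbiased=False):
    """Standardized mean difference between treatment and control groups.
    denominator='pooled'  -> pooled within-group SD (Cohen's d style)
    denominator='control' -> control-group SD alone (the one-group standardizer)
    unbiased=True applies Hedges' approximate small-sample correction."""
    if denominator == "pooled":
        sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                       / (n_t + n_c - 2))
    else:
        sd = sd_c
    d = (m_t - m_c) / sd
    if unbiased:
        df = n_t + n_c - 2
        d *= 1 - 3 / (4 * df - 1)   # Hedges' correction shrinks d slightly
    return d

# Toy study: treatment mean 105, control mean 100, both SDs 15, n = 30 each.
d = effect_size(105, 100, 15, 15, 30, 30)
g = effect_size(105, 100, 15, 15, 30, 30, unbiased=True)
print(round(d, 4))  # 0.3333
print(round(g, 4))  # 0.329
```

With equal group SDs the pooled and one-group denominators coincide, which is consistent with the abstract's finding that the six formulas produced no significant differences; the choices matter more when group variances diverge.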