Ability Estimation Under Different Item Parameterization and Scoring Models

Description: A Monte Carlo simulation study investigated the effect of scoring format, item parameterization, threshold configuration, and prior ability distribution on the accuracy of ability estimation given various IRT models. Item response data on 30 items from 1,000 examinees was simulated using known item parameters and ability estimates. The item response data sets were submitted to seven dichotomous or polytomous IRT models with different item parameterization to estimate examinee ability. The accuracy of the ability estimation for a given IRT model was assessed by the recovery rate and the root mean square errors. The results indicated that polytomous models produced more accurate ability estimates than the dichotomous models under all combinations of research conditions, as indicated by higher recovery rates and lower root mean square errors. For the item parameterization models, the one-parameter model outperformed the two-parameter and three-parameter models under all research conditions. Among the polytomous models, the partial credit model had more accurate ability estimation than the other three polytomous models. The nominal categories model performed better than the general partial credit model and the multiple-choice model, with the multiple-choice model the least accurate. The results further indicated that certain prior ability distributions had an effect on the accuracy of ability estimation; however, no clear order of accuracy among the four prior distribution groups was identified due to an interaction between prior ability distribution and threshold configuration. The recovery rate was lower when the test items had categories with unequal threshold distances, were close at one end of the ability/difficulty continuum, and were administered to a sample of examinees whose population ability distribution was skewed to the same end of the ability continuum.
Date: May 2002
Creator: Si, Ching-Fung B.
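
The ability-recovery logic of the simulation study above can be sketched as follows. This is a minimal illustration, not the study's code: it assumes a Rasch (one-parameter) model, expected a posteriori (EAP) scoring with a standard normal prior on an evenly spaced grid, and made-up item difficulties. All function names are hypothetical, and only the root mean square error is computed, because the abstract does not define the recovery rate.

```python
import math
import random

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def simulate_responses(thetas, difficulties, rng):
    """Draw a 0/1 response for every examinee-item pair."""
    return [[1 if rng.random() < rasch_prob(t, b) else 0 for b in difficulties]
            for t in thetas]

def eap_estimate(responses, difficulties, grid):
    """EAP ability estimate with a standard normal prior (an assumption)."""
    num = den = 0.0
    for q in grid:
        weight = math.exp(-0.5 * q * q)  # unnormalized N(0, 1) prior density
        for x, b in zip(responses, difficulties):
            p = rasch_prob(q, b)
            weight *= p if x == 1 else 1.0 - p
        num += q * weight
        den += weight
    return num / den

def rmse(true_values, estimates):
    """Root mean square error between true and estimated values."""
    n = len(true_values)
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(true_values, estimates)) / n)

def run_replication(n_examinees=1000, n_items=30, seed=1):
    """One Monte Carlo replication: simulate, estimate, and score recovery."""
    rng = random.Random(seed)
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_examinees)]
    # Illustrative difficulties spread evenly over [-2, 2] (an assumption).
    difficulties = [-2.0 + 4.0 * i / (n_items - 1) for i in range(n_items)]
    data = simulate_responses(thetas, difficulties, rng)
    grid = [-4.0 + 0.2 * i for i in range(41)]
    estimates = [eap_estimate(row, difficulties, grid) for row in data]
    return rmse(thetas, estimates)

if __name__ == "__main__":
    print(f"RMSE over one replication: {run_replication():.3f}")
```

Swapping the prior weights or the response model inside `eap_estimate` is where conditions such as a skewed prior ability distribution would enter a fuller simulation.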
A comparison of traditional and IRT factor analysis.

Description: This study investigated the item parameter recovery of two methods of factor analysis. The methods researched were a traditional factor analysis of tetrachoric correlation coefficients and an IRT approach to factor analysis which utilizes marginal maximum likelihood estimation using an EM algorithm (MMLE-EM). Dichotomous item response data was generated under the 2-parameter normal ogive model (2PNOM) using PARDSIM software. Examinee abilities were sampled from both the standard normal and uniform distributions. True item discrimination, a, was normal with a mean of .75 and a standard deviation of .10. True b, item difficulty, was specified as uniform [-2, 2]. The two distributions of abilities were completely crossed with three test lengths (n= 30, 60, and 100) and three sample sizes (N = 50, 500, and 1000). Each of the 18 conditions was replicated 5 times, resulting in 90 datasets. PRELIS software was used to conduct a traditional factor analysis on the tetrachoric correlations. The IRT approach to factor analysis was conducted using BILOG 3 software. Parameter recovery was evaluated in terms of root mean square error, average signed bias, and Pearson correlations between estimated and true item parameters. ANOVAs were conducted to identify systematic differences in error indices. Based on many of the indices, it appears the IRT approach to factor analysis recovers item parameters better than the traditional approach studied. Future research should compare other methods of factor analysis to MMLE-EM under various non-normal distributions of abilities.
Date: December 2004
Creator: Kay, Cheryl Ann
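
The crossed design and data generation described above might be set up along the following lines. The study used PARDSIM; the code below is not that software, only an illustration of the two-parameter normal ogive model, P(correct) = Phi(a(theta - b)), with the stated parameter distributions. The abstract does not give the range of the uniform ability distribution, so the interval [-3, 3] is an assumption, as are all function names.

```python
import itertools
import math
import random

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def draw_ability(distribution, rng):
    """Sample one examinee ability from the named distribution."""
    if distribution == "normal":
        return rng.gauss(0.0, 1.0)
    # Uniform range is an assumption; the abstract does not state it.
    return rng.uniform(-3.0, 3.0)

def generate_dataset(distribution, n_items, n_examinees, rng):
    """Simulate dichotomous responses under the 2PNOM."""
    a = [rng.gauss(0.75, 0.10) for _ in range(n_items)]   # true discrimination
    b = [rng.uniform(-2.0, 2.0) for _ in range(n_items)]  # true difficulty
    thetas = [draw_ability(distribution, rng) for _ in range(n_examinees)]
    responses = [[1 if rng.random() < normal_cdf(a[j] * (t - b[j])) else 0
                  for j in range(n_items)] for t in thetas]
    return a, b, responses

# 2 ability distributions x 3 test lengths x 3 sample sizes = 18 conditions.
conditions = list(itertools.product(["normal", "uniform"],
                                    [30, 60, 100],
                                    [50, 500, 1000]))
REPLICATIONS = 5

if __name__ == "__main__":
    rng = random.Random(2004)
    print(len(conditions) * REPLICATIONS, "datasets in the full design")
    a, b, data = generate_dataset("normal", 30, 50, rng)
    print(len(data), "examinees x", len(data[0]), "items")
```

Each generated dataset would then be passed to the two estimation methods, and the returned `a` and `b` lists serve as the true values for the error indices.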
Influence of item response theory and type of judge on a standard set using the iterative Angoff standard setting method

Description: The purpose of this investigation was to determine the influence of item response theory and different types of judges on a standard. The iterative Angoff standard setting method was employed by all judges to determine a cut-off score for a public school district-wide criterion-referenced test.
Date: August 1992
Creator: Hamberlin, Melanie Kidd
Using Posterior Predictive Checking of Item Response Theory Models to Study Invariance Violations

Description: The common practice for testing measurement invariance is to constrain parameters to be equal over groups, and then evaluate the model-data fit to reject or fail to reject the restrictive model. Posterior predictive checking (PPC) provides an alternative approach to evaluating model-data discrepancy. This paper explores the utility of PPC in estimating measurement invariance. The simulation results show that the posterior predictive p (PP p) values of item parameter estimates respond to various invariance violations, whereas the PP p values of item-fit index may fail to detect such violations. The current paper suggests comparing group estimates and restrictive model estimates with posterior predictive distributions in order to demonstrate the pattern of misfit graphically.
Date: May 2017
Creator: Xin, Xin
Assessment of Competencies among Doctoral Trainees in Psychology

Description: The recent shift to a culture of competence has permeated several areas of professional psychology, including competency identification, competency-based education training, and competency assessment. A competency framework has also been applied to various programs and specialty areas within psychology, such as clinical, counseling, clinical health, school, cultural diversity, neuro-, gero-, child, and pediatric psychology. Despite the spread of competency focus throughout psychology, few standardized measures of competency assessment have been developed. To the authors' knowledge, only four published studies on measures of competency assessment in psychology currently exist. While these measures demonstrate significant steps in progressing the assessment of competence, three of these measures were designed for use with individual programs, two of them international (i.e., UK and Taiwan). The current study applied the seminal Competency Benchmarks, via a recently adapted benchmarks form (i.e., Practicum Evaluation form; PEF), to practicum students at the University of North Texas. In addition to traditional supervisor ratings, the present study also involved self-, peer supervisor, and peer supervisee ratings to provide 360-degree evaluations. Item-response theory (IRT) was used to evaluate the psychometric properties of the PEF and inform potential revisions of this form. Supervisor ratings of competency were found to fit the Rasch model specified, lending support to use of the benchmarks framework as assessed by this form. Self- and peer-ratings were significantly correlated with supervisor ratings, indicating that there may be some utility to 360-degree evaluations. Finally, as predicted, foundational competencies were rated as significantly higher than functional competencies, and competencies improved significantly with training. Results of the current study provide clarity about the utility of the PEF and inform our understanding of practicum-level competencies.
Date: August 2017
Creator: Price, Samantha
Development of an Outcome Measure for Use in Psychology Training Clinics

Description: The ability to monitor client change in psychotherapy over time is vital to quality assurance in service delivery as well as the continuing improvement of psychotherapy research. Unfortunately, there is not currently a comprehensive, affordable, and easily utilized outcome measure for psychotherapy specifically normed and standardized for use in psychology training clinics. The current study took the first steps in creating such an outcome measure. Following development of an item bank, factor analysis and item-response theory analyses were applied to data gathered from a stratified sample of university (n = 101) and community (n = 261) participants. The factor structure did not support a phase model conceptualization, but did reveal a structure consistent with the theoretical framework of the research domain criteria (RDoC). Suggestions for next steps in the measure development process are provided and implications discussed.
Date: May 2017
Creator: Davis, Elizabeth C.
CT3 as an Index of Knowledge Domain Structure: Distributions for Order Analysis and Information Hierarchies

Description: The problem with which this study is concerned is articulating all possible CT3 and KR21 reliability measures for every case of a 5x5 binary matrix (32,996,500 possible matrices). The study has three purposes. The first purpose is to calculate CT3 for every matrix and compare the results to the proposed optimum range of .3 to .5. The second purpose is to compare the results from the calculation of KR21 and CT3 reliability measures. The third purpose is to calculate CT3 and KR21 on every strand of a class test whose item set has been reduced using the difficulty strata identified by Order Analysis. The study was conducted by writing a computer program to articulate all possible 5 x 5 matrices. The program also calculated CT3 and KR21 reliability measures for each matrix. The nonparametric technique of Order Analysis was applied to two sections of test items to stratify the items into difficulty levels. The difficulty levels were used to reduce the item set from 22 to 9 items. All possible strands or chains of these items were identified so that both reliability measures (CT3 and KR21) could be calculated. One major finding of this study indicates that .3 to .5 is a desirable range for CT3 (cumulative p=.86 to p=.98) if cumulative frequencies are measured. A second major finding is that the KR21 reliability measure produced an invalid result more than half the time. The last major finding is that CT3, rescaled to range between 0 and 1, supports De Vellis' guidelines for reliability measures. The major conclusion is that CT3 is a better measure of reliability since it considers both inter- and intra-item variances.
Date: December 2002
Creator: Swartz Horn, Rebecca
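
The exhaustive-enumeration approach described above can be illustrated on a smaller scale. The sketch below enumerates every 3 x 3 binary matrix rather than the 5 x 5 case and computes only KR21, using the common formula KR21 = (k / (k - 1)) * (1 - M(k - M) / (k * s^2)), where k is the number of items, M the mean total score, and s^2 the variance of total scores. CT3 is omitted because the abstract does not give its formula. Three assumptions are labeled in the code: rows are treated as examinees, the population variance is used, and a result is counted as problematic when it is undefined (zero variance) or negative. The abstract does not say what it counts as an invalid result, so that last criterion is only one plausible reading.

```python
import itertools

def kr21(scores, k):
    """KR21 from total scores on k items; None when the variance is zero."""
    n = len(scores)
    mean = sum(scores) / n
    # Population variance (an assumption; some texts divide by n - 1).
    var = sum((s - mean) ** 2 for s in scores) / n
    if var == 0:
        return None
    return (k / (k - 1)) * (1.0 - mean * (k - mean) / (k * var))

def all_binary_matrices(n_rows, n_cols):
    """Yield every n_rows x n_cols matrix of 0s and 1s as a list of rows."""
    for bits in itertools.product((0, 1), repeat=n_rows * n_cols):
        yield [list(bits[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]

def summarize(n_rows, n_cols):
    """Count matrices whose KR21 is undefined or negative (assumed criterion)."""
    total = problematic = 0
    for matrix in all_binary_matrices(n_rows, n_cols):
        total += 1
        # Rows are treated as examinees, columns as items (an assumption).
        value = kr21([sum(row) for row in matrix], n_cols)
        if value is None or value < -1e-12:
            problematic += 1
    return total, problematic

if __name__ == "__main__":
    total, problematic = summarize(3, 3)
    print(f"{problematic} of {total} matrices give an undefined or negative KR21")
```

Pure-Python enumeration grows as 2^(rows x columns), so the full 5 x 5 case would call for a faster implementation than this sketch.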
An item response theory analysis of the Rey Osterrieth Complex Figure Task.

Description: The Rey-Osterrieth Complex Figure Task (ROCFT) has been a standard in neuropsychological assessment for six decades. Many researchers have contributed administration procedures, additional scoring systems and normative data to improve its utility. Despite the abundance of research, the original 36-point scoring system still reigns among clinicians, even with documented problems with ceiling and floor effects and poor discrimination between levels of impairment. This study is an attempt to provide a new method based upon item response theory that will allow clinicians to better describe the impairment levels of their patients. Through estimation of item characteristic curves, underlying traits can be estimated while taking into account varying levels of difficulty and discrimination within the set of individual items. The ultimate goal of the current research is identification of a subset of ROCFT items that can be examined in addition to total scores to provide an extra level of information for clinicians, particularly when they are faced with a need to discriminate between severely and mildly impaired patients.
Date: December 2008
Creator: Everitt, Alaina
Stratified item selection and exposure control in unidimensional adaptive testing in the presence of two-dimensional data.

Description: It is not uncommon to use unidimensional item response theory (IRT) models to estimate ability in multidimensional data. Therefore, it is important to understand the implications of summarizing multiple dimensions of ability into a single parameter estimate, especially if effects are confounded when applied to computerized adaptive testing (CAT). Previous studies have investigated the effects of different IRT models and ability estimators by manipulating the relationships between item and person parameters. However, in all cases, the maximum information criterion was used as the item selection method. Because maximum information is heavily influenced by the item discrimination parameter, investigating a-stratified item selection methods is tenable. The current Monte Carlo study compared maximum information, a-stratification, and a-stratification with b blocking item selection methods, alone, as well as in combination with the Sympson-Hetter exposure control strategy. The six testing conditions were conditioned on three levels of interdimensional item difficulty correlations and four levels of interdimensional examinee ability correlations. Measures of fidelity, estimation bias, error, and item usage were used to evaluate the effectiveness of the methods. Results showed either stratified item selection strategy is warranted if the goal is to obtain precise estimates of ability when using unidimensional CAT in the presence of two-dimensional data. If the goal also includes limiting bias of the estimate, Sympson-Hetter exposure control should be included. Results also confirmed that Sympson-Hetter is effective in optimizing item pool usage. Given these results, existing unidimensional CAT implementations might consider employing a stratified item selection routine plus Sympson-Hetter exposure control, rather than recalibrating the item pool under a multidimensional model.
Date: August 2009
Creator: Kalinowski, Kevin E.
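
The two families of item selection rules compared above can be sketched for a unidimensional two-parameter logistic (2PL) pool. This is a simplified illustration, not the study's implementation: the item pool, the number of strata, and the function names are made up, and b blocking and Sympson-Hetter exposure control are omitted. Maximum information picks the unused item with the largest Fisher information, I(theta) = a^2 * P(theta) * (1 - P(theta)), at the current ability estimate, which tends to favor high-a items. a-stratification instead restricts each stage of the test to one discrimination stratum (lowest a first) and picks the item whose difficulty is closest to the current estimate.

```python
import math

def prob_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = prob_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def select_max_info(theta, pool, used):
    """Return the index of the unused item with maximum information."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: information(theta, *pool[i]))

def build_strata(pool, n_strata):
    """Split item indices into n_strata groups by ascending discrimination."""
    order = sorted(range(len(pool)), key=lambda i: pool[i][0])
    size = math.ceil(len(order) / n_strata)
    return [order[s * size:(s + 1) * size] for s in range(n_strata)]

def select_a_stratified(theta, pool, used, strata, stage):
    """Within the current stratum, pick the unused item with b nearest theta."""
    candidates = [i for i in strata[stage] if i not in used]
    return min(candidates, key=lambda i: abs(pool[i][1] - theta))

if __name__ == "__main__":
    # Hypothetical pool of (a, b) pairs.
    pool = [(0.6, -1.0), (0.7, 0.1), (1.0, 0.8),
            (1.1, -0.2), (1.8, 0.3), (2.0, 1.5)]
    strata = build_strata(pool, n_strata=3)
    print("max information picks item", select_max_info(0.0, pool, set()))
    print("a-stratified stage 0 picks item",
          select_a_stratified(0.0, pool, set(), strata, 0))
```

Because the stratified rule holds high-discrimination items back for later stages, it spreads item usage across the pool, which is the motivation the abstract gives for studying it.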
A Comparison of Traditional Norming and Rasch Quick Norming Methods

Description: The simplicity and ease of use of the Rasch procedure is a decided advantage. The test user needs only two numbers: the frequency of persons who answered each item correctly and the Rasch-calibrated item difficulty, usually a part of an existing item bank. Norms can be computed quickly for any specific group of interest. In addition, once the selected items from the calibrated bank are normed, any test, built from the item bank, is automatically norm-referenced. Thus, it was concluded that the Rasch quick norm procedure is a meaningful alternative to traditional classical true score norming for test users who desire normative data.
Date: August 1993
Creator: Bush, Joan Spooner
A Structural and Psychometric Evaluation of a Situational Judgment Test: The Workplace Skills Survey

Description: Some basic but desirable employability skills are antecedents of job performance. The Workplace Skills Survey (WSS) is a 48-item situational judgment test (SJT) used to assess non-technical workplace skills for both entry-level and experienced workers. Unfortunately, the psychometric evidence for use of its scores is far from adequate. The purpose of the current study was two-fold: (a) to examine the proposed structure of WSS scores using confirmatory factor analysis (CFA), and (b) to explore the WSS item functioning and performance using item response theory (IRT). A sample of 1,018 Jamaican unattached youth completed the WSS instrument as part of a longitudinal study on the efficacy of a youth development program in Jamaica. Three CFA models were tested for the construct validity of WSS scores. Estimates of item difficulty, item discrimination, and examinee proficiency were obtained with item response theory (IRT) and plotted as item characteristic curves (ICCs) and item information curves (IICs). Results showed that the WSS performed quite well as a whole and provided precise measurement especially for respondents at latent trait levels of -0.5 and +1.5. However, some modifications of some items were recommended. CFA analyses showed supportive evidence of the one-factor construct model, while the six-factor model and higher-order model were not achieved. Several directions for future research are suggested.
Date: August 2014
Creator: Wei, Min
