ABSTRACT
OBJECTIVE: To report on the development and calibration of the new Blood Pressure Dysregulation Measurement System (BPD-MS) item banks that assess the impact of BPD on health-related quality of life (HRQOL) and the daily activities of Veterans and non-Veterans with spinal cord injury (SCI). DESIGN: Cross-sectional survey study. SETTING: Two Veterans Affairs medical centers and an SCI model system site. PARTICIPANTS: 454 respondents with SCI (n=262 American Veterans and n=192 non-Veterans). INTERVENTIONS: N/A. MAIN OUTCOME MEASURES: The BPD-MS item banks. RESULTS: BPD item pools were developed and refined using literature reviews, qualitative data from focus groups, and cognitive debriefing of persons with SCI and professional caregivers. The item banks then underwent expert review, reading level assessment, and translatability review prior to field testing. The item pools consisted of 180 unique questions (items). Exploratory and confirmatory factor analyses, item response theory modeling, and differential item functioning investigations resulted in item banks that included a total of 150 items: 75 describing the impact of autonomic dysreflexia (AD) on HRQOL, 55 describing the impact of low blood pressure (LBP) on HRQOL, and 20 describing the impact of LBP on daily activities. In addition, 10-item short forms were constructed based on item response theory-derived item information values and the clinical relevance of item content. CONCLUSIONS: The new BPD-MS item banks and corresponding 10-item short forms were developed using established, rigorous measurement development standards and represent the first BPD-specific patient-reported outcomes measurement system designed specifically for use in the SCI population.
ABSTRACT
BACKGROUND: The Generic Adherence for Chronic Diseases Profile (GACID-P) is a French generic scale developed to measure adherence in several disease areas such as cardiology, rheumatology, diabetes, cancer and infectious diseases. METHOD: We aimed to study the measurement invariance of the Generic Adherence for Chronic Diseases Profile with an item response model, optimize the new instrument version using the results of the item response model and qualitative content analyses, and validate the instrument. The metric properties of the optimized version were studied according to classical test theory and item response model analysis. RESULTS: A sample of 397 patients consulting at two French hospitals (in diabetes, cardiology, rheumatology, oncology and infectious diseases) and in four private practices was recruited; 314 (79%) patients also completed the questionnaire 15 days later. Factor analyses revealed four dimensions: "Forgetting to take medication", "Intention to comply with treatment", "Limitation of risk-related consumer habits" and "Healthy lifestyle". The item response model and content analyses optimized these four dimensions, regrouping the 32 items into four dimensions totaling 25 items, including one item conditioned on tobacco use. The psychometric properties and scale calibration were satisfactory. One score per dimension was calculated: as the sum of the items for the dimensions "Forgetting to take medication" and "Intention to comply with treatment", and as a weighted score based on the item response model analysis for the two other dimensions, because differential item functioning was found for two items. CONCLUSION: Four adherence profile scores were obtained. The instrument validity was documented by a theoretical approach and content analysis. The Generic Adherence for Chronic Diseases Profile is now available for research targeting adherence in a broad perspective.
Subjects
Healthy Lifestyle, Quality of Life, Humans, Calibration, Chronic Disease, Factor Analysis
ABSTRACT
Determining the number of dimensions is extremely important in applying item response theory (IRT) models to data. Traditional and revised parallel analyses have been proposed within the factor analysis framework, and both have shown some promise in assessing dimensionality. However, their performance in the IRT framework has not been systematically investigated. Therefore, we evaluated the accuracy of traditional and revised parallel analyses for determining the number of underlying dimensions in the IRT framework by conducting simulation studies. Six data generation factors were manipulated: number of observations, test length, type of generation models, number of dimensions, correlations between dimensions, and item discrimination. Results indicated that (a) when the generated IRT model is unidimensional, across all simulation conditions, traditional parallel analysis using principal component analysis and tetrachoric correlation performs best; (b) when the generated IRT model is multidimensional, traditional parallel analysis using principal component analysis and tetrachoric correlation yields the highest proportion of accurately identified underlying dimensions across all factors, except when the correlation between dimensions is 0.8 or the item discrimination is low; and (c) under a few combinations of simulated factors, none of the eight methods performed well (e.g., when the generation model is three-dimensional 3PL, the item discrimination is low, and the correlation between dimensions is 0.8).
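The traditional parallel analysis procedure evaluated above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the study's implementation: it uses Pearson (phi) correlations rather than the tetrachoric correlations examined in the study, a hand-written Jacobi routine for the eigenvalues, and reference data drawn as independent Bernoulli responses. All function names, the simulated 2PL data, the sample size, and the discrimination value are hypothetical.

```python
import math
import random


def correlation_matrix(data):
    """Pearson correlation matrix for a persons-by-items list of rows."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / n)
           for j in range(p)]
    corr = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i, p):
            cov = sum((row[i] - means[i]) * (row[j] - means[j])
                      for row in data) / n
            corr[i][j] = corr[j][i] = cov / (sds[i] * sds[j])
    return corr


def jacobi_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix (descending) via Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                angle = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(angle), math.sin(angle)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def parallel_analysis(data, n_sims=20, percentile=0.95, seed=0):
    """Count leading eigenvalues that exceed those of random reference data."""
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    observed = jacobi_eigenvalues(correlation_matrix(data))
    props = [sum(row[j] for row in data) / n for j in range(p)]
    sims = []
    for _ in range(n_sims):
        # Assumption: reference items are independent Bernoulli draws that
        # match each item's observed proportion of 1s.
        fake = [[1 if rng.random() < props[j] else 0 for j in range(p)]
                for _ in range(n)]
        sims.append(jacobi_eigenvalues(correlation_matrix(fake)))
    n_dims = 0
    for k in range(p):
        ref = sorted(s[k] for s in sims)
        cutoff = ref[min(n_sims - 1, int(percentile * n_sims))]
        if observed[k] <= cutoff:
            break
        n_dims += 1
    return n_dims


# Hypothetical unidimensional 2PL data: 300 persons, 6 items, a = 2.0, b = 0.
data_rng = random.Random(1)
responses = []
for _ in range(300):
    theta = data_rng.gauss(0.0, 1.0)
    p_correct = 1.0 / (1.0 + math.exp(-2.0 * theta))
    responses.append([1 if data_rng.random() < p_correct else 0
                      for _ in range(6)])
n_dims = parallel_analysis(responses)
```

The count stops at the first eigenvalue that fails to exceed its reference cutoff, which is the usual sequential retention rule; the revised parallel analysis variants compared in the study modify how the reference eigenvalues are generated and are not covered here.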
ABSTRACT
The purpose of this study was to examine the effects of different data conditions on item parameter recovery and classification accuracy of three dichotomous mixture item response theory (IRT) models: the Mix1PL, Mix2PL, and Mix3PL. Manipulated factors in the simulation included the sample size (11 different sample sizes from 100 to 5,000), test length (10, 30, and 50), number of classes (2 and 3), the degree of latent class separation (normal/no separation, small, medium, and large), and class sizes (equal vs. nonequal). Effects were assessed using root mean square error (RMSE) and classification accuracy percentage computed between true parameters and estimated parameters. The results of this simulation study showed that more precise estimates of item parameters were obtained with larger sample sizes and longer test lengths. Item parameter recovery deteriorated as the number of classes increased and the sample size decreased. Classification accuracy was also better for two-class solutions than for three-class solutions. Results for both item parameter estimates and classification accuracy differed by model type. More complex models and models with larger class separations produced less accurate results. Mixture proportions also affected RMSE and classification accuracy differently: groups of equal size produced more precise item parameter estimates, but the reverse was the case for classification accuracy. Results suggested that dichotomous mixture IRT models require more than 2,000 examinees to yield stable results, as even shorter tests needed samples of this size for precise estimates. This number increased as the number of latent classes, the degree of separation, and model complexity increased.
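The two outcome measures named above can be computed as in the following sketch. The function names and the toy values are hypothetical, and the sketch assumes that estimated class labels have already been aligned with the true labels; label switching, which mixture models commonly require handling, is ignored here.

```python
import math


def rmse(true_values, estimates):
    """Root mean square error between true and estimated parameters."""
    squared = [(t - e) ** 2 for t, e in zip(true_values, estimates)]
    return math.sqrt(sum(squared) / len(squared))


def classification_accuracy(true_classes, estimated_classes):
    """Percentage of examinees assigned to their true latent class."""
    hits = sum(1 for t, e in zip(true_classes, estimated_classes) if t == e)
    return 100.0 * hits / len(true_classes)


# Toy values for illustration only (not taken from the study).
true_b = [-1.0, 0.0, 1.0]
est_b = [-0.8, 0.1, 1.3]
difficulty_rmse = rmse(true_b, est_b)
accuracy = classification_accuracy([1, 1, 2, 2], [1, 2, 2, 2])
```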
ABSTRACT
This note is concerned with evaluation of location parameters for polytomous items in multiple-component measuring instruments. A point and interval estimation procedure for these parameters is outlined that is developed within the framework of latent variable modeling. The method permits educational, behavioral, biomedical, and marketing researchers to quantify important aspects of the functioning of items with ordered multiple response options, which follow the popular graded response model. The procedure is routinely and readily applicable in empirical studies using widely circulated software and is illustrated with empirical data.
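To illustrate what the location parameters of a graded response model item represent, the sketch below computes category probabilities from a discrimination parameter and ordered location (threshold) parameters using the standard logistic form of the model. It does not reproduce the point and interval estimation procedure outlined in the note; the function name and parameter values are hypothetical.

```python
import math


def grm_category_probs(theta, a, locations):
    """Category probabilities for one graded response model item.

    `locations` are the ordered location parameters b_1 < ... < b_K;
    the item then has K + 1 ordered response categories.
    """
    # Cumulative probabilities P(X >= k), padded with 1 (k = 0) and 0.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in locations]
    cumulative.append(0.0)
    # Each category probability is a difference of adjacent cumulatives.
    return [cumulative[k] - cumulative[k + 1]
            for k in range(len(locations) + 1)]


# Hypothetical item with four response options.
probs = grm_category_probs(theta=0.0, a=1.5, locations=[-1.0, 0.0, 1.0])
```

Each location parameter is the trait level at which the probability of responding in that category or higher equals 0.5.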
ABSTRACT
In the literature on modern psychometric modeling, mostly related to item response theory (IRT), model fit is evaluated with established indices, such as χ2, M2, and root mean square error of approximation (RMSEA) for absolute assessments, as well as Akaike information criterion (AIC), consistent AIC (CAIC), and Bayesian information criterion (BIC) for relative comparisons. Recent developments show a trend toward merging psychometrics and machine learning, yet there remains a gap in model fit evaluation, specifically in the use of the area under the curve (AUC). This study focuses on the behavior of AUC in fitting IRT models. Rounds of simulations were conducted to investigate AUC's appropriateness (e.g., power and Type I error rate) under various conditions. The results show that AUC had certain advantages under some conditions, such as high-dimensional structures with two-parameter logistic (2PL) and some three-parameter logistic (3PL) models, while its disadvantages were also obvious when the true model is unidimensional. The study cautions researchers about the dangers of relying solely on AUC when evaluating psychometric models.
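As a rough illustration of how AUC can be attached to an IRT model, the sketch below scores observed dichotomous responses against 2PL model-implied probabilities and computes AUC with the pairwise (Mann-Whitney) formulation. The abstract does not describe the study's exact procedure, so the function names, parameter values, and toy responses here are assumptions.

```python
import math


def p_2pl(theta, a, b):
    """2PL probability of a correct (or endorsed) response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparisons; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical abilities and observed responses to one item (a = 1.2, b = 0).
thetas = [-1.0, -0.5, 0.5, 1.0]
observed = [0, 1, 0, 1]
predicted = [p_2pl(t, a=1.2, b=0.0) for t in thetas]
model_auc = auc(observed, predicted)
```

An AUC of 0.5 corresponds to predictions no better than chance and 1.0 to perfect separation of the observed responses.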
ABSTRACT
Complex span tasks are perhaps the most widely used paradigm to measure working memory capacity (WMC). Researchers assume that all types of complex span tasks assess domain-general WM. However, most research supporting this claim comes from factor analysis approaches that do not examine task performance at the item level, thus not allowing comparison of the characteristics of verbal and spatial complex span tasks. Item response theory (IRT) can help determine the extent to which different complex span tasks assess domain-general WM. In the current study, spatial and verbal complex span tasks were examined using IRT. The results revealed differences between verbal and spatial tasks in terms of item difficulty and block difficulty, and showed that most subjects with below-average ability were able to answer most items correctly across all tasks. In line with previous research, the findings suggest that examining domain-general WM by using only one task might elicit skewed scores based on task domain. Further, visuospatial complex span tasks should be prioritized as a measure of WMC if resources are limited.
ABSTRACT
OBJECTIVES: To assess: (1) the Eating Assessment Tool (EAT-10) with item response theory (IRT) to determine which individual items provide the most information, (2) the extent to which dysphagia is measured with subsets of items while maintaining precise score estimates, and (3) whether 5-item scales differ in discriminatory ability from the parent 10-item instrument. METHODS: Prospectively collected data from 2,339 patients who completed the EAT-10 questionnaire during evaluation at a tertiary care otolaryngology clinic were utilized. IRT analyses provided discrimination and location parameters associated with individual questions. Residual item correlations were also assessed to identify redundant information. Based on these results, three 5-item subsets were further evaluated using item information function curves. Areas under receiver-operator characteristic curves (ROC-AUC) were also calculated to evaluate the discriminatory ability for dysphagia-related clinical diagnoses. RESULTS: Item discrimination parameter estimates ranged from 1.71 to 5.46, with higher values indicating more information. Residual item correlations were determined within item pairs, and location parameters were calculated. Based on these data, in combination with clinical utility, three 5-item subsets were proposed and assessed. ROC-AUC analyses demonstrated no significant difference between the EAT-5-Alpha subset and the original 10-item instrument for discriminating dysphagia as a primary diagnosis (0.88, 0.88). The EAT-5-Clinical subset outperformed the original 10-item instrument in ROC-AUC for aspiration. The EAT-5-Range subset was significantly associated with problems with thin liquids. CONCLUSIONS: IRT analyses distinguished three proposed 5-item subsets of the EAT-10 instrument, supporting shorter survey options that still reflect the impact of dysphagia without significant loss of discrimination.
LEVEL OF EVIDENCE: 3 (Diagnostic testing with consistently applied reference standards, partial blinding). Laryngoscope, 2023.
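The link between discrimination and information noted above can be illustrated with the item information function of a dichotomous 2PL item, I(θ) = a²P(θ)(1 − P(θ)). This is a simplification for illustration: the abstract does not name the IRT model, and a multi-category instrument would normally call for a polytomous model. The two discrimination values are the reported extremes; the location of 0 and the function names are hypothetical.

```python
import math


def p_2pl(theta, a, b):
    """2PL probability of endorsing the item."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """2PL item information: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


# Reported range of discrimination estimates; location b = 0 is assumed.
info_low = item_information(theta=0.0, a=1.71, b=0.0)
info_high = item_information(theta=0.0, a=5.46, b=0.0)
```

Under this model information peaks at θ = b, where it equals a²/4, so the item with the larger discrimination is far more informative near its location but less so far from it.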
ABSTRACT
BACKGROUND: Fatigue is a common daily experience and a symptom of various disorders. While scholars have discussed the use of the Fatigue Severity Scale (FSS) using item response theory (IRT), the characteristics of the Japanese version have not yet been examined. This study evaluated the psychometric properties of the FSS using IRT and assessed its reliability and concurrent validity with a general sample in Japan. METHODS AND MEASURES: A total of 1,007 Japanese individuals participated in an online survey, with 692 of them providing valid data. Of these, 125 participants took part in a retest after approximately 18 days, and their longitudinal data were analyzed. In addition, the graded response model (GRM) was used to assess the FSS items' characteristics. RESULTS: The GRM results supported using seven items and a 6-point scale. The FSS's reliability was acceptable. Furthermore, correlation and regression analyses indicated adequate validity. The synchronous effects models demonstrated that the Multidimensional Fatigue Inventory (MFI) enhanced depression, and depression enhanced FSS. CONCLUSION: This study suggested that the Japanese version of the FSS should be a 7-item scale with a 6-point response scale. Further investigations may reveal the different aspects of fatigue assessed by the analyzed fatigue measures.
Subjects
Fatigue, Humans, Psychometrics, Reproducibility of Results, Fatigue/diagnosis, Surveys and Questionnaires, Japan
ABSTRACT
The COVID-19 pandemic's global emergence/spread caused widespread fear. Measurement/tracking of COVID-19 fear could facilitate remediation. Despite the Fear of COVID-19 Scale (FCV-19S)'s validation in multiple languages/countries, nationwide United States (U.S.) studies are scarce. Cross-sectional classical test theory-based validation studies predominate. Our longitudinal study sampled respondents to a 3-wave, nationwide, online survey. We calibrated the FCV-19S using a unidimensional graded response model. Item/scale monotonicity, discrimination, informativeness, goodness-of-fit, criterion validity, internal consistency, and test-retest reliability were assessed. Items 7, 6, and 3 consistently displayed very high discrimination. Other items had moderate-to-high discrimination. Items 3, 6, and 7 were most (items 1 and 5 the least) informative. [Correction added on 18 May 2023, after first online publication: In the preceding sentence, the term 'items one-fifth least' has been changed to 'items 1 and 5 the least'.] Item scalability was 0.62-0.69; full-scale scalability 0.65-0.67. Ordinal reliability coefficient was 0.94; test-retest intraclass correlation coefficient 0.84. Positive correlations with posttraumatic stress/anxiety/depression, and negative correlations with emotional stability/resilience supported convergent/divergent validity. The FCV-19S validly/reliably captures temporal variation in COVID-19 fear across the U.S.
ABSTRACT
The DSM-5 criteria for cannabis use disorder (CUD) combine DSM-IV dependence and abuse criteria (without legal problems) and new withdrawal and craving criteria. Information on dimensionality, internal reliability, and differential functioning of the DSM-5 CUD criteria is lacking. Additionally, dimensionality of the DSM-5 withdrawal items is unknown. This study examined the psychometric properties of the DSM-5 CUD criteria among adults who used cannabis in the past 7 days (N = 5,119). Adults with frequent cannabis use were recruited from the US general population through social media and filled in a web-based survey about demographics and cannabis use behaviors. Factor analysis was used to assess dimensionality, and item response theory analysis models were used to explore relationships between the criteria and the underlying latent trait (CUD), and whether each criterion and the criteria set functioned differently by demographic and clinical characteristics: sex, age, state-level cannabis laws, reasons for cannabis use, and frequency of use. The DSM-5 CUD criteria showed unidimensionality and provided information about the CUD latent trait across the severity spectrum. The cannabis withdrawal items indicated one underlying latent factor. While some CUD criteria functioned differently in specific subgroups, the criteria set as a whole functioned similarly across subgroups. In this online sample of adults with frequent cannabis use, evidence supports the reliability, validity, and utility of the DSM-5 CUD diagnostic criteria set, which can be used for determining a major risk of cannabis use, i.e., CUD, to inform cannabis policies and public health messaging, and for developing intervention strategies.
ABSTRACT
Precise individualized EEG source localization is predicated on having accurate subject-specific Lead Fields (LFs) obtained from each subject's Magnetic Resonance Images (MRI). LF calculation is a complex process involving several error-prone steps that start with obtaining a realistic head model from the MRI and end with computationally expensive solvers such as the Boundary Element Method (BEM) or Finite Element Method (FEM). Current Big-Data applications require the calculation of batches of hundreds or thousands of LFs. LF quality control is conventionally performed subjectively by experts, a procedure not feasible in practice for larger batches. To facilitate this step, we introduce the Lead Field Automatic-Quality Control Index (LF-AQI) that flags LFs with potential errors. We base our LF-AQI on the assumption that LFs obtained from simpler head models, i.e., the homogeneous head model LF (HHM-LF) or spherical head model LF (SHM-LF), deviate only moderately from a "good" realistic test LF. Since these simpler LFs are easier to compute and check for errors, they may serve as "reference LFs" to detect anomalous realistic test LFs. We investigated this assumption by comparing correlation-based channel ρmin(ref,test) and source τmin(ref,test) similarity indices (SI) between "gold standards," i.e., very accurate FEM and BEM LFs, and the proposed references (HHM-LF and SHM-LF). Surprisingly, we found that the simplest possible reference, HHM-LF, had high SI values with the gold standards, leading us to explore further the use of the channel ρmin(HHM-LF,test) and source τmin(HHM-LF,test) SI as a basis for our LF-AQI. Indeed, these SI successfully detected five simulated scenarios of LF artifacts. This result encouraged us to evaluate the SI on a large dataset and thus define our LF-AQI. We thus computed the SI of 1251 LFs obtained from the Child Mind Institute (CMI) MRI dataset.
When the channel ρmin(HHM-LF,test) and source τmin(HHM-LF,test) SI were plotted for all test subjects in a 2D space, most were tightly clustered around the median of a high similarity centroid (HSC), except for a smaller proportion of outliers. We define the LF-AQI for a given LF as the log Euclidean distance between its SI and the HSC median. To automatically detect outliers, the threshold is set at the 90th percentile of the CMI LF-AQIs (-0.9755). An LF-AQI greater than this threshold flags the individual LF to be checked. The robustness of this LF-AQI screening was checked by repeated out-of-sample validation. Strikingly, minor corrections in re-processing the flagged cases eliminated their status as outliers. Furthermore, the "doubtful" labels assigned by LF-AQI were validated by neuroscience students using a Likert scale questionnaire designed to manually check the LF's quality. Item Response Theory (IRT) analysis was applied to the questionnaire results to compute an optimized model and a latent variable θ for that model. A linear mixed model (LMM) between the θ and LF-AQI resulted in an effect with a Cohen's d value of 1.3 and a p-value <0.001, thus validating the correspondence of LF-AQI with the visual quality control. We provide an open-source pipeline to implement both LF calculation and its quality control to allow further evaluation of our index.
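The index definition above (log Euclidean distance from the HSC median, thresholded at the 90th percentile) can be sketched as follows. This is not the authors' open-source pipeline: the coordinate-wise median over all points as a stand-in for the HSC median, the natural logarithm, the nearest-rank percentile, the function names, and the toy similarity values are all assumptions made for illustration.

```python
import math
import statistics


def lf_aqi_scores(similarity_points):
    """Log Euclidean distance of each (rho_min, tau_min) pair to the median."""
    # Assumption: the HSC median is approximated by the coordinate-wise
    # median of all points; the source does not spell out this step.
    rho_med = statistics.median(p[0] for p in similarity_points)
    tau_med = statistics.median(p[1] for p in similarity_points)
    scores = []
    for rho, tau in similarity_points:
        dist = math.hypot(rho - rho_med, tau - tau_med)
        # Assumption: natural log; the source does not state the base.
        scores.append(math.log(dist) if dist > 0.0 else -math.inf)
    return scores


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Toy similarity indices: nine tightly clustered LFs and one anomalous LF.
points = [(0.95, 0.90), (0.96, 0.91), (0.94, 0.89), (0.95, 0.92),
          (0.97, 0.90), (0.93, 0.91), (0.96, 0.89), (0.94, 0.93),
          (0.95, 0.91), (0.40, 0.30)]
scores = lf_aqi_scores(points)
threshold = percentile_nearest_rank(scores, 90)
flagged = [i for i, s in enumerate(scores) if s > threshold]
```

In practice the threshold would be fixed once from a large reference batch (the abstract reports -0.9755 for the CMI dataset) and then applied to new LFs, rather than recomputed from each small batch as in this toy example.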
Subjects
Brain Mapping, Electroencephalography, Child, Humans, Brain Mapping/methods, Computer Simulation, Neurological Models, Quality Control
ABSTRACT
PURPOSE: The aim of the present study was to compare scale and conditional reliability derived from item response theory analyses among the most commonly used, as well as several newly developed, observation, interview, and parent-report autism instruments. METHODS: When available, data sets were combined to facilitate large sample evaluation. Scale reliability (internal consistency, average corrected item-total correlations, and model reliability) and conditional reliability estimates were computed for total scores and for measure subscales. RESULTS: Generally good to excellent scale reliability was observed for total scores for all measures. Scale reliability was weaker for the restricted and repetitive behavior (RRB) subscales of the Autism Diagnostic Observation Schedule (ADOS) and the Autism Diagnostic Interview-Revised (ADI-R), reflecting the relatively small number of items for these measures. For diagnostic measures, conditional reliability tended to be very good (> 0.80) in the regions of the latent trait where ASD and non-ASD developmental disability cases would be differentiated. For parent-report scales, conditional reliability of total scores tended to be excellent (> 0.90) across very wide ranges of autism symptom levels, with a few notable exceptions. CONCLUSIONS: These findings support the use of all of the clinical observation, interview, and parent-report autism symptom measures examined, but also suggest specific limitations that warrant consideration when choosing measures for specific clinical or research applications.
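Conditional reliability is commonly derived from test information at a given trait level. The sketch below uses one widely used convention, reliability = 1 − SE², with SE = 1/√I(θ) and the latent trait scaled to unit variance; the abstract does not state which formula the study used, so this convention and the function names are assumptions.

```python
import math


def standard_error(test_information):
    """Conditional standard error of measurement at a given trait level."""
    return 1.0 / math.sqrt(test_information)


def conditional_reliability(test_information):
    """Conditional reliability, assuming a latent trait with unit variance."""
    return 1.0 - 1.0 / test_information


# Under this convention, the benchmarks of 0.80 and 0.90 correspond to
# test information of 5 and 10, respectively.
rel_very_good = conditional_reliability(5.0)
rel_excellent = conditional_reliability(10.0)
```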
ABSTRACT
Matrix reasoning tasks are among the most widely used measures of cognitive ability in the behavioral sciences, but the lack of matrix reasoning tests in the public domain complicates their use. Here, we present an extensive investigation and psychometric validation of the matrix reasoning item bank (MaRs-IB), an open-access set of matrix reasoning items. In a first study, we calibrate the psychometric functioning of the items in the MaRs-IB in a large sample of adult participants (N = 1501). Using additive multilevel item structure models, we establish that the MaRs-IB has many desirable psychometric properties: its items span a wide range of difficulty, possess medium-to-large levels of discrimination, and exhibit robust associations between item complexity and difficulty. However, we also find that item clones are not always psychometrically equivalent and cannot be assumed to be exchangeable. In a second study, we demonstrate how experimenters can use the estimated item parameters to design new matrix reasoning tests using optimal item assembly. Specifically, we design and validate two new sets of test forms in an independent sample of adults (N = 600). We find these new tests possess good reliability and convergent validity with an established measure of matrix reasoning. We hope that the materials and results made available here will encourage experimenters to use the MaRs-IB in their research.
ABSTRACT
BACKGROUND: Assessing communication skills is necessary to facilitate pro-communication skills development programs. The 23-item Communication Scale (CS) is the most widely used tool for this purpose. Since there is a scarcity of validated tools to assess communication skills among Bangladeshi adolescents, we translated this questionnaire into Bangla and validated it on a Bangladeshi adolescent sample. METHODS: We conducted two independent rounds of large-scale surveys that yielded data from 621 Bangladeshi adolescents (mean age ± SD = 16.44 ± 1.32 years), of whom 378 were male and 244 were female. The participants completed the Bangla CS. A subset of the participants (n = 160) also completed the Bangla Beck's Hopelessness Scale (BBHS), a measure of hopelessness. RESULTS: Exploratory factor analysis on the first-round data (n = 340) discarded six items, retained 17 items, and revealed a unidimensional factor structure. Confirmatory factor analysis on the second-round data (n = 281) supported the unidimensional structure (CFI = 0.94, TLI = 0.93). Measurement invariance analysis indicated that the unidimensional structure was robust across gender (143 males vs 139 females). The scale exhibited a negative correlation with the BBHS, supporting the scale's concurrent validity (r = -0.16, p < 0.01). The scale exhibited satisfactory reliability (ωt = 0.79). The item response theory-based analysis revealed that the scale was reliable (> 0.70) across a sizable range of the communication skills continuum (θ = -5.3 to 2.3) and had excellent marginal reliability (0.80). All items had adequate discriminating power (0.90 ± 0.20). CONCLUSION: The psychometric analysis of the 17-item Bangla-CS indicated that the scale is reliable and valid. We recommend that researchers and mental health practitioners utilize this scale to evaluate communication skills among Bangladeshi adolescents.
ABSTRACT
Objective: Victims of intimate partner violence (IPV) often fear their intimate partners and the abuse they perpetrate against them. Fear in the context of IPV has been studied for decades, yet we lack a rigorously validated measure. The purpose of this study was to comprehensively evaluate the psychometric properties of a multi-item scale measuring fear of an abusive male partner and/or the abuse he perpetrates. Method: We used item response modeling to evaluate the psychometric properties of a scale measuring women's fear of IPV by their male partner across two distinct samples: 1) a calibration sample of 412 women and 2) a confirmation sample of 298 women. Results: Results provide a detailed overview of the psychometric functioning of the Intimate Partner Violence Fear-11 Scale. Items were strongly related to the latent fear factor, with discrimination values universally above a = 0.80 in both samples. Overall, the IPV Fear-11 Scale is psychometrically robust across both samples. All items were highly discriminating and the full scale was reliable across the range of the latent fear trait. Reliability was exceptionally high for measuring individuals experiencing moderate to high levels of fear. Finally, the IPV Fear-11 Scale was moderately to strongly correlated with depression symptoms, posttraumatic stress symptoms and physical victimization. Conclusions: The IPV Fear-11 Scale was psychometrically robust across both samples and was associated with a number of relevant covariates. Results support the utility of the IPV Fear-11 Scale for assessing fear of an abusive partner among women in relationships with men.
ABSTRACT
PURPOSE: The animated activity questionnaire (AAQ) is a computer-based measure of activity limitations. To answer a question, patients choose the animation of a person performing an activity that matches their own level of limitation. The AAQ has not yet been tested for suitability as a computer-adaptive test (CAT). Thus, the objective of this study was to develop and evaluate an AAQ-based CAT to facilitate the application of the AAQ in daily clinical care. METHODS: Patients (n = 1408) with hip/knee osteoarthritis from Brazil, Denmark, France, The Netherlands, Norway, Spain, and the UK responded to all 17 AAQ items. Assumptions of item-response theory (IRT) modelling were investigated. To establish item parameters for the CAT, a graded response model was estimated. To evaluate the performance of post-hoc simulated AAQ-based CATs, precision, test length, and construct validity (correlations with well-established measures of activity limitations) were evaluated. RESULTS: Unidimensionality (CFI = 0.95), measurement invariance (R2-change < 2%), and IRT item fit (S-X2 p > .003) of the AAQ were supported. In simulated CATs, the mean test length was more than halved (≤ 8 items), while the range of precise measurement (standard error ≤ 0.3) was comparable to the full AAQ. The correlations between original AAQ scores and three AAQ-CAT versions were ≥ 0.95. Correlations of AAQ-CAT scores with patient-reported and performance measures of activity limitations were ≥ 0.60. CONCLUSION: The almost non-verbal AAQ-CAT is an innovative and efficient tool in patients with hip/knee osteoarthritis from various countries, measuring activity limitations with lower respondent burden, but similar precision and construct validity compared to the full AAQ.
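A post-hoc CAT simulation of the kind described above follows a simple loop: select the most informative remaining item at the current trait estimate, score the response, update the estimate, and stop once the standard error reaches the target. The sketch below is a simplified illustration only. It uses dichotomous 2PL items rather than the graded response model estimated in the study, expected a posteriori (EAP) scoring on a fixed grid, and a hypothetical 17-item bank; all names and parameter values are assumptions.

```python
import math
import random

GRID = [-4.0 + 0.1 * i for i in range(81)]  # quadrature points for theta


def p_2pl(theta, a, b):
    """2PL probability of a positive response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """2PL item information at theta."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def eap(responses):
    """EAP estimate and posterior SD under a standard normal prior.

    `responses` is a list of (a, b, x) tuples with x in {0, 1}.
    """
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)
        for a, b, x in responses:
            p = p_2pl(t, a, b)
            w *= p if x == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def simulate_cat(bank, true_theta, se_target=0.3, seed=0):
    """Administer items adaptively until SE <= se_target or the bank is empty."""
    rng = random.Random(seed)
    remaining = list(bank)
    responses = []
    theta_hat, se = 0.0, 1.0  # prior mean and SD before any response
    while remaining and se > se_target:
        # Maximum-information item selection at the current estimate.
        item = max(remaining,
                   key=lambda ab: item_information(theta_hat, ab[0], ab[1]))
        remaining.remove(item)
        x = 1 if rng.random() < p_2pl(true_theta, item[0], item[1]) else 0
        responses.append((item[0], item[1], x))
        theta_hat, se = eap(responses)
    return theta_hat, se, len(responses)


# Hypothetical bank: 17 items, a = 2.0, locations spread from -2 to 2.
bank = [(2.0, -2.0 + 0.25 * i) for i in range(17)]
theta_hat, se, n_used = simulate_cat(bank, true_theta=0.5)
```

Repeating the simulation over many simulated or observed respondents and averaging `n_used` gives the mean test length, which can be compared against the full-length instrument as in the study.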
ABSTRACT
Objective: To evaluate the psychometric properties of the GAD-7 by obtaining evidence of internal structure (dimensionality, precision and differential functioning of items) and association with external variables. Methods: A total of 2,219 protocols from three different studies conducted with Puerto Rican employees that administered the GAD-7 were selected for the current study. Item response theory modeling was used to assess internal structure, and linear association with external variables was examined. Results: The items fit a graded response model, with high similarity in the discrimination and location parameters, as well as in the precision at the level of the items and in the total score. No violation of local independence or differential item functioning was detected. The associations with convergent (work-related rumination) and divergent (work engagement, sex, and age) variables were theoretically consistent. Conclusion: The GAD-7 is a psychometrically robust tool for detecting individual variability in symptoms of anxiety in workers.
ABSTRACT
To measure the parallel interactive development of latent ability and processing speed using longitudinal item response accuracy (RA) and longitudinal response time (RT) data, we proposed three longitudinal joint modeling approaches from the structural equation modeling perspective, namely unstructured-covariance-matrix-based longitudinal joint modeling, latent growth curve-based longitudinal joint modeling, and autoregressive cross-lagged longitudinal joint modeling. The proposed modeling approaches can not only provide the developmental trajectories of latent ability and processing speed individually, but also exploit the relationship between the change in latent ability and processing speed through the across-time relationships of these two constructs. The results of two empirical studies indicate that (1) all three models are practically applicable and have highly consistent conclusions in terms of the changes in ability and speed in the analysis of the same data set, and (2) additional analysis of the RT data and acquisition of individual processing speed measurements can reveal the parallel interactive development phenomena that are difficult to detect using RA data alone. Furthermore, the results of our simulation study demonstrate that the proposed Bayesian Markov chain Monte Carlo estimation algorithm can ensure accurate model parameter recovery for all three proposed longitudinal joint models. Finally, the implications of our findings are discussed from the research and practice perspectives.
ABSTRACT
OBJECTIVE: Electronic cigarettes are the most commonly used tobacco products among young adults. Measures of beliefs about outcomes of use (i.e., expectancies) can be helpful in predicting use, as well as informing and evaluating interventions to impact use. METHODS: We surveyed young adult students (N = 2296, mean age = 20.0, SD = 1.8, 64% female, 34% White) from a community college, a historically black university, and a state university. Students answered electronic nicotine delivery system (ENDS) expectancy items derived from focus groups and expert panel refinement using Delphi methods. Factor analysis and item response theory (IRT) methods were used to understand relevant factors and identify useful items. RESULTS: A 5-factor solution [Positive Reinforcement (consisting of Stimulation, Sensorimotor, and Taste subthemes, α = .92), Negative Consequences (Health Risks and Stigma, α = .94), Negative Affect Reduction (α = .95), Weight Control (α = .92), and Addiction (α = .87)] fit the data well (CFI = 0.95; TLI = 0.94; RMSEA = 0.05) and was invariant across subgroups. Factors were significantly correlated with relevant vaping measures, including vaping susceptibility and lifetime vaping. Hierarchical linear regression demonstrated that the factors were significant predictors of lifetime vaping after controlling for demographics, vaping ad exposure, and peer/family vaping. IRT analyses indicated that individual items tended to be related to their underlying constructs (a parameters ranged from 1.26 to 3.18) and covered a relatively wide range of the expectancies continuum (b parameters ranged from -0.72 to 2.47). CONCLUSIONS: A novel ENDS expectancy measure appears to be a reliable measure for young adults, with promising results in the domains of concurrent validity, incremental validity, and IRT characteristics. This tool may be helpful in predicting use and informing future interventions. IMPLICATIONS: Findings provide support for the future development of computerized adaptive testing of vaping beliefs.
Expectancies appear to play a role in vaping similar to smoking and other substance use. Public health messaging should target expectancies to modify young adult vaping behavior.