1.
Biostatistics ; 23(4): 1074-1082, 2022 10 14.
Article in English | MEDLINE | ID: mdl-34718422

ABSTRACT

There is a great need for statistical methods for analyzing skewed responses in complex sample surveys. Quantile regression is a logical option for addressing this problem but is often accompanied by incorrect variance estimation. We show how the variance can be estimated correctly by incorporating the survey design into the variance estimation process. In a simulation study, we illustrate that the variance estimator of the median regression coefficients has very small relative bias and appropriate coverage probability. The motivation for our work stems from the National Health and Nutrition Examination Survey, where we demonstrate the impact of our results on iodine deficiency in females compared with males, adjusting for other covariates.
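The paper's exact variance estimator is not given in the abstract; a hedged sketch of one standard design-based route is a stratified with-replacement PSU bootstrap around a median regression. The column names `y`, `x`, `stratum`, and `psu` are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

def median_reg_boot_se(df, formula="y ~ x", n_boot=200):
    """Stratified with-replacement PSU bootstrap around a median regression."""
    point = smf.quantreg(formula, df).fit(q=0.5).params
    reps = []
    for _ in range(n_boot):
        pieces = []
        for _, stratum in df.groupby("stratum"):
            psus = stratum["psu"].unique()
            for p in rng.choice(psus, size=len(psus), replace=True):
                pieces.append(stratum[stratum["psu"] == p])
        rep = pd.concat(pieces, ignore_index=True)
        reps.append(smf.quantreg(formula, rep).fit(q=0.5).params)
    return point, pd.DataFrame(reps).std()  # point estimates, bootstrap SEs
```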


Subjects
Iodine, Bias, Computer Simulation, Female, Humans, Male, Nutrition Surveys, Surveys and Questionnaires
2.
Psychother Res ; 33(6): 683-695, 2023 07.
Article in English | MEDLINE | ID: mdl-36669124

ABSTRACT

Objective: Dropout from psychological interventions is associated with poor treatment outcomes and high health, societal, and economic costs. Recently, machine learning (ML) algorithms have been tested in psychotherapy outcome research. Dropout predictions are usually limited by imbalanced datasets and small sample sizes. This paper aims to improve dropout prediction by comparing ML algorithms, sample sizes, and resampling methods. Method: Twenty ML algorithms were examined in twelve subsamples (drawn from a sample of N = 49,602) using four resampling methods, compared with the absence of resampling and with each other. Prediction accuracy was evaluated in an independent holdout dataset using the F1-measure. Results: Resampling methods improved the performance of the ML algorithms, and down-sampling can be recommended because it was the fastest method and as accurate as the others. A minimum sample size of N = 300 was necessary to reach the highest mean F1-score of .51. No specific algorithm or algorithm group can be recommended. Conclusion: Resampling methods can improve the accuracy of predicting dropout in psychological interventions. Down-sampling is recommended as the least computationally taxing method. The training sample should contain at least 300 cases.
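As an illustration of the recommended down-sampling step, here is a minimal sketch using imbalanced-learn's RandomUnderSampler and holdout F1, with synthetic data standing in for the (unavailable) psychotherapy dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

# Imbalanced synthetic data: ~15% positives (dropouts).
X, y = make_classification(n_samples=5000, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Down-sample the majority class of the training set only.
X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X_tr, y_tr)
clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
print("holdout F1:", f1_score(y_te, clf.predict(X_te)))
```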


Subjects
Algorithms, Machine Learning, Humans, Sample Size, Psychotherapy
3.
Sensors (Basel) ; 22(14)2022 Jul 07.
Article in English | MEDLINE | ID: mdl-35890778

ABSTRACT

Due to its high sensitivity, electrohysterography (EHG) has emerged as an alternative technique for predicting preterm labor. The main obstacle in designing preterm labor prediction models is the inherent preterm/term imbalance ratio, which can give rise to relatively low performance. Numerous studies have obtained promising preterm labor prediction results using the synthetic minority oversampling technique. However, these studies generally overestimate the real generalization capacity of mathematical models by generating synthetic data before splitting the dataset, leaking information between the training and testing partitions and thus reducing the complexity of the classification task. In this work, we analyzed the effect of combining feature selection and resampling methods to overcome the class imbalance problem when predicting preterm labor by EHG. We assessed undersampling, oversampling, and hybrid methods applied to the training and validation datasets during feature selection by genetic algorithm, and analyzed the effect of resampling the training data after obtaining the optimized feature subset. The best strategy consisted of undersampling the majority class of the validation dataset to a 1:1 ratio during feature selection, without subsequent resampling of the training data, achieving an AUC of 94.5 ± 4.6%, average precision of 84.5 ± 11.7%, maximum F1-score of 79.6 ± 13.8%, and recall of 89.8 ± 12.1%. Our results outperformed the techniques currently used in clinical practice, suggesting that EHG could be used to predict preterm labor in the clinic.
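The leakage pitfall flagged above is easy to reproduce and to avoid; a minimal sketch, assuming scikit-learn and imbalanced-learn, shows the correct ordering (split first, then oversample only the training partition).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Correct: split first, then synthesize minority samples from training data only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_tr_res, y_tr_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

# Leaky (what the abstract warns against): oversampling before splitting lets
# synthetic points built from test-set neighbours end up in the training set,
# e.g. X_res, y_res = SMOTE().fit_resample(X, y) followed by train_test_split.
```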


Subjects
Premature Obstetric Labor, Premature Birth, Female, Humans, Newborn Infant, Theoretical Models, Premature Obstetric Labor/diagnosis, Premature Birth/diagnosis, Uterus
4.
Article in English | MEDLINE | ID: mdl-36688204

ABSTRACT

Estimation of nonlinear curves and surfaces has long been the focus of semiparametric and nonparametric regression analysis. What has been less studied is the comparison of nonlinear functions. In lower-dimensional situations, inference typically involves comparisons of curves and surfaces. The existing comparative procedures are subject to various limitations, and few computational tools have been made available for off-the-shelf use. To address these limitations, two modified testing procedures for nonlinear curve and surface comparisons are proposed. The proposed computational tools are implemented in an R package, with a syntax similar to that of the commonly used model fitting packages. An R Shiny application is provided with an interactive interface for analysts who do not use R. The new tests are consistent against fixed alternative hypotheses. Theoretical details are presented in an appendix. Operating characteristics of the proposed tests are assessed against the existing methods. Applications of the methods are illustrated through real data examples.
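The abstract does not spell out the proposed test statistics, so the following is only a generic permutation sketch for comparing two nonlinear curves: fit a smoother per group (a polynomial here) and use the integrated squared difference between fits as the statistic.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_curve(x, y, deg=4):
    return np.poly1d(np.polyfit(x, y, deg))

def curve_perm_test(x, y, g, n_perm=999, deg=4):
    """Permutation p-value for H0: both groups share one mean curve."""
    grid = np.linspace(x.min(), x.max(), 100)
    def stat(labels):
        f0 = fit_curve(x[labels == 0], y[labels == 0], deg)
        f1 = fit_curve(x[labels == 1], y[labels == 1], deg)
        return np.mean((f0(grid) - f1(grid)) ** 2)
    obs = stat(g)
    perms = [stat(rng.permutation(g)) for _ in range(n_perm)]
    return obs, (1 + sum(p >= obs for p in perms)) / (n_perm + 1)
```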

5.
Biometrics ; 74(1): 196-206, 2018 03.
Article in English | MEDLINE | ID: mdl-29542118

ABSTRACT

Researchers in genetics and other life sciences commonly use permutation tests to evaluate differences between groups. Permutation tests have desirable properties, including exactness if data are exchangeable, and are applicable even when the distribution of the test statistic is analytically intractable. However, permutation tests can be computationally intensive. We propose both an asymptotic approximation and a resampling algorithm for quickly estimating small permutation p-values (e.g., <10⁻⁶) for the difference and ratio of means in two-sample tests. Our methods are based on the distribution of test statistics within and across partitions of the permutations, which we define. In this article, we present our methods and demonstrate their use through simulations and an application to cancer genomic data. Through simulations, we find that our resampling algorithm is more computationally efficient than another leading alternative, particularly for extremely small p-values (e.g., <10⁻³⁰). Through application to cancer genomic data, we find that our methods can successfully identify up- and down-regulated genes. While we focus on the difference and ratio of means, we speculate that our approaches may work in other settings.
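For contrast with the paper's fast approximations, a brute-force two-sample permutation test looks like the sketch below; its resolution is bounded by the number of permutations (a p-value below 1/(n_perm + 1) cannot be resolved), which is exactly the regime the paper targets.

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_pvalue(a, b, n_perm=10_000):
    """Two-sided permutation p-value for a difference in means."""
    pooled = np.concatenate([a, b])
    obs = abs(a.mean() - b.mean())
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        hits += abs(perm[:len(a)].mean() - perm[len(a):].mean()) >= obs
    return (hits + 1) / (n_perm + 1)
```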


Subjects
Genomics/methods, Statistical Models, Algorithms, Animals, Computer Simulation, Gene Expression Profiling, Gene Expression Regulation, Genomics/statistics & numerical data, Humans, Neoplasms/genetics
6.
Biometrics ; 74(2): 653-662, 2018 06.
Article in English | MEDLINE | ID: mdl-29120492

ABSTRACT

Complex interplay between genetic and environmental factors characterizes the etiology of many diseases. Modeling gene-environment (GxE) interactions is often challenged by the unknown functional form of the environment term in the true data-generating mechanism. We study the impact of misspecification of the environmental exposure effect on inference for the GxE interaction term in linear and logistic regression models. We first examine the asymptotic bias of the GxE interaction regression coefficient, allowing for confounders as well as arbitrary misspecification of the exposure and confounder effects. For linear regression, we show that under gene-environment independence and some confounder-dependent conditions, when the environment effect is misspecified, the regression coefficient of the GxE interaction can be unbiased. However, inference on the GxE interaction is still often incorrect. In logistic regression, we show that the regression coefficient is generally biased if the genetic factor is associated with the outcome directly or indirectly. Further, we show that the standard robust sandwich variance estimator for the GxE interaction does not perform well in practical GxE studies, and we provide an alternative testing procedure that has better finite sample properties.


Subjects
Environmental Exposure, Gene-Environment Interaction, Genetic Models, Bias, Epidemiologic Confounding Factors, Humans, Linear Models, Scientific Experimental Error/statistics & numerical data
7.
Biostatistics ; 17(1): 1-15, 2016 Jan.
Article in English | MEDLINE | ID: mdl-26363037

ABSTRACT

For aggregation tests of genes or regions, the set of included variants often has a small total minor allele count (MAC), and this is particularly true when the most deleterious sets of variants are considered. When the MAC is low, commonly used asymptotic tests are not well calibrated for binary phenotypes and can give conservative or anti-conservative results with potential power loss. Empirical p-values obtained via resampling methods are computationally costly for highly significant p-values, and the results can be conservative due to the discrete nature of resampling tests. Based on the observation that only the individuals carrying minor alleles contribute to the score statistics, we develop an efficient resampling method for single- and multiple-variant score-based tests that can adjust for covariates. Our method can improve computational efficiency more than 1000-fold over conventional resampling for low-MAC variant sets. We ameliorate the conservativeness of the results through the use of mid-p-values. Using the estimated minimum achievable p-value for each test, we calibrate QQ plots and provide an effective number of tests. In an analysis of a case-control study with deep exome sequencing data, we demonstrate that our methods are well calibrated and also reduce computation time significantly compared with resampling methods.
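The mid-p correction mentioned above has a simple generic form for a discrete resampling null distribution: count ties at the observed statistic with half weight. A minimal sketch:

```python
import numpy as np

def mid_p(observed, null_stats):
    """Mid-p-value against a discrete (resampled) null distribution."""
    null_stats = np.asarray(null_stats)
    return np.mean(null_stats > observed) + 0.5 * np.mean(null_stats == observed)
```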


Subjects
Genetic Association Studies/methods, Genetic Variation, Statistical Models, DNA Sequence Analysis/methods, Calibration, Humans
8.
Philos Trans A Math Phys Eng Sci ; 375(2100)2017 Aug 13.
Article in English | MEDLINE | ID: mdl-29052545

ABSTRACT

Data-driven risk analysis involves the inference of probability distributions from measured or simulated data. In the case of a highly reliable system, such as the electricity grid, the amount of relevant data is often exceedingly limited, but the impact of estimation errors may be very large. This paper presents a robust non-parametric Bayesian method to infer possible underlying distributions. The method obtains rigorous error bounds even for small samples taken from ill-behaved distributions. The approach taken has a natural interpretation in terms of the intervals between ordered observations, where allocation of probability mass across intervals is well specified, but the location of that mass within each interval is unconstrained. This formulation gives rise to a straightforward computational resampling method: Bayesian interval sampling. In a comparison with common alternative approaches, it is shown to satisfy strict error bounds even for ill-behaved distributions. This article is part of the themed issue 'Energy management: flexibility, risk and optimization'.
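A hedged reading of the interval formulation: with Dirichlet weights over the gaps between ordered observations (plus known support bounds), bounds on a statistic such as the mean follow from pushing each gap's mass to its left or right endpoint. A sketch under those assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def interval_mean_bounds(x, lo, hi, n_draws=1000):
    """Posterior draws of lower/upper bounds on the mean (hedged sketch)."""
    edges = np.concatenate([[lo], np.sort(x), [hi]])  # gap endpoints
    w = rng.dirichlet(np.ones(len(edges) - 1), size=n_draws)  # mass per gap
    lower = w @ edges[:-1]  # all mass pushed to left endpoints
    upper = w @ edges[1:]   # all mass pushed to right endpoints
    return lower, upper
```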

9.
Metrologia ; 54(2): 204-217, 2017 Apr.
Article in English | MEDLINE | ID: mdl-29056762

ABSTRACT

In the electronic measurement of the Boltzmann constant based on Johnson noise thermometry, the ratio of the power spectral densities of thermal noise across a resistor at the triple point of water, and pseudo-random noise synthetically generated by a quantum-accurate voltage-noise source is constant to within 1 part in a billion for frequencies up to 1 GHz. Given knowledge of this ratio, and the values of other parameters that are known or measured, one can determine the Boltzmann constant. Due, in part, to mismatch between transmission lines, the experimental ratio spectrum varies with frequency. We model this spectrum as an even polynomial function of frequency where the constant term in the polynomial determines the Boltzmann constant. When determining this constant (offset) from experimental data, the assumed complexity of the ratio spectrum model and the maximum frequency analyzed (fitting bandwidth) dramatically affects results. Here, we select the complexity of the model by cross-validation - a data-driven statistical learning method. For each of many fitting bandwidths, we determine the component of uncertainty of the offset term that accounts for random and systematic effects associated with imperfect knowledge of model complexity. We select the fitting bandwidth that minimizes this uncertainty. In the most recent measurement of the Boltzmann constant, results were determined, in part, by application of an earlier version of the method described here. Here, we extend the earlier analysis by considering a broader range of fitting bandwidths and quantify an additional component of uncertainty that accounts for imperfect performance of our fitting bandwidth selection method. For idealized simulated data with additive noise similar to experimental data, our method correctly selects the true complexity of the ratio spectrum model for all cases considered. A new analysis of data from the recent experiment yields evidence for a temporal trend in the offset parameters.
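A minimal sketch of the model-selection idea, assuming the ratio spectrum is modeled as an even polynomial in frequency whose constant term is the offset of interest, with the degree chosen by cross-validated error:

```python
import numpy as np
from sklearn.model_selection import KFold

def even_design(f, degree):
    # Columns 1, f^2, f^4, ... up to the requested even degree.
    return np.column_stack([f ** k for k in range(0, degree + 1, 2)])

def cv_select_degree(f, r, degrees=(0, 2, 4, 6, 8), n_splits=5):
    best = None
    for d in degrees:
        err = 0.0
        for tr, te in KFold(n_splits, shuffle=True, random_state=0).split(f):
            coef, *_ = np.linalg.lstsq(even_design(f[tr], d), r[tr], rcond=None)
            err += np.mean((even_design(f[te], d) @ coef - r[te]) ** 2)
        if best is None or err < best[1]:
            best = (d, err)
    return best[0]  # degree with the smallest cross-validated error
```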

10.
J Anim Breed Genet ; 132(3): 218-28, 2015 Jun.
Article in English | MEDLINE | ID: mdl-25727456

ABSTRACT

Bootstrap aggregation (bagging) is a resampling method known to produce more accurate predictions when predictors are unstable or when the number of markers is much larger than the sample size, owing to its variance reduction capabilities. The purpose of this study was to compare genomic best linear unbiased prediction (GBLUP) with bootstrap-aggregated GBLUP (bagged GBLUP, or BGBLUP) in terms of prediction accuracy. We used a 600K Affymetrix platform with 1351 birds genotyped and phenotyped for three traits in broiler chickens: body weight, ultrasound measurement of breast muscle, and hen-house egg production. The predictive performance of GBLUP versus BGBLUP was evaluated in different scenarios: including or excluding the top 20 markers from a standard genome-wide association study (GWAS) as fixed effects in the GBLUP model, and varying training sample sizes and allele frequency bins. Predictive performance was assessed via five replications of a threefold cross-validation using the correlation between observed and predicted values and the prediction mean-squared error. GBLUP overfitted the training data, and BGBLUP delivered better predictive ability in the testing sets. Including the top 20 GWAS markers in the model as fixed effects improved prediction accuracy and added to the advantage of BGBLUP over GBLUP. The performance of GBLUP and BGBLUP across allele frequency bins and training sample sizes was similar. In general, the results of this study confirm that BGBLUP can be valuable for enhancing genome-enabled prediction of complex traits.
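Since GBLUP is equivalent to ridge regression on marker genotypes, a minimal BGBLUP sketch is a bagged ridge predictor; the penalty and bag count below are illustrative, not the study's settings.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def bagged_gblup_predict(X_train, y_train, X_test, n_bags=50, alpha=1.0):
    """Average ridge ("GBLUP-like") predictions over bootstrap samples."""
    preds = np.zeros(len(X_test))
    for _ in range(n_bags):
        idx = rng.choice(len(y_train), size=len(y_train), replace=True)
        preds += Ridge(alpha=alpha).fit(X_train[idx], y_train[idx]).predict(X_test)
    return preds / n_bags
```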


Subjects
Chickens/genetics, Genomics/methods, Animals, Body Weight/genetics, Chickens/growth & development, Chickens/metabolism, Female, Gene Frequency, Machine Learning, Male, Animal Mammary Glands/diagnostic imaging, Ovum/metabolism, Phenotype, Ultrasonography
11.
Appl Radiat Isot ; 210: 111341, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38744039

ABSTRACT

We developed a novel quadratic resampling method for summing γ-ray spectra with different calibration parameters. We investigated a long-term environmental background γ-ray spectrum by summing 114 spectra measured using a 30% HPGe detector between 2017 and 2021. Gain variations in different measurement periods shift γ-ray peak positions by a fraction of the pulse-height bin size, up to around 2 keV. The resampling method was applied to measure low-level background γ-ray peaks over a wide energy range from 50 keV to 3 MeV. We additionally document temporal variations in the activities of major γ-ray peaks, such as ⁴⁰K (1461 keV), ²⁰⁸Tl (2615 keV), and other typical nuclides, along with contributions from cosmic rays. The normal distribution of γ-ray background count rates, as evidenced by quantile-quantile plots, indicates consistent data collection throughout the measurement period. Consequently, we assert that the quadratic resampling method for accumulating γ-ray spectra surpasses the linear method (Bossew, 2005) in various respects.
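The paper's exact scheme is not reproduced in the abstract; one hedged sketch of calibration-aware rebinning interpolates each spectrum's cumulative counts onto a common energy grid with a quadratic interpolant and differences the result.

```python
import numpy as np
from scipy.interpolate import interp1d

def rebin(channel_edges_keV, counts, common_edges_keV):
    """Rebin one spectrum onto a common grid; len(edges) == len(counts) + 1."""
    cum = np.concatenate([[0.0], np.cumsum(counts)])
    f = interp1d(channel_edges_keV, cum, kind="quadratic",
                 bounds_error=False, fill_value=(0.0, cum[-1]))
    return np.diff(f(common_edges_keV))
```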

12.
Eur J Neurol ; 20(10): 1423-5, 2013 Oct.
Article in English | MEDLINE | ID: mdl-23293907

ABSTRACT

BACKGROUND AND PURPOSE: The presence of cognitive impairment (CI) among benign MS (BMS) patients has challenged current BMS criteria. We hypothesized that a low evoked potentials score (EP-score) at the first neurological evaluation would help identify BMS patients without CI. METHODS: The EP-score was retrospectively computed in 29 putative BMS patients, who were then tested for CI during 2012. The difference in the prevalence of CI between low EP-score patients and rates in the recent literature was assessed using resampling methods. RESULTS: Among 23 low EP-score patients, only 3 (13%) had CI. This percentage was significantly lower (P-values 0.05-0.005) than rates reported in the recent literature (39-46%). CONCLUSION: A low EP-score at the first neurological evaluation successfully helps to identify BMS patients without CI.
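One plausible reading of the resampling comparison is a simple Monte Carlo check: how often would a cohort of 23 patients show 3 or fewer CI cases if the true prevalence matched the literature range?

```python
import numpy as np

rng = np.random.default_rng(0)

for prev in (0.39, 0.46):
    draws = rng.binomial(n=23, p=prev, size=100_000)
    print(f"prevalence {prev:.0%}: P(<= 3 of 23 with CI) = {np.mean(draws <= 3):.4f}")
```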


Subjects
Evoked Potentials/physiology, Relapsing-Remitting Multiple Sclerosis, Adult, Cognitive Disorders/etiology, Cognitive Disorders/physiopathology, Female, Humans, Male, Relapsing-Remitting Multiple Sclerosis/complications, Relapsing-Remitting Multiple Sclerosis/physiopathology
13.
Am J Phys Anthropol ; 152(3): 393-406, 2013 Nov.
Article in English | MEDLINE | ID: mdl-24104947

ABSTRACT

Previous analyses of hand morphology in Australopithecus afarensis have concluded that this taxon had modern human-like manual proportions, with relatively long thumbs and short fingers. These conclusions are based on the A.L.333 composite fossil assemblage from Hadar, Ethiopia, and are premised on the ability to assign phalanges to a single individual, and to the correct side and digit. Neither assignment is secure, however, given the taphonomy and sample composition at A.L.333. We use a resampling approach that includes the entire assemblage of complete hand elements at Hadar and takes into account uncertainties in identifying phalanges by individual, side, and digit number. This approach provides the most conservative estimates of manual proportions in Au. afarensis. We resampled hand long bone lengths in Au. afarensis and extant hominoids, and obtained confidence limits for the distributions of manual proportions in the latter. The results confirm that intrinsic manual proportions in Au. afarensis are dissimilar to those of Pan and Pongo. However, manual proportions in Au. afarensis often fall at the upper end of the distribution in Gorilla and at the very low end in Homo, corresponding to disproportionately short thumbs and long medial digits relative to Homo. This suggests that manual proportions in Au. afarensis, particularly metacarpal proportions, were not as derived towards Homo as previously described, but rather are intermediate between gorillas and humans. Functionally, these results suggest Au. afarensis could not produce precision grips with the same efficiency as modern humans, which may in part account for the absence of lithic technology in this fossil taxon.
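A generic version of the resampling idea, with illustrative variable names: draw bone lengths without assuming individual, side, or digit assignment, and build percentile limits for a proportion.

```python
import numpy as np

rng = np.random.default_rng(0)

def ratio_limits(thumb_lengths, ray3_lengths, n_resamp=10_000, level=0.95):
    """Percentile limits for a thumb/third-ray proportion under resampling."""
    t = rng.choice(thumb_lengths, size=n_resamp, replace=True)
    f = rng.choice(ray3_lengths, size=n_resamp, replace=True)
    a = (1 - level) / 2
    return np.quantile(t / f, [a, 1 - a])
```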


Subjects
Fossils, Hand Bones/anatomy & histology, Hominidae/anatomy & histology, Animals, Anthropometry, Female, Male
14.
J Appl Stat ; 49(5): 1179-1202, 2022.
Article in English | MEDLINE | ID: mdl-35707515

ABSTRACT

The bootstrap procedure has emerged as a general framework for constructing prediction intervals for future observations in autoregressive time series models. Outlying data points are common in real-data applications of such models, especially in econometrics. These outliers tend to produce high forecast errors, which reduce the forecasting performance of existing bootstrap prediction intervals calculated from non-robust estimators. For univariate and multivariate autoregressive time series, we propose a robust bootstrap algorithm for constructing prediction intervals and forecast regions. The proposed procedure is based on weighted likelihood estimates and weighted residuals. Its finite sample properties are examined via a series of Monte Carlo studies and two empirical data examples.
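As a baseline for the robust variant described above, a plain residual-bootstrap prediction interval for an AR(1) model looks as follows; the paper replaces the least-squares fit and raw residuals with weighted likelihood estimates and weighted residuals.

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_boot_interval(y, horizon=1, n_boot=2000, level=0.95):
    """Residual-bootstrap prediction interval for an AR(1) series y."""
    y0, y1 = y[:-1], y[1:]
    phi = (y0 @ y1) / (y0 @ y0)       # least-squares AR coefficient
    resid = y1 - phi * y0
    resid = resid - resid.mean()      # centred residuals to resample
    sims = []
    for _ in range(n_boot):
        x = y[-1]
        for _ in range(horizon):
            x = phi * x + rng.choice(resid)
        sims.append(x)
    a = (1 - level) / 2
    return np.quantile(sims, [a, 1 - a])
```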

15.
Stat (Int Stat Inst) ; 11(1): e470, 2022 Dec.
Article in English | MEDLINE | ID: mdl-36589778

ABSTRACT

An app-based clinical trial enrolment process can give rise to duplicated records, with data management implications. Our objective was to identify duplicated records in real time in the Apple Heart Study (AHS). We leveraged personal identifiable information (PII) to develop a dissimilarity score (DS) using the Damerau-Levenshtein distance. For computational efficiency, we focused on the four types of records at the highest risk of duplication. We used the receiver operating characteristic (ROC) curve and resampling methods to derive and validate a decision rule for classifying duplicated records. We identified 16,398 (4%) duplicated participants, resulting in 419,297 unique participants out of a total of 438,435 possible. Our decision rule yielded a high positive predictive value (96%) with negligible impact on the trial's original findings. Our findings provide principled solutions for future digital trials. When establishing deduplication procedures for digital trials, we recommend collecting device identifiers in addition to participant identifiers; collecting and ensuring secure access to PII; conducting a pilot study to identify reasons for duplicated records; establishing an initial deduplication algorithm that can be refined; creating a data quality plan that informs refinement; and embedding the initial deduplication algorithm in the enrolment platform to ensure unique enrolment and linkage to previous records.
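A field-wise dissimilarity score in the spirit described above can be sketched with the jellyfish library's Damerau-Levenshtein implementation; the field names and equal weighting are hypothetical, not the study's exact recipe.

```python
import jellyfish  # provides damerau_levenshtein_distance

FIELDS = ["first_name", "last_name", "dob", "zip"]  # hypothetical PII fields

def dissimilarity(rec_a: dict, rec_b: dict) -> float:
    """Mean normalized Damerau-Levenshtein distance across PII fields."""
    scores = []
    for f in FIELDS:
        a, b = str(rec_a[f]).lower(), str(rec_b[f]).lower()
        d = jellyfish.damerau_levenshtein_distance(a, b)
        scores.append(d / max(len(a), len(b), 1))
    return sum(scores) / len(scores)  # 0 = identical, 1 = entirely different

# Pairs scoring below a threshold chosen from the ROC analysis would be
# flagged as likely duplicates.
```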

16.
PeerJ Comput Sci ; 8: e573, 2022.
Article in English | MEDLINE | ID: mdl-35634102

ABSTRACT

The development of correct and effective software defect prediction (SDP) models is one of the utmost needs of the software industry. Statistics of many defect-related open-source datasets reveal a class imbalance problem in object-oriented projects. Models trained on imbalanced data lead to inaccurate future predictions owing to biased learning and ineffective defect prediction. In addition, a large number of software metrics degrades model performance. This study aims at (1) identifying useful metrics in the software using correlation-based feature selection, (2) an extensive comparative analysis of 10 resampling methods for generating effective machine learning models from imbalanced data, (3) the inclusion of stable performance evaluators (AUC, GMean, and Balance), and (4) the statistical validation of results. The impact of the 10 resampling methods is analyzed on selected features of 12 object-oriented Apache datasets using 15 machine learning techniques. The performance of the developed models is analyzed using AUC, GMean, Balance, and sensitivity. The statistical results advocate the use of resampling methods to improve SDP, with random oversampling showing the best predictive capability among the developed defect prediction models. The study provides a guideline for identifying metrics that are influential for SDP, and the performance of oversampling methods was superior to that of undersampling methods.
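Of the evaluators named above, GMean and Balance are the less standard; both follow from the confusion matrix (Balance as commonly defined in the SDP literature, i.e., normalized distance from the ideal ROC point PF = 0, PD = 1).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def gmean_balance(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    pd_ = tp / (tp + fn)   # probability of detection (recall)
    pf = fp / (fp + tn)    # probability of false alarm
    gmean = np.sqrt(pd_ * (1 - pf))
    balance = 1 - np.sqrt(pf ** 2 + (1 - pd_) ** 2) / np.sqrt(2)
    return gmean, balance
```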

17.
Future Med Chem ; 14(10): 701-715, 2022 05.
Article in English | MEDLINE | ID: mdl-35393862

ABSTRACT

Background: Marburg virus (MARV) causes sporadic outbreaks of a zoonotic disease that produces lethal hemorrhagic fever in humans. We propose a deep learning model with resampling techniques to predict the inhibitory activity of unknown compounds against MARV in a virtual screening process. Methodology & results: We applied resampling techniques to solve the imbalanced data problem. The classifier model comparisons revealed that a hybrid model combining the synthetic minority oversampling technique with edited nearest neighbors and an artificial neural network (SMOTE-ENN + ANN) achieved better classification performance, with 95% overall accuracy. The trained SMOTE-ENN + ANN hybrid model predicted as lead molecules 25 out of 87,043 compounds from ChemDiv, four out of 340 from the ChEMBL anti-viral library, three out of 918 from the Phytochemical database, seven out of 419 natural products from NCI divsetIV, and 214 out of 112,267 natural compounds from the ZINC database for MARV. Conclusion: Our studies reveal that the proposed SMOTE-ENN + ANN hybrid model can improve overall accuracy more effectively and predict new lead molecules against MARV.
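A minimal sketch of the SMOTE-ENN + ANN pipeline, using imbalanced-learn's combined resampler and a small multilayer perceptron, with synthetic features standing in for the molecular descriptors:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from imblearn.combine import SMOTEENN

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample with SMOTE, then clean noisy points with edited nearest neighbours.
X_res, y_res = SMOTEENN(random_state=0).fit_resample(X_tr, y_tr)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                    random_state=0).fit(X_res, y_res)
print("holdout accuracy:", clf.score(X_te, y_te))
```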


Subjects
Deep Learning, Marburgvirus, Algorithms, Cluster Analysis, Humans, Neural Networks (Computer)
18.
J Multivar Anal ; 183, 2021 May.
Article in English | MEDLINE | ID: mdl-33518826

ABSTRACT

Canonical correlation analysis (CCA) is a common method used to estimate the associations between two different sets of variables by maximizing the Pearson correlation between linear combinations of the two sets of variables. We propose a version of CCA for transelliptical distributions with an elliptical copula, using pairwise Kendall's tau to estimate a latent scatter matrix. Because Kendall's tau relies only on the ranks of the data, this method does not make any assumptions about the marginal distributions of the variables and is valid even when moments do not exist. We establish consistency and asymptotic normality for canonical directions and correlations estimated using Kendall's tau. Simulations indicate that this estimator outperforms standard CCA for data generated from heavy-tailed elliptical distributions. Our method also identifies more meaningful relationships when the marginal distributions are skewed. We also propose a method for testing for non-zero canonical correlations using bootstrap methods. This testing procedure does not require any assumptions on the joint distribution of the variables and works for all elliptical copulas. This is in contrast to permutation tests, which are only valid when data are generated from a distribution with a Gaussian copula. The method's practical utility is shown in an analysis of the association between radial diffusivity in white matter tracts and cognitive test scores for six-year-old children from the Early Brain Development Study at UNC-Chapel Hill. An R package implementing this method is available at github.com/blangworthy/transCCA.
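A sketch of the rank-based estimator: build the latent scatter matrix from pairwise Kendall's tau via the sine transform (valid under an elliptical copula), then solve the usual CCA eigenproblem on its blocks.

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_scatter(Z):
    """Latent correlation matrix via sin(pi * tau / 2) on all column pairs."""
    p = Z.shape[1]
    R = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            tau, _ = kendalltau(Z[:, i], Z[:, j])
            R[i, j] = R[j, i] = np.sin(np.pi * tau / 2)
    return R

def kendall_cca(X, Y):
    p = X.shape[1]
    R = kendall_scatter(np.column_stack([X, Y]))
    Rxx, Ryy, Rxy = R[:p, :p], R[p:, p:], R[:p, p:]
    M = np.linalg.solve(Rxx, Rxy) @ np.linalg.solve(Ryy, Rxy.T)
    eigvals = np.linalg.eigvals(M)  # squared canonical correlations
    return np.sqrt(np.sort(np.abs(eigvals))[::-1])
```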

19.
Risk Manag Healthc Policy ; 14: 3711-3720, 2021.
Article in English | MEDLINE | ID: mdl-34522147

ABSTRACT

OBJECTIVE: The goal of this study was to establish the most efficient boosting method for predicting neonatal low Apgar scores following labor induction and to assess whether resampling strategies would improve the predictive performance of the selected boosting algorithms. METHODS: A total of 7716 singleton births delivered from 2000 to 2015 were analyzed. Cesarean deliveries following labor induction, deliveries with abnormal presentation, and deliveries with missing Apgar score or delivery mode information were excluded. We examined the effect of resampling approaches, or data preprocessing, on predicting low Apgar scores, specifically the synthetic minority oversampling technique (SMOTE), borderline-SMOTE, and random undersampling (RUS). Sensitivity, specificity, precision, area under the receiver operating characteristic curve (AUROC), F-score, positive predictive value (PPV), negative predictive value (NPV), and accuracy of the three boosting-based ensemble methods were used to evaluate their discriminative ability. The ensemble learning models tested include adaptive boosting (AdaBoost), gradient boosting (GB), and the extreme gradient boosting method (XGBoost). RESULTS: The prevalence of low (<7) Apgar scores was 9.5% (n = 733). The prediction models performed nearly identically in their baseline mode. Following the application of resampling techniques, borderline-SMOTE significantly improved the predictive performance of all the boosting-based ensemble methods under observation in terms of sensitivity, F1-score, AUROC, and PPV. CONCLUSION: Policymakers, healthcare informaticians, and neonatologists should consider implementing data preprocessing strategies when predicting a neonatal outcome from imbalanced data to enhance efficiency. The process may be more effective when the borderline-SMOTE technique is deployed on the selected ensemble classifiers. However, future research may focus on testing additional resampling techniques, performing feature engineering and variable selection, and further optimizing the ensemble learning hyperparameters.
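A minimal sketch of the best-performing combination reported above, borderline-SMOTE preprocessing followed by a gradient-boosting classifier, on synthetic data with a roughly matching class imbalance:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import BorderlineSMOTE

X, y = make_classification(n_samples=7716, weights=[0.905, 0.095],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_res, y_res = BorderlineSMOTE(random_state=0).fit_resample(X_tr, y_tr)
clf = GradientBoostingClassifier(random_state=0).fit(X_res, y_res)
print("F1:", f1_score(y_te, clf.predict(X_te)),
      "AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```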
