Results 1 - 20 of 406

1.
Biostatistics ; 25(2): 486-503, 2024 Apr 15.
Article in English | MEDLINE | ID: mdl-36797830

ABSTRACT

In prospective genomic studies (e.g., DNA methylation, metagenomics, and transcriptomics), it is crucial to estimate the overall fraction of phenotypic variance (OFPV) attributed to the high-dimensional genomic variables, a concept similar to heritability analyses in genome-wide association studies (GWAS). Unlike genetic variants in GWAS, these genomic variables are typically measured with error due to technical limitations and temporal instability. While the existing methods developed for GWAS can be used, ignoring measurement error may severely underestimate OFPV and mislead the design of future studies. Assuming that measurement error variances are distributed similarly between causal and noncausal variables, we show that the asymptotic attenuation factor equals the average intraclass correlation coefficient of all genomic variables, which can be estimated from a pilot study with repeated measurements. We illustrate the method by estimating the contribution of microbiome taxa to body mass index and multiple allergy traits in the American Gut Project. Finally, we show that measurement error does not cause meaningful bias when estimating the correlation of effect sizes for two traits.
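
As a quick illustration of the attenuation correction, the sketch below estimates each variable's intraclass correlation from two technical replicates (using the replicate correlation as a simple ICC estimator), averages these into the attenuation factor, and de-attenuates a naive OFPV estimate. All names and data are hypothetical; this is not the authors' implementation.

```python
import numpy as np

def deattenuate_ofpv(ofpv_naive, rep1, rep2):
    """De-attenuate a naive OFPV estimate using pilot replicates.

    rep1, rep2: (n_samples x p) repeated measurements of the same
    genomic variables; the replicate correlation serves as a simple
    per-variable ICC estimate.
    """
    p = rep1.shape[1]
    iccs = np.array([np.corrcoef(rep1[:, j], rep2[:, j])[0, 1]
                     for j in range(p)])
    attenuation = iccs.mean()        # asymptotic attenuation factor
    return ofpv_naive / attenuation

# Toy pilot study: true signal plus independent measurement error
rng = np.random.default_rng(0)
truth = rng.normal(size=(40, 200))
rep1 = truth + rng.normal(scale=0.7, size=truth.shape)
rep2 = truth + rng.normal(scale=0.7, size=truth.shape)
print(deattenuate_ofpv(0.20, rep1, rep2))   # corrected OFPV estimate
```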


Subjects
Genome-Wide Association Study, Genome, Humans, Genome-Wide Association Study/methods, Pilot Projects, Prospective Studies, Phenotype, Single Nucleotide Polymorphism
2.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37889118

ABSTRACT

Selecting informative features, such as accurate biomarkers for disease diagnosis, prognosis, and response to treatment, is an essential task in the field of bioinformatics. Medical data often contain thousands of features, and identifying potential biomarkers is challenging due to the small number of samples in the data, method dependence, and non-reproducibility. This paper proposes a novel ensemble feature selection method, named Filter and Wrapper Stacking Ensemble (FWSE), to identify reproducible biomarkers from high-dimensional omics data. In FWSE, filter feature selection methods are run on numerous subsets of the data to eliminate irrelevant features, and wrapper feature selection methods are then applied to rank the top features. The method was validated on four high-dimensional medical datasets related to mental illnesses and cancer. The results indicate that the features selected by FWSE are stable and statistically more significant than those obtained by existing methods, while also demonstrating biological relevance. Furthermore, FWSE is a generic method, applicable to various high-dimensional datasets in the fields of machine intelligence and bioinformatics.
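
A minimal sketch of the filter-then-wrapper idea (not the authors' FWSE code): a univariate filter is run on many bootstrap subsets, features surviving a majority of runs are retained, and a wrapper then ranks the survivors. The dataset and thresholds are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=10, random_state=0)

# Filter stage: run a univariate filter on many resampled subsets and
# keep features that survive in most subsets.
rng = np.random.default_rng(0)
votes = np.zeros(X.shape[1])
for _ in range(50):
    idx = rng.choice(len(y), size=len(y), replace=True)
    votes += SelectKBest(f_classif, k=50).fit(X[idx], y[idx]).get_support()
survivors = np.where(votes >= 25)[0]          # majority vote

# Wrapper stage: rank the surviving features by recursive elimination.
rfe = RFE(LogisticRegression(max_iter=1000),
          n_features_to_select=10).fit(X[:, survivors], y)
top_features = survivors[rfe.support_]
```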


Subjects
Mental Disorders, Neoplasms, Humans, Algorithms, Artificial Intelligence, Biomarkers, Neoplasms/diagnosis, Neoplasms/genetics
3.
BMC Bioinformatics ; 25(1): 57, 2024 Feb 05.
Article in English | MEDLINE | ID: mdl-38317067

ABSTRACT

BACKGROUND: Controlling the False Discovery Rate (FDR) in Multiple Comparison Procedures (MCPs) has widespread applications in many scientific fields. Previous studies show that the correlation structure between test statistics increases the variance and bias of the FDR. The objective of this study is to modify the effect of correlation in MCPs based on information theory. We propose three modified procedures (M1, M2, and M3), under strong, moderate, and mild assumptions, based on the conditional Fisher information of the consecutive sorted test statistics, for controlling the false discovery rate under an arbitrary correlation structure. The performance of the proposed procedures was compared with the Benjamini-Hochberg (BH) and Benjamini-Yekutieli (BY) procedures in a simulation study and on real high-dimensional colorectal cancer gene expression data. In the simulation study, we generated 1000 differential multivariate Gaussian features with different levels of correlation structure and screened the significant features with the FDR-controlling procedures, with strong control of the family-wise error rate. RESULTS: When there was no correlation between the 1000 simulated features, the performance of the BH procedure was similar to that of the three proposed procedures. In low-to-medium correlation structures, the BY procedure is too conservative and the BH procedure too liberal, with the mean number of screened features remaining constant across levels of correlation between features. The mean number of features screened by the proposed procedures was between those of the BY and BH procedures and decreased as the correlations increased. When the features were highly correlated, the number of features screened by the proposed procedures approached that of the Bonferroni (BF) procedure, as expected. In the real data analysis, the BY, BH, M1, M2, and M3 procedures were applied to screen gene expressions of colorectal cancer. To fit a predictive model based on the screened features, the Efficient Bayesian Logistic Regression (EBLR) model was used. The EBLR models fitted on the features screened by the M1 and M2 procedures have minimum entropy and are more efficient than those based on the BY and BH procedures. CONCLUSION: The modified procedures based on information theory are much more flexible than the BH and BY procedures with respect to the amount of correlation between test statistics. The modified procedures avoid screening non-informative features, so the number of screened features decreases as the level of correlation increases.
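
For reference, the two baseline procedures compared above can be written in a few lines; BY differs from BH only by the harmonic-sum correction that makes it valid under arbitrary dependence. The proposed M1-M3 procedures are not reproduced here.

```python
import numpy as np

def step_up_fdr(pvals, q=0.05, method="BH"):
    """Benjamini-Hochberg ("BH") or Benjamini-Yekutieli ("BY") step-up."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    c = np.sum(1.0 / np.arange(1, m + 1)) if method == "BY" else 1.0
    thresh = q * np.arange(1, m + 1) / (m * c)
    below = np.nonzero(p[order] <= thresh)[0]
    k = below.max() + 1 if below.size else 0   # largest i with p_(i) <= iq/(mc)
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                   # reject the k smallest p-values
    return reject
```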


Subjects
Colorectal Neoplasms, Information Theory, Humans, Bayes Theorem, Genomics, Computer Simulation
4.
BMC Genomics ; 25(1): 152, 2024 Feb 07.
Article in English | MEDLINE | ID: mdl-38326768

ABSTRACT

BACKGROUND: The accurate prediction of genomic breeding values is central to genomic selection in both plant and animal breeding studies. Genomic prediction involves the use of thousands of molecular markers spanning the entire genome and therefore requires methods able to efficiently handle high-dimensional data. Not surprisingly, machine learning methods are becoming widely advocated and used in genomic prediction studies. These methods encompass different groups of supervised and unsupervised learning methods. Although several studies have compared the predictive performances of individual methods, studies comparing the predictive performance of different groups of methods are rare. Such studies, however, are crucial for identifying (i) groups of methods with superior genomic predictive performance and (ii) the merits and demerits of such groups of methods relative to each other and to the established classical methods. Here, we comparatively evaluate the genomic predictive performance, and informally assess the computational cost, of several groups of supervised machine learning methods, specifically regularized regression methods and deep, ensemble, and instance-based learning algorithms, using one simulated animal breeding dataset and three empirical maize breeding datasets obtained from a commercial breeding program. RESULTS: Our results show that the relative predictive performance and computational expense of the groups of machine learning methods depend on both the data and the target traits, and that for classical regularized methods, increasing model complexity can incur huge computational costs without necessarily improving predictive accuracy. Thus, despite their greater complexity and computational burden, neither the adaptive nor the group regularized methods clearly improved upon the results of their simple regularized counterparts. This rules out recommending any single machine learning procedure for routine use in genomic prediction. The results also show that, because of their competitive predictive performance, computational efficiency, and simplicity (and hence relatively few tuning parameters), the classical linear mixed model and regularized regression methods are likely to remain strong contenders for genomic prediction. CONCLUSIONS: The dependence of predictive performance and computational burden on the target datasets and traits calls for increased investment in enhancing the computational efficiency of machine learning algorithms and in computing resources.
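
As a concrete instance of the regularized-regression group discussed above, the sketch below fits a ridge regression on simulated allele dosages (ridge on markers is closely related to GBLUP). Dimensions, effect sizes, and noise level are arbitrary placeholders, not the paper's data.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, p = 300, 2000                                        # lines x SNP markers
X = rng.choice([0, 1, 2], size=(n, p)).astype(float)    # allele dosages
beta = np.zeros(p)
beta[:50] = rng.normal(size=50)                         # 50 causal markers
y = X @ beta + rng.normal(scale=5.0, size=n)            # simulated phenotype

# Genomic prediction accuracy of a regularized (ridge) model
model = RidgeCV(alphas=np.logspace(0, 5, 20))
acc = cross_val_score(model, X, y, cv=5, scoring="r2")
print("mean CV R^2:", acc.mean().round(3))
```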


Subjects
Deep Learning, Animals, Plant Breeding, Genome, Genomics/methods, Machine Learning
5.
Biostatistics ; 24(2): 327-344, 2023 04 14.
Article in English | MEDLINE | ID: mdl-34165151

ABSTRACT

The existing cross-validated risk scores (CVRS) design has been proposed for developing and testing the efficacy of a treatment in a high-efficacy patient group (the sensitive group) using high-dimensional data (such as genetic data). The design is based on computing a risk score for each patient and dividing the patients into clusters using a nonparametric clustering procedure. In some settings, it is desirable to consider the tradeoff between two outcomes, such as efficacy and toxicity, or cost and effectiveness. With this motivation, we extend the CVRS design to consider two outcomes (CVRS2). The design employs bivariate risk scores that are divided into clusters. We assess the properties of CVRS2 using simulated data and illustrate its application on a randomized psychiatry trial. We show that CVRS2 is able to reliably identify the sensitive group (the group for which the new treatment provides benefit on both outcomes) in the simulated data. We apply the CVRS2 design to a psychology clinical trial that had offender status and substance use status as its two outcomes and collected a large number of baseline covariates. The CVRS2 design yields a significant treatment effect for both outcomes, whereas the CVRS approach identified a significant effect for offender status only after prefiltering the covariates.
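
The bivariate extension can be sketched as follows: an out-of-fold (cross-validated) risk score is computed for each outcome, and patients are then clustered on the score pairs. K-means stands in for the paper's nonparametric clustering procedure, and the data are simulated.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_predict
from sklearn.cluster import KMeans

# X: baseline covariates; y1, y2: the two outcomes (e.g., efficacy, toxicity)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))
y1 = X[:, 0] + rng.normal(size=200)
y2 = X[:, 1] + rng.normal(size=200)

# Cross-validated risk score per outcome: out-of-fold predictions
score1 = cross_val_predict(LassoCV(cv=5), X, y1, cv=10)
score2 = cross_val_predict(LassoCV(cv=5), X, y2, cv=10)

# Cluster patients on the bivariate risk score
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    np.column_stack([score1, score2]))
```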


Subjects
Clinical Trials as Topic, Research Design, Humans, Risk Factors
6.
Biostatistics ; 24(4): 1085-1105, 2023 10 18.
Article in English | MEDLINE | ID: mdl-35861622

ABSTRACT

An endeavor central to precision medicine is predictive biomarker discovery: these biomarkers define patient subpopulations that stand to benefit most, or least, from a given treatment. The identification of such biomarkers is often the byproduct of the related but fundamentally different task of treatment rule estimation. Using treatment rule estimation methods to identify predictive biomarkers in clinical trials where the number of covariates exceeds the number of participants often results in high false discovery rates. The higher-than-expected number of false positives translates to wasted resources when conducting follow-up experiments for drug target identification and diagnostic assay development, and patient outcomes are in turn negatively affected. We propose a variable importance parameter for directly assessing the importance of potentially predictive biomarkers and develop a flexible nonparametric inference procedure for this estimand. We prove that our estimator is doubly robust and asymptotically linear under loose conditions on the data-generating process, permitting valid inference about the importance metric. The statistical guarantees of the method are verified in a thorough simulation study representative of randomized controlled trials with moderate and high-dimensional covariate vectors. Our procedure is then used to discover predictive biomarkers from among the tumor gene expression data of metastatic renal cell carcinoma patients enrolled in recently completed clinical trials. We find that our approach more readily discerns predictive from nonpredictive biomarkers than procedures whose primary purpose is treatment rule estimation. An open-source software implementation of the methodology, the uniCATE R package, is briefly introduced.
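
A rough schematic of variable importance for predictive biomarkers, loosely in the spirit of the approach above (not the uniCATE estimator itself, and without its cross-fitting or inference): estimate arm-specific outcome regressions, form a plug-in treatment contrast, and score each biomarker by its univariate association with that contrast. All data and sizes are invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.normal(size=(n, p))
A = rng.integers(0, 2, size=n)                     # randomized arm
y = X[:, 0] * A + X[:, 1] + rng.normal(size=n)     # X[:, 0] is predictive

# Nuisance outcome models per arm (machine learning, as in the paper)
mu1 = RandomForestRegressor(random_state=0).fit(X[A == 1], y[A == 1])
mu0 = RandomForestRegressor(random_state=0).fit(X[A == 0], y[A == 0])
contrast = mu1.predict(X) - mu0.predict(X)         # plug-in CATE estimate

# Univariate importance: slope of each biomarker against the contrast
imp = np.array([np.cov(X[:, j], contrast)[0, 1] / np.var(X[:, j])
                for j in range(p)])
print("top candidates:", np.argsort(-np.abs(imp))[:5])
```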


Subjects
Biomedical Research, Renal Cell Carcinoma, Kidney Neoplasms, Humans, Renal Cell Carcinoma/diagnosis, Renal Cell Carcinoma/genetics, Kidney Neoplasms/diagnosis, Kidney Neoplasms/genetics, Biomarkers, Computer Simulation
7.
Biostatistics ; 24(4): 985-999, 2023 10 18.
Article in English | MEDLINE | ID: mdl-35791753

ABSTRACT

When evaluating the effectiveness of a treatment, policy, or intervention, the desired measure of efficacy may be expensive to collect, not routinely available, or may take a long time to observe. In these cases, it is sometimes possible to identify a surrogate outcome that can capture the effect of interest more easily, quickly, or cheaply. Theory and methods for evaluating the strength of surrogate markers have been well studied in the context of a single surrogate marker measured in the course of a randomized clinical study. However, methods are lacking for quantifying the utility of surrogate markers as the dimension of the surrogate grows. We propose a robust and efficient method for evaluating a set of surrogate markers that may be high-dimensional. Our method does not require treatment to be randomized and may be used in observational studies. Our approach draws on a connection between quantifying the utility of a surrogate marker and the most fundamental tools of causal inference, namely methods for robust estimation of the average treatment effect. This connection facilitates the use of modern methods for estimating treatment effects, using machine learning to estimate nuisance functions and relaxing the dependence on model specification. We demonstrate that our proposed approach performs well, establish connections between our approach and certain mediation effects, and illustrate it by evaluating whether gene expression can be used as a surrogate for immune activation in an Ebola study.
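
A standard augmented inverse-probability-weighted (AIPW) estimator of the average treatment effect, with machine-learned nuisance functions, illustrates the causal-inference machinery the method builds on. This is generic textbook AIPW, not the paper's surrogate-evaluation procedure; quantifying surrogate utility additionally involves comparing the total effect with the part transmitted through the surrogates, which is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor

def aipw_ate(X, a, y):
    """Doubly robust (AIPW) average treatment effect estimate.

    X: covariates (n x p); a: binary treatment indicator; y: outcome.
    """
    ps = LogisticRegression(max_iter=1000).fit(X, a).predict_proba(X)[:, 1]
    mu1 = GradientBoostingRegressor().fit(X[a == 1], y[a == 1]).predict(X)
    mu0 = GradientBoostingRegressor().fit(X[a == 0], y[a == 0]).predict(X)
    return np.mean(mu1 - mu0
                   + a * (y - mu1) / ps
                   - (1 - a) * (y - mu0) / (1 - ps))
```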


Subjects
Statistical Models, Humans, Biomarkers, Causality, Computer Simulation
8.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34498681

ABSTRACT

Feature selection is crucial for the analysis of high-dimensional data, but benchmark studies for data with a survival outcome are rare. We compare 14 filter methods for feature selection based on 11 high-dimensional gene expression survival datasets. The aim is to provide guidance on the choice of filter methods for other researchers and practitioners. We analyze the accuracy of predictive models that employ the features selected by the filter methods. We also consider the run time, the number of selected features needed to fit models with high predictive accuracy, and the feature selection stability. We conclude that the simple variance filter outperforms all other considered filter methods. This filter selects the features with the largest variance and does not take the survival outcome into account. We also identify the correlation-adjusted regression scores filter as a more elaborate alternative that allows fitting models with similar predictive accuracy. Additionally, we investigate the filter methods based on their feature rankings, finding groups of similar filters.
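
The winning variance filter is deliberately simple, as the sketch below shows: rank features by variance, ignore the outcome entirely, and fit a survival model on the top k. Data and k are illustrative, and lifelines' penalized Cox model stands in for the prediction models benchmarked in the paper.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Simulated expression matrix with heterogeneous feature variances
rng = np.random.default_rng(0)
X = rng.normal(scale=rng.uniform(0.5, 3.0, size=500), size=(150, 500))
time = rng.exponential(10, size=150)
event = rng.integers(0, 2, size=150)

# Variance filter: keep the k highest-variance features (outcome unused)
k = 20
keep = np.argsort(X.var(axis=0))[-k:]

df = pd.DataFrame(X[:, keep], columns=[f"g{j}" for j in keep])
df["time"], df["event"] = time, event
CoxPHFitter(penalizer=0.1).fit(df, duration_col="time", event_col="event")
```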


Subjects
Algorithms, Benchmarking, Gene Expression
9.
Brief Bioinform ; 23(4)2022 07 18.
Article in English | MEDLINE | ID: mdl-35667004

ABSTRACT

In recent work, researchers have paid considerable attention to the estimation of causal effects in observational studies with a large number of covariates, which makes the unconfoundedness assumption plausible. In this paper, we review propensity score (PS) methods developed for high-dimensional settings and broadly group them into model-based methods, which extend models for prediction to causal inference, and balance-based methods, which incorporate covariate balancing constraints. We conducted systematic simulation experiments to evaluate these two types of methods and studied whether the use of balancing constraints further improves estimation performance. The comparison methods were post-double-selection (PDS), double-index PS (DiPS), outcome-adaptive LASSO (OAL), group LASSO and doubly robust estimation (GLiDeR), high-dimensional covariate balancing PS (hdCBPS), regularized calibrated estimators (RCAL), and the approximate residual balancing method (balanceHD). Among the four model-based methods, simulation studies showed that GLiDeR was the most stable approach, with high estimation accuracy and precision, followed by PDS, OAL, and DiPS. Among the balance-based methods, hdCBPS performed similarly to GLiDeR in terms of accuracy and outperformed balanceHD and RCAL. These findings imply that PS methods do not benefit appreciably from covariate balancing constraints in high-dimensional settings. In conclusion, we recommend the preferential use of the GLiDeR and hdCBPS approaches for estimating causal effects in high-dimensional settings; however, further studies on the construction of valid confidence intervals are required.
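
Among the model-based methods, post-double-selection (PDS) is the easiest to sketch: take the union of covariates selected by lassos for the outcome and for the treatment, then regress the outcome on the treatment plus that union. Using a linear lasso for a binary treatment is a simplification made here for brevity.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def post_double_selection(X, a, y):
    """PDS sketch: union of covariates selected by lassos for the
    outcome and the treatment, then OLS of y on (a, selected X)."""
    s_out = np.abs(LassoCV(cv=5).fit(X, y).coef_) > 1e-8
    s_trt = np.abs(LassoCV(cv=5).fit(X, a).coef_) > 1e-8
    keep = s_out | s_trt
    Z = np.column_stack([a, X[:, keep]])
    fit = LinearRegression().fit(Z, y)
    return fit.coef_[0]          # treatment effect estimate
```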


Subjects
Statistical Models, Causality, Computer Simulation, Propensity Score
10.
Metabolomics ; 20(2): 35, 2024 Mar 05.
Article in English | MEDLINE | ID: mdl-38441696

ABSTRACT

INTRODUCTION: Longitudinal biomarkers in patients with community-acquired pneumonia (CAP) may help in monitoring disease progression and treatment response. The metabolic host response is a potential source of such biomarkers, since it closely tracks the current health status of the patient. OBJECTIVES: In this study, we performed longitudinal profiling of a comprehensive range of metabolites in patients with CAP to identify potential host-response biomarkers. METHODS: Previously collected serum samples from CAP patients with confirmed Streptococcus pneumoniae infection (n = 25) were used. Samples were collected at multiple time points, up to 30 days after admission. A wide range of metabolites was measured, including amines, acylcarnitines, organic acids, and lipids. The associations between metabolites and C-reactive protein (CRP), procalcitonin, the CURB disease severity score at admission, and total length of stay were evaluated. RESULTS: Distinct longitudinal metabolite profiles were identified, including cholesteryl esters, diacyl-phosphatidylethanolamines, diacylglycerols, lysophosphatidylcholines, sphingomyelins, and triglycerides. Positive correlations were found between CRP and phosphatidylcholine (34:1) (cor = 0.63), and negative correlations between CRP and nine lysophosphocholines (cor = -0.57 to -0.74). The CURB disease severity score was negatively associated with six metabolites, including acylcarnitines (tau = -0.64 to -0.58). Negative correlations were found between length of stay and six triglycerides (TGs), especially TGs (60:3) and (58:2) (cor = -0.63 and -0.61). CONCLUSION: The identified metabolites may provide insight into the biological mechanisms underlying disease severity and may be of interest for exploration as potential treatment-response monitoring biomarkers.
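
The correlation screens reported above are straightforward to reproduce in outline: Pearson correlation for the continuous CRP values and Kendall's tau for the ordinal CURB score. Array names are hypothetical and the data below are simulated placeholders for the study's measurements.

```python
import numpy as np
from scipy.stats import pearsonr, kendalltau

# mets: samples x metabolites matrix of log-scaled metabolite levels;
# crp, curb: clinical variables aligned on the same samples
rng = np.random.default_rng(0)
mets = rng.normal(size=(25, 100))
crp = rng.normal(size=25)
curb = rng.integers(0, 4, size=25)

pearson_r = np.array([pearsonr(mets[:, j], crp)[0]
                      for j in range(mets.shape[1])])
# Kendall's tau suits the ordinal CURB severity score
taus = np.array([kendalltau(mets[:, j], curb)[0]
                 for j in range(mets.shape[1])])
```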


Subjects
Pneumonia, Streptococcus pneumoniae, Humans, Metabolomics, C-Reactive Protein, Biomarkers, Triglycerides
11.
Biometrics ; 80(3)2024 Jul 01.
Article in English | MEDLINE | ID: mdl-39248122

ABSTRACT

The geometric median, which is applicable to high-dimensional data, can be viewed as a generalization of the univariate median used in 1-dimensional data. It can be used as a robust estimator for identifying the location of multi-dimensional data and has a wide range of applications in real-world scenarios. This paper explores the problem of high-dimensional multivariate analysis of variance (MANOVA) using the geometric median. A maximum-type statistic that relies on the differences between the geometric medians among various groups is introduced. The distribution of the new test statistic is derived under the null hypothesis using Gaussian approximations, and its consistency under the alternative hypothesis is established. To approximate the distribution of the new statistic in high dimensions, a wild bootstrap algorithm is proposed and theoretically justified. Through simulation studies conducted across a variety of dimensions, sample sizes, and data-generating models, we demonstrate the finite-sample performance of our geometric median-based MANOVA method. Additionally, we implement the proposed approach to analyze a breast cancer gene expression dataset.
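
The building blocks are easy to sketch: Weiszfeld's algorithm computes the geometric median, and a max-type statistic compares group medians coordinate-wise. Calibration by the paper's Gaussian approximation and wild bootstrap is omitted here.

```python
import numpy as np

def geometric_median(X, n_iter=100, eps=1e-8):
    """Weiszfeld's algorithm for the geometric median of the rows of X."""
    m = X.mean(axis=0)
    for _ in range(n_iter):
        d = np.linalg.norm(X - m, axis=1)
        w = 1.0 / np.maximum(d, eps)           # inverse-distance weights
        m_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < eps:
            break
        m = m_new
    return m

def max_stat(X1, X2):
    """Max-type statistic on the difference of group geometric medians."""
    return np.max(np.abs(geometric_median(X1) - geometric_median(X2)))
```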


Subjects
Algorithms, Breast Neoplasms, Computer Simulation, Humans, Multivariate Analysis, Breast Neoplasms/genetics, Statistical Models, Female, Statistical Data Interpretation, Gene Expression Profiling/statistics & numerical data, Sample Size, Biometry/methods
12.
Biometrics ; 80(1)2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38465987

ABSTRACT

High-dimensional datasets are common in genome-enabled prediction. Such datasets involve nonlinear relationships with complex dependence structures, for which vine copula-based (quantile) regression is an important tool. However, current vine copula-based regression approaches do not scale to high and ultra-high dimensions. To perform high-dimensional sparse vine copula-based regression, we propose two methods. First, we show their superiority over existing methods in terms of computational complexity. Second, we define relevant, irrelevant, and redundant explanatory variables for quantile regression. We then show our methods' power in selecting relevant variables, and their prediction accuracy, in high-dimensional sparse datasets via simulation studies. Next, we apply the proposed methods to high-dimensional real data, aiming at the genomic prediction of maize traits; some data processing and feature extraction steps for the real data are also discussed. Finally, we show the advantage of our methods over linear models and quantile regression forests in both the simulation studies and the real data applications.
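
Vine copula regression itself is not available in mainstream Python libraries, so the sketch below shows only the linear quantile-regression baseline of the kind the paper compares against, on a heteroscedastic toy example; the proposed methods would replace this linear specification with a flexible copula-based conditional quantile model.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.5 * np.abs(X[:, 1]) * rng.normal(size=200)  # heteroscedastic

# Median (q = 0.5) and upper-quantile fits of a linear quantile model
Xc = sm.add_constant(X)
for q in (0.5, 0.9):
    res = sm.QuantReg(y, Xc).fit(q=q)
    print(q, res.params.round(2))
```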


Subjects
Genome, Genomics, Genomics/methods, Computer Simulation, Linear Models, Phenotype
13.
Stat Med ; 43(19): 3633-3648, 2024 Aug 30.
Article in English | MEDLINE | ID: mdl-38885953

ABSTRACT

Recent advances in engineering technologies have enabled the collection of a large number of longitudinal features. This wealth of information presents unique opportunities for researchers to investigate the complex nature of diseases and uncover underlying disease mechanisms. However, such data can be difficult to analyze due to their high dimensionality and heterogeneity and the associated computational challenges. In this article, we propose a Bayesian nonparametric mixture model for clustering high-dimensional, mixed-type (e.g., continuous, discrete, and categorical) longitudinal features. We employ a sparse factor model on the joint distribution of the random effects, and the key idea is to induce clustering at the latent factor level, instead of on the original data, to escape the curse of dimensionality. The number of clusters is estimated through a Dirichlet process prior. An efficient Gibbs sampler is developed to estimate the posterior distribution of the model parameters. Analyses of real and simulated data are presented and discussed. Our study demonstrates that the proposed model serves as a useful analytical tool for clustering high-dimensional longitudinal data.
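
The key idea, clustering at the latent factor level under a Dirichlet process prior, can be approximated off-the-shelf: reduce to latent factors, then fit a truncated Dirichlet process Gaussian mixture. This sketch ignores the longitudinal random-effects structure and mixed data types of the actual model and uses placeholder data.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
Y = rng.normal(size=(300, 60))         # stand-in for per-subject summaries

# Cluster at the latent factor level, not in the original 60-dim space
Z = FactorAnalysis(n_components=5, random_state=0).fit_transform(Y)
dp = BayesianGaussianMixture(
    n_components=20,                   # truncation level of the DP
    weight_concentration_prior_type="dirichlet_process",
    random_state=0).fit(Z)
labels = dp.predict(Z)
print("clusters used:", np.unique(labels).size)
```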


Subjects
Bayes Theorem, Statistical Models, Longitudinal Studies, Cluster Analysis, Humans, Computer Simulation
14.
Stat Med ; 43(1): 1-15, 2024 01 15.
Article in English | MEDLINE | ID: mdl-37875428

ABSTRACT

Wide heterogeneity exists in cancer patients' survival, ranging from a few months to several decades. To accurately predict clinical outcomes, it is vital to build an accurate predictive model that relates patients' molecular profiles to their survival. Given the complex relationships between survival and high-dimensional molecular predictors, it is challenging to conduct nonparametric modeling and remove irrelevant predictors simultaneously. In this article, we build a kernel Cox proportional hazards semiparametric model and propose a novel regularized garrotized kernel machine (RegGKM) method to fit it. We use the kernel machine method to describe the complex relationship between survival and predictors, while automatically removing irrelevant parametric and nonparametric predictors through a LASSO penalty. An efficient high-dimensional algorithm is developed for the proposed method. Comparisons with other competing methods in simulation show that the proposed method always has better predictive accuracy. We apply this method to a multiple myeloma dataset to predict patients' death burden from their gene expressions. Our results can help classify patients into groups with different death risks, facilitating treatment for better clinical outcomes.
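
RegGKM has no standard library implementation; the sketch below conveys the two ingredients with stand-ins: an RBF kernel feature map (Nystroem approximation) followed by a lasso-penalized Cox fit. It is an analogy under those substitutions, not the authors' garrotized kernel estimator.

```python
import numpy as np
import pandas as pd
from sklearn.kernel_approximation import Nystroem
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))
time = rng.exponential(5, size=200)
event = rng.integers(0, 2, size=200)

# Kernel machine part: RBF feature map via the Nystroem approximation ...
K = Nystroem(kernel="rbf", n_components=50, random_state=0).fit_transform(X)
df = pd.DataFrame(K, columns=[f"k{j}" for j in range(K.shape[1])])
df["time"], df["event"] = time, event

# ... combined with a lasso-type penalty on the Cox partial likelihood
CoxPHFitter(penalizer=0.1, l1_ratio=1.0).fit(
    df, duration_col="time", event_col="event")
```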


Subjects
Algorithms, Neoplasms, Humans, Linear Models, Proportional Hazards Models, Computer Simulation, Neoplasms/genetics
15.
Stat Med ; 43(17): 3164-3183, 2024 Jul 30.
Article in English | MEDLINE | ID: mdl-38807296

ABSTRACT

Cox models with time-dependent coefficients and covariates are widely used in survival analysis. In high-dimensional settings, sparse regularization techniques are employed for variable selection, but existing methods for time-dependent Cox models lack flexibility in enforcing specific sparsity patterns (i.e., covariate structures). We propose a flexible framework for variable selection in time-dependent Cox models that accommodates complex selection rules. Our method can adapt to arbitrary grouping structures, including interaction selection and temporal, spatial, tree, and directed acyclic graph structures, and it achieves accurate estimation with low false alarm rates. We develop the sox package, which implements a network flow algorithm for efficiently solving models with complex covariate structures. sox offers a user-friendly interface for specifying grouping structures and delivers fast computation. Through examples, including a case study on identifying predictors of time to all-cause death in atrial fibrillation patients, we demonstrate the practical application of our method with specific selection rules.
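
The sox package is an R implementation; as background, the long (start, stop] data format that any time-dependent-covariate Cox model requires looks like this in Python (toy rows, illustrative column names). The structured-sparsity selection itself, solved by sox's network flow algorithm, is beyond a short sketch.

```python
import pandas as pd
from lifelines import CoxTimeVaryingFitter

# One row per (subject, interval); "drug" changes value over time
df = pd.DataFrame({
    "id":    [1, 1, 2, 2, 3, 3],
    "start": [0, 3, 0, 5, 0, 2],
    "stop":  [3, 8, 5, 9, 2, 7],
    "event": [0, 1, 0, 1, 0, 0],
    "drug":  [0, 1, 1, 1, 0, 0],   # time-varying covariate
    "age":   [70, 70, 64, 64, 58, 58],
})

ctv = CoxTimeVaryingFitter(penalizer=0.5)
ctv.fit(df, id_col="id", start_col="start", stop_col="stop",
        event_col="event")
ctv.print_summary()
```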


Subjects
Algorithms, Proportional Hazards Models, Humans, Survival Analysis, Atrial Fibrillation, Time Factors, Computer Simulation
16.
BMC Med Inform Decis Mak ; 24(1): 120, 2024 May 07.
Article in English | MEDLINE | ID: mdl-38715002

ABSTRACT

In recent times, time-to-event data, such as time to failure or death, are routinely collected alongside high-throughput covariates. These high-dimensional bioinformatics data often challenge classical survival models, which are either infeasible to fit or produce low prediction accuracy due to overfitting. To address this issue, the focus has shifted towards novel approaches for feature selection and survival prediction. In this article, we propose a new hybrid feature selection approach that handles high-dimensional bioinformatics datasets for improved survival prediction. This study explores the efficacy of four distinct variable selection techniques, LASSO, RSF-vs, SCAD, and CoxBoost, in the context of non-parametric biomedical survival prediction. Leveraging these methods, we conducted comprehensive variable selection processes. Subsequently, survival analysis models (specifically CoxPH, RSF, and the DeepHit neural network) were employed to construct predictive models based on the selected variables. Furthermore, we introduce a novel approach wherein only variables consistently selected by a majority of the aforementioned feature selection techniques are considered. This strategy, referred to as the proposed method, aims to enhance the reliability and robustness of variable selection and thereby improve the predictive performance of the survival analysis models. To evaluate its effectiveness, we compare the performance of the proposed approach with the existing LASSO, RSF-vs, SCAD, and CoxBoost techniques using various performance metrics, including the integrated Brier score (IBS), concordance index (C-index), and integrated absolute error (IAE), on numerous high-dimensional survival datasets. The real data applications reveal that the proposed method outperforms the competing methods in terms of survival prediction accuracy.
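
The proposed majority-vote idea is easy to state in code: run several selectors, count how often each feature is chosen, and keep the features picked by most of them. The sketch below substitutes accessible regression selectors (lasso, random forest importance, univariate F-test) for the paper's survival-specific LASSO/RSF-vs/SCAD/CoxBoost.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))
y = X[:, :5].sum(axis=1) + rng.normal(size=100)

k = 25
sel = []
sel.append(np.argsort(-np.abs(LassoCV(cv=5).fit(X, y).coef_))[:k])
sel.append(np.argsort(-RandomForestRegressor(random_state=0).fit(X, y)
                      .feature_importances_)[:k])
sel.append(np.argsort(f_regression(X, y)[1])[:k])   # smallest p-values

# Keep only features chosen by a majority (here, 2 of 3) of the selectors
votes = np.bincount(np.concatenate(sel), minlength=X.shape[1])
selected = np.where(votes >= 2)[0]
```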


Subjects
Neural Networks (Computer), Humans, Survival Analysis, Nonparametric Statistics, Computational Biology/methods
17.
Biom J ; 66(1): e2200207, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37421205

ABSTRACT

Variable selection methods based on L0 penalties have excellent theoretical properties for selecting sparse models in a high-dimensional setting. There exist modifications of the Bayesian Information Criterion (BIC) which control either the familywise error rate (mBIC) or the false discovery rate (mBIC2) in terms of which regressors are selected to enter a model. However, the minimization of L0 penalties is a mixed-integer problem, which is known to be NP-hard and therefore becomes computationally challenging with increasing numbers of regressor variables. This is one reason why alternatives like the LASSO, which involve easier-to-solve convex optimization problems, have become so popular. The last few years have seen real progress in developing new algorithms to minimize L0 penalties. The aim of this article is to compare the performance of these algorithms in terms of minimizing L0-based selection criteria. Simulation studies covering a wide range of scenarios inspired by genetic association studies are used to compare the values of the selection criteria obtained with different algorithms. In addition, some statistical characteristics of the selected models and the runtimes of the algorithms are compared. Finally, the performance of the algorithms is illustrated in a real data example concerned with expression quantitative trait loci (eQTL) mapping.
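
Since exact L0 minimization is NP-hard, simple surrogates such as greedy forward selection are a natural reference point for the specialized algorithms compared in the paper. The sketch below scores candidate models with one common form of the mBIC penalty; the exact constant in the penalty varies across the literature, so treat the formula as illustrative.

```python
import numpy as np

def mbic(X, y, subset):
    """mBIC for a linear model on the given subset (one common form:
    n*log(RSS/n) + k*log(n) + 2k*log(p/4))."""
    n, p = X.shape
    k = len(subset)
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    rss = np.sum((y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]) ** 2)
    return n * np.log(rss / n) + k * np.log(n) + 2 * k * np.log(p / 4)

def greedy_l0(X, y, max_k=20):
    """Greedy (stepwise) surrogate for the NP-hard L0 minimization."""
    subset, best = [], mbic(X, y, [])
    for _ in range(max_k):
        cands = [j for j in range(X.shape[1]) if j not in subset]
        scores = [mbic(X, y, subset + [j]) for j in cands]
        if min(scores) >= best:
            break
        subset.append(cands[int(np.argmin(scores))])
        best = min(scores)
    return subset

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.normal(size=100)
print(greedy_l0(X, y))   # should recover features 0, 1, 2
```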


Subjects
Algorithms, Quantitative Trait Loci, Bayes Theorem, Computer Simulation
18.
Behav Res Methods ; 56(3): 1715-1737, 2024 Mar.
Article in English | MEDLINE | ID: mdl-37540467

ABSTRACT

Multiple Imputation (MI) is one of the most popular approaches to addressing missing values in questionnaires and surveys. MI with multivariate imputation by chained equations (MICE) allows flexible imputation of many types of data. In MICE, for each variable under imputation, the imputer needs to specify which variables should act as predictors in the imputation model. The selection of these predictors is a difficult but fundamental step in the MI procedure, especially when there are many variables in a dataset. In this project, we explore the use of principal component regression (PCR) as a univariate imputation method in the MICE algorithm to automatically address the many-variables problem that arises when imputing large social science data. We compare different implementations of PCR-based MICE with a correlation-thresholding strategy through two Monte Carlo simulation studies and a case study. We find that using PCR on a variable-by-variable basis performs best, and that it can perform close to expertly designed imputation procedures.
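
In scikit-learn terms, PCR-based MICE can be approximated by giving IterativeImputer a PCA-plus-regression pipeline as its univariate estimator, so each variable is imputed from principal components of the remaining variables rather than from hand-picked predictors. This mirrors, rather than reproduces, the MICE implementations studied in the paper.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
X[rng.random(X.shape) < 0.2] = np.nan       # 20% of values missing

# Each variable is imputed from principal components of the others,
# sidestepping manual predictor selection
pcr = make_pipeline(PCA(n_components=10), LinearRegression())
imputer = IterativeImputer(estimator=pcr, max_iter=10, random_state=0)
X_imp = imputer.fit_transform(X)
```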


Subjects
Algorithms, Humans, Computer Simulation, Surveys and Questionnaires, Monte Carlo Method
19.
BMC Bioinformatics ; 24(1): 392, 2023 Oct 18.
Article in English | MEDLINE | ID: mdl-37853338

ABSTRACT

BACKGROUND: Feature selection is important in high-dimensional data analysis. The wrapper approach is one way to perform feature selection, but it is computationally intensive, as it builds and evaluates models for multiple subsets of features. Existing wrapper algorithms focus primarily on shortening the path to an optimal feature set. However, they underutilize the capability of feature subset models, which impacts feature selection and its predictive performance. METHOD AND RESULTS: This study proposes a novel Artificial Intelligence-based Wrapper (AIWrap) algorithm that integrates Artificial Intelligence (AI) with the existing wrapper algorithm. The algorithm develops a Performance Prediction Model using AI, which predicts the model performance of any feature set and allows the wrapper algorithm to evaluate feature subset performance without building the model. The algorithm can make the wrapper approach more relevant for high-dimensional data. We evaluate the performance of this algorithm using simulated studies and real research studies. AIWrap shows feature selection and model prediction performance that is better than, or on par with, standard penalized feature selection algorithms and wrapper algorithms. CONCLUSION: The AIWrap approach provides an alternative to existing feature selection algorithms. The current study focuses on applying AIWrap to continuous cross-sectional data; however, it could be applied to other datasets, such as longitudinal, categorical, and time-to-event biological data.
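
The core AIWrap idea, a meta-model that predicts how well a feature subset would perform so the wrapper can skip fitting most models, can be caricatured as follows. Subset encodings, the meta-learner, and all sizes are illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=200,
                       n_informative=10, random_state=0)
rng = np.random.default_rng(0)

# 1) Evaluate a modest number of random feature subsets the expensive way
masks = rng.random((60, X.shape[1])) < 0.1
scores = [cross_val_score(Ridge(), X[:, m], y, cv=3).mean() for m in masks]

# 2) Train a performance prediction model: subset mask -> expected CV score
meta = RandomForestRegressor(random_state=0).fit(masks.astype(float), scores)

# 3) Screen many candidate subsets without fitting any model on X
cands = rng.random((5000, X.shape[1])) < 0.1
best = cands[int(np.argmax(meta.predict(cands.astype(float))))]
```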


Subjects
Algorithms, Artificial Intelligence, Cross-Sectional Studies
20.
BMC Bioinformatics ; 24(1): 393, 2023 Oct 19.
Article in English | MEDLINE | ID: mdl-37858091

ABSTRACT

BACKGROUND: An important problem in toxicology in the context of gene expression data is the simultaneous inference of a large number of concentration-response relationships. The quality of the inference substantially depends on the design of the experiments, in particular on the set of concentrations at which observations are taken for the different genes under consideration. As this set has to be the same for all genes, the efficient planning of such experiments is very challenging. We address this problem by determining efficient designs for the simultaneous inference of a large number of concentration-response models. For that purpose, we construct both a D-optimality criterion for simultaneous inference and a K-means procedure that clusters the support points of the locally D-optimal designs of the individual models. RESULTS: We show that planning experiments to address the simultaneous inference of a large number of concentration-response relationships yields a substantially more accurate statistical analysis. In particular, we compare the performance of the constructed designs to that of other commonly used designs in terms of D-efficiencies and in terms of the quality of the resulting model fits, using a real data example dealing with valproic acid; for the quality comparison we perform an extensive simulation study. CONCLUSIONS: The design maximizing the D-optimality criterion for simultaneous inference improves the inference of the different concentration-response relationships substantially. The design based on the K-means procedure also performs well, whereas a log-equidistant design, which was also included in the analysis, performs poorly in terms of the quality of the simultaneous inference. Based on our findings, the D-optimal design for simultaneous inference should be used for upcoming analyses dealing with high-dimensional gene expression data.
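
The K-means step is the most portable part of the proposal: pool the support points of the individual locally D-optimal designs and cluster them into one shared set of concentrations. The support points below are invented placeholders; computing them in practice requires the per-gene concentration-response model fits.

```python
import numpy as np
from sklearn.cluster import KMeans

# Support points of locally D-optimal designs for individual
# concentration-response models (illustrative log-concentration values)
supports = np.concatenate([
    np.array([0.0, 1.2, 2.8]),        # gene/model 1
    np.array([0.1, 1.5, 3.0]),        # gene/model 2
    np.array([0.0, 0.9, 2.5, 3.1]),   # gene/model 3
])

# Cluster the pooled support points into one shared set of
# concentrations usable for all genes simultaneously
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(
    supports.reshape(-1, 1))
shared_design = np.sort(km.cluster_centers_.ravel())
print("shared concentrations:", shared_design.round(2))
```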


Subjects
Research Design, Computer Simulation