Results 1 - 20 of 393
1.
Front Genet ; 15: 1415249, 2024.
Article in English | MEDLINE | ID: mdl-38948357

ABSTRACT

In modern breeding practices, genomic prediction (GP) uses high-density single nucleotide polymorphism (SNP) markers to predict genomic estimated breeding values (GEBVs) for crucial phenotypes, thereby speeding up the selective breeding process and shortening generation intervals. However, because genotype data typically contain far fewer samples than SNP markers, overfitting commonly arises during model training. To address this, the present study builds upon the Least Squares Twin Support Vector Regression (LSTSVR) model by incorporating a Lasso regularization term, yielding a model named ILSTSVR. Because parameter tuning is complex across different datasets, the subtraction-average-based optimizer (SABO) is further introduced to optimize ILSTSVR, producing the GP model named SABO-ILSTSVR. Experiments conducted on four different crop datasets demonstrate that SABO-ILSTSVR outperforms, or is comparable in efficiency to, widely used genomic prediction methods. Source code and data are available at: https://github.com/MLBreeding/SABO-ILSTSVR.
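The abstract's central premise, regularized regression to control overfitting when samples are far fewer than markers, can be illustrated with a generic baseline. The sketch below fits ridge regression to simulated 0/1/2 genotypes; it is purely illustrative and is not the authors' SABO-ILSTSVR model (all data dimensions and hyperparameters are assumptions).

```python
# Hypothetical baseline: ridge-regularized genomic prediction on simulated SNPs.
# NOT the SABO-ILSTSVR model described above; dimensions and settings are made up.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_lines, n_snps = 200, 5000                      # far fewer samples than markers
X = rng.integers(0, 3, size=(n_lines, n_snps)).astype(float)   # 0/1/2 genotype codes
beta = np.zeros(n_snps)
beta[rng.choice(n_snps, 50, replace=False)] = rng.normal(0, 0.5, 50)  # 50 causal SNPs
y = X @ beta + rng.normal(0, 1.0, n_lines)       # simulated phenotype

model = RidgeCV(alphas=np.logspace(-2, 4, 20))   # L2 shrinkage combats overfitting
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("cross-validated R^2:", scores.mean().round(3))
```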

2.
Contemp Clin Trials ; 144: 107620, 2024 Jul 06.
Article in English | MEDLINE | ID: mdl-38977178

ABSTRACT

We propose a Cross-validated ADaptive ENrichment design (CADEN) in which a trial population is enriched with a subpopulation of patients who are predicted to benefit from the treatment more than the average patient (the sensitive group). This subpopulation is identified using a risk score constructed from baseline (potentially high-dimensional) patient information. The design incorporates an early stopping rule for futility. Simulation studies are used to assess the properties of CADEN against the original (non-enrichment) cross-validated risk scores (CVRS) design, which constructs a risk score at the end of the trial. We show that when a sensitive group of patients exists, CADEN achieves higher power and a reduction in the expected sample size compared with the CVRS design. We illustrate the application of the design in two real clinical trials. We conclude that the new design offers improved statistical efficiency over the existing non-enrichment method, as well as increased benefit to patients. The method has been implemented in the R package caden.
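A minimal, hypothetical sketch of the cross-validated risk-score idea underlying CADEN: out-of-fold predicted response among treated patients defines a "sensitive" subgroup. It is not the caden R package, and all simulated variables and thresholds are assumptions.

```python
# Simplified CVRS-style sketch (hypothetical): build cross-validated risk scores
# from baseline covariates and flag the subgroup predicted to benefit most.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n, p = 400, 200
X = rng.normal(size=(n, p))                 # baseline (high-dimensional) covariates
treat = rng.integers(0, 2, n)               # 1 = treatment, 0 = control
latent = X[:, 0] + 0.5 * X[:, 1]            # unknown driver of treatment benefit
prob = 1 / (1 + np.exp(-(-0.5 + treat * (latent > 0))))
y = rng.binomial(1, prob)                   # binary response

# Out-of-fold predicted response probability among treated patients
score = cross_val_predict(LogisticRegression(C=0.1, max_iter=1000),
                          X[treat == 1], y[treat == 1],
                          cv=5, method="predict_proba")[:, 1]
sensitive = score > score.mean()            # candidate subgroup for enrichment
print("sensitive-group fraction among treated:", sensitive.mean().round(2))
```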

3.
Biology (Basel) ; 13(7), 2024 Jul 09.
Article in English | MEDLINE | ID: mdl-39056705

ABSTRACT

Single-cell transcriptomics (scRNA-seq) is revolutionizing biological research, yet it faces challenges such as inefficient transcript capture and noise. To address these challenges, methods like neighbor averaging or graph diffusion are used. These methods often rely on k-nearest neighbor graphs from low-dimensional manifolds. However, scRNA-seq data suffer from the 'curse of dimensionality', leading to the over-smoothing of data when using imputation methods. To overcome this, sc-PHENIX employs a PCA-UMAP diffusion method, which enhances the preservation of data structures and allows for a refined use of PCA dimensions and diffusion parameters (e.g., k-nearest neighbors, exponentiation of the Markov matrix) to minimize noise introduction. This approach enables a more accurate construction of the exponentiated Markov matrix (cell neighborhood graph), surpassing methods like MAGIC. sc-PHENIX significantly mitigates over-smoothing, as validated through various scRNA-seq datasets, demonstrating improved cell phenotype representation. Applied to a multicellular tumor spheroid dataset, sc-PHENIX identified known extreme phenotype states, showcasing its effectiveness. sc-PHENIX is open-source and available for use and modification.
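The diffusion step the abstract describes (a row-stochastic kNN Markov matrix built on a low-dimensional embedding, then exponentiated to smooth the expression matrix) can be sketched as follows. This MAGIC-style toy uses PCA only and omits the UMAP refinement, so it is not sc-PHENIX itself; all dimensions and parameters are assumptions.

```python
# Hypothetical MAGIC-style diffusion sketch: kNN Markov matrix on a PCA embedding,
# exponentiated to neighbor-average the expression matrix. Not sc-PHENIX itself.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(2)
expr = rng.poisson(1.0, size=(500, 2000)).astype(float)    # cells x genes (noisy counts)

emb = PCA(n_components=30).fit_transform(np.log1p(expr))   # embedding for the graph
A = kneighbors_graph(emb, n_neighbors=15, mode="connectivity").toarray()
A = np.maximum(A, A.T)                                     # symmetrize the kNN graph
P = A / A.sum(axis=1, keepdims=True)                       # row-stochastic Markov matrix

t = 3                                                      # diffusion time (exponentiation)
expr_imputed = np.linalg.matrix_power(P, t) @ expr         # smoothed expression
print(expr_imputed.shape)
```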

4.
J Appl Stat ; 51(9): 1756-1771, 2024.
Article in English | MEDLINE | ID: mdl-38933137

ABSTRACT

In many biomedical applications, we are more interested in the predicted probability that a numerical outcome is above a threshold than in the predicted value of the outcome. For example, it might be known that antibody levels above a certain threshold provide immunity against a disease, or a threshold for a disease severity score might reflect conversion from the presymptomatic to the symptomatic disease stage. Accordingly, biomedical researchers often convert numerical to binary outcomes (loss of information) to conduct logistic regression (probabilistic interpretation). We address this bad statistical practice by modelling the binary outcome with logistic regression, modelling the numerical outcome with linear regression, transforming the predicted values from linear regression to predicted probabilities, and combining the predicted probabilities from logistic and linear regression. Analysing high-dimensional simulated and experimental data, namely clinical data for predicting cognitive impairment, we obtain significantly improved predictions of dichotomised outcomes. Thus, the proposed approach effectively combines binary with numerical outcomes to improve binary classification in high-dimensional settings. An implementation is available in the R package cornet on GitHub (https://github.com/rauschenberger/cornet) and CRAN (https://CRAN.R-project.org/package=cornet).
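A hedged sketch of the general idea of combining linear and logistic predictions for a dichotomised outcome: the linear fit's predictions are converted to probabilities of exceeding the cutoff (here via a Gaussian residual assumption) and averaged with the logistic-regression probabilities. The cornet package uses penalized regression and a tuned combination; this toy, its data, and the equal weights are assumptions.

```python
# Illustrative sketch only, not the cornet implementation.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(3)
n, p, cutoff = 300, 50, 1.0
X = rng.normal(size=(n, p))
y_num = X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 1, n)   # numerical outcome
y_bin = (y_num > cutoff).astype(int)                    # dichotomised outcome

lin = LinearRegression().fit(X, y_num)
sigma = np.std(y_num - lin.predict(X))                  # residual scale
p_lin = 1 - norm.cdf(cutoff, loc=lin.predict(X), scale=sigma)  # P(Y > cutoff) from linear fit

log = LogisticRegression(max_iter=1000).fit(X, y_bin)
p_log = log.predict_proba(X)[:, 1]

p_comb = 0.5 * (p_lin + p_log)                          # naive equal-weight combination
print("combined probabilities, first 5:", p_comb[:5].round(2))
```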

5.
Stat Med ; 43(19): 3633-3648, 2024 Aug 30.
Article in English | MEDLINE | ID: mdl-38885953

ABSTRACT

Recent advances in engineering technologies have enabled the collection of a large number of longitudinal features. This wealth of information presents unique opportunities for researchers to investigate the complex nature of diseases and uncover underlying disease mechanisms. However, analyzing such data can be difficult due to their high dimensionality and heterogeneity and the associated computational challenges. In this article, we propose a Bayesian nonparametric mixture model for clustering high-dimensional mixed-type (e.g., continuous, discrete, and categorical) longitudinal features. We employ a sparse factor model on the joint distribution of random effects, and the key idea is to induce clustering at the latent factor level, instead of on the original data, to escape the curse of dimensionality. The number of clusters is estimated through a Dirichlet process prior. An efficient Gibbs sampler is developed to estimate the posterior distribution of the model parameters. Analysis of real and simulated data is presented and discussed. Our study demonstrates that the proposed model serves as a useful analytical tool for clustering high-dimensional longitudinal data.
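A much simpler, cross-sectional analogue of the clustering idea (on hypothetical data): reduce to latent factors and cluster the factor scores with a truncated Dirichlet-process mixture. The paper's model instead handles mixed-type longitudinal data through random effects and a Gibbs sampler, none of which is reproduced here.

```python
# Cross-sectional toy analogue: cluster at the latent-factor level with a
# Dirichlet-process mixture so the number of clusters is inferred from the data.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 1, size=(100, 40)) for m in (-2, 0, 2)])  # 3 latent groups

Z = FactorAnalysis(n_components=5).fit_transform(X)   # latent factor scores
dpm = BayesianGaussianMixture(n_components=10,        # truncation level
                              weight_concentration_prior_type="dirichlet_process",
                              random_state=0).fit(Z)
labels = dpm.predict(Z)
print("clusters actually used:", np.unique(labels).size)
```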


Subjects
Bayes Theorem, Statistical Models, Longitudinal Studies, Cluster Analysis, Humans, Computer Simulation
6.
Methods Protoc ; 7(3), 2024 Apr 24.
Article in English | MEDLINE | ID: mdl-38804330

ABSTRACT

Robust data normalization and analysis are pivotal in biomedical research to ensure that observed differences in populations are directly attributable to the target variable, rather than disparities between control and study groups. ArsHive addresses this challenge using advanced algorithms to normalize populations (e.g., control and study groups) and perform statistical evaluations between demographic, clinical, and other variables within biomedical datasets, resulting in more balanced and unbiased analyses. The tool's functionality extends to comprehensive data reporting, which elucidates the effects of data processing, while maintaining dataset integrity. Additionally, ArsHive is complemented by A.D.A. (Autonomous Digital Assistant), which employs OpenAI's GPT-4 model to assist researchers with inquiries, enhancing the decision-making process. In this proof-of-concept study, we tested ArsHive on three different datasets derived from proprietary data, demonstrating its effectiveness in managing complex clinical and therapeutic information and highlighting its versatility for diverse research fields.

7.
Micromachines (Basel) ; 15(5), 2024 May 13.
Article in English | MEDLINE | ID: mdl-38793220

ABSTRACT

This paper pioneers a novel approach in electromagnetic (EM) system analysis by synergistically combining Bayesian Neural Networks (BNNs) informed by Latin Hypercube Sampling (LHS) with advanced thermal-mechanical surrogate modeling within COMSOL simulations for high-frequency low-pass filter modeling. Our methodology transcends traditional EM characterization by integrating physical dimension variability, thermal effects, mechanical deformation, and real-world operational conditions, thereby achieving a significant leap in predictive modeling fidelity. Through rigorous evaluation using Mean Squared Error (MSE), Maximum Learning Error (MLE), and Maximum Test Error (MTE) metrics, as well as comprehensive validation on unseen data, the model's robustness and generalization capability are demonstrated. This research challenges conventional methods, offering a nuanced understanding of multiphysical phenomena to enhance reliability and resilience in electronic component design and optimization. The integration of thermal variables alongside dimensional parameters marks a novel paradigm in filter performance analysis, significantly improving simulation accuracy. Our findings not only contribute to the body of knowledge in EM diagnostics and complex-environment analysis but also pave the way for future investigations into the fusion of machine learning with computational physics, promising transformative impacts across various applications, from telecommunications to medical devices.
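Only the sampling step is sketched here: Latin Hypercube Sampling of a few design and thermal parameters to build a training set for a surrogate model. The parameter names, bounds, and sample size are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of the LHS step only; the BNN surrogate itself is not shown.
import numpy as np
from scipy.stats import qmc

sampler = qmc.LatinHypercube(d=3, seed=0)          # e.g. trace width, gap, temperature
unit = sampler.random(n=64)                        # space-filling samples in [0, 1]^3
designs = qmc.scale(unit, l_bounds=[0.1, 0.05, 20.0], u_bounds=[0.5, 0.3, 120.0])

# Each row would be passed to the EM/thermal simulator (e.g. COMSOL) to obtain a
# response, and the (design, response) pairs used to train a Bayesian neural network.
print(designs[:3].round(3))
```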

8.
BMC Med Inform Decis Mak ; 24(1): 120, 2024 May 07.
Article in English | MEDLINE | ID: mdl-38715002

ABSTRACT

In recent times, time-to-event data, such as time to failure or death, are routinely collected alongside high-throughput covariates. These high-dimensional bioinformatics data often challenge classical survival models, which are either infeasible to fit or produce low prediction accuracy due to overfitting. To address this issue, the focus has shifted towards novel approaches for feature selection and survival prediction. In this article, we propose a new hybrid feature selection approach that handles high-dimensional bioinformatics datasets for improved survival prediction. This study explores the efficacy of four distinct variable selection techniques, LASSO, RSF-vs, SCAD, and CoxBoost, in the context of non-parametric biomedical survival prediction. Leveraging these methods, we conducted comprehensive variable selection processes. Subsequently, survival analysis models (specifically CoxPH, RSF, and DeepHit NN) were employed to construct predictive models based on the selected variables. Furthermore, we introduce a novel approach wherein only variables consistently selected by a majority of the aforementioned feature selection techniques are considered. This strategy, referred to as the proposed method, aims to enhance the reliability and robustness of variable selection and thereby improve the predictive performance of the survival analysis models. To evaluate the effectiveness of the proposed method, we compare its performance with that of the existing LASSO, RSF-vs, SCAD, and CoxBoost techniques using various performance metrics, including the integrated Brier score (IBS), concordance index (C-index), and integrated absolute error (IAE), on numerous high-dimensional survival datasets. The real data applications reveal that the proposed method outperforms the competing methods in terms of survival prediction accuracy.
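A hedged sketch of the majority-vote idea: keep only the features selected by most of the individual selection methods, then fit a survival model on the survivors. The selector outputs below are mocked; the paper uses actual LASSO, RSF-vs, SCAD, and CoxBoost selections.

```python
# Mock illustration of majority-vote feature selection across several selectors.
import numpy as np

p = 1000
rng = np.random.default_rng(5)
selected = {}                                   # boolean mask per (mock) selector
for name in ("LASSO", "RSF-vs", "SCAD", "CoxBoost"):
    mask = np.zeros(p, bool)
    mask[rng.choice(p, 30, replace=False)] = True
    selected[name] = mask

votes = np.sum(list(selected.values()), axis=0)
keep = np.where(votes >= 3)[0]                  # features chosen by a majority (>= 3 of 4)
print("features retained by majority vote:", keep.size)
# A CoxPH, RSF or DeepHit model would then be fitted on X[:, keep].
```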


Subjects
Neural Networks (Computer), Humans, Survival Analysis, Nonparametric Statistics, Computational Biology/methods
9.
Sci Rep ; 14(1): 11202, 2024 May 16.
Article in English | MEDLINE | ID: mdl-38755262

ABSTRACT

Measuring the dynamics of microbial communities results in high-dimensional measurements of taxa abundances over time and space, which is difficult to analyze due to complex changes in taxonomic compositions. This paper presents a new method to investigate and visualize the intrinsic hierarchical community structure implied by the measurements. The basic idea is to identify significant intersection sets, which can be seen as sub-communities making up the measured communities. Using the subset relationship, the intersection sets together with the measurements form a hierarchical structure visualized as a Hasse diagram. Chemical organization theory (COT) is used to relate the hierarchy of the sets of taxa to potential taxa interactions and to their potential dynamical persistence. The approach is demonstrated on a data set of community data obtained from bacterial 16S rRNA gene sequencing for samples collected monthly from four groundwater wells over a nearly 3-year period (n = 114) along a hillslope area. The significance of the hierarchies derived from the data is evaluated by showing that they significantly deviate from a random model. Furthermore, it is demonstrated how the hierarchy is related to temporal and spatial factors; and how the idea of a core microbiome can be extended to a set of interrelated core microbiomes. Together the results suggest that the approach can support developing models of taxa interactions in the future.
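A toy sketch of the intersection-set construction on hypothetical presence/absence data: compute the taxa sets shared by groups of samples and list their subset relations (the edges of a Hasse diagram before transitive reduction). The paper additionally tests the significance of these sets and applies chemical organization theory, neither of which is shown.

```python
# Toy intersection-set construction on made-up taxa presence data.
from itertools import combinations

samples = {
    "well_A": {"taxon1", "taxon2", "taxon3"},
    "well_B": {"taxon2", "taxon3", "taxon4"},
    "well_C": {"taxon3", "taxon4", "taxon5"},
}

intersections = set()
for r in range(2, len(samples) + 1):
    for group in combinations(samples.values(), r):
        common = frozenset.intersection(*map(frozenset, group))
        if common:
            intersections.add(common)          # candidate sub-communities

# Subset relations between intersection sets (Hasse-diagram edges, untrimmed)
for a in intersections:
    for b in intersections:
        if a < b:
            print(sorted(a), "is a subset of", sorted(b))
```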


Subjects
Bacteria, Microbiota, 16S Ribosomal RNA, Microbiota/genetics, 16S Ribosomal RNA/genetics, Bacteria/genetics, Bacteria/classification, Groundwater/microbiology
10.
Stat Med ; 43(17): 3164-3183, 2024 Jul 30.
Article in English | MEDLINE | ID: mdl-38807296

ABSTRACT

Cox models with time-dependent coefficients and covariates are widely used in survival analysis. In high-dimensional settings, sparse regularization techniques are employed for variable selection, but existing methods for time-dependent Cox models lack flexibility in enforcing specific sparsity patterns (i.e., covariate structures). We propose a flexible framework for variable selection in time-dependent Cox models, accommodating complex selection rules. Our method can adapt to arbitrary grouping structures, including interaction selection and temporal, spatial, tree, and directed acyclic graph structures. It achieves accurate estimation with low false alarm rates. We develop the sox package, implementing a network flow algorithm for efficiently solving models with complex covariate structures. sox offers a user-friendly interface for specifying grouping structures and delivers fast computation. Through examples, including a case study on identifying predictors of time to all-cause death in atrial fibrillation patients, we demonstrate the practical application of our method with specific selection rules.
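The structured-sparsity idea can be illustrated generically: a grouped penalty zeroes out whole groups of coefficients at once, which is what enables selection rules defined over covariate structures. The sketch below shows only the group-lasso proximal operator; it is not the network-flow solver implemented in sox, and the coefficients and groups are made up.

```python
# Conceptual illustration: block soft-thresholding removes whole groups jointly.
import numpy as np

def prox_group_lasso(beta, groups, lam):
    """Proximal operator of the group-lasso penalty: shrink each group jointly."""
    out = beta.copy()
    for g in groups:                       # g is an index array defining one group
        norm = np.linalg.norm(beta[g])
        out[g] = 0.0 if norm <= lam else (1 - lam / norm) * beta[g]
    return out

beta = np.array([0.9, 0.8, 0.05, 0.02, 0.6])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4])]
print(prox_group_lasso(beta, groups, lam=0.3))   # the weak group [2, 3] is removed
```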


Subjects
Algorithms, Proportional Hazards Models, Humans, Survival Analysis, Atrial Fibrillation, Time Factors, Computer Simulation
11.
EBioMedicine ; 103: 105130, 2024 May.
Article in English | MEDLINE | ID: mdl-38653188

ABSTRACT

BACKGROUND: Active surveillance pharmacovigilance is an emerging approach to identify medications with unanticipated effects. We previously developed a framework called pharmacopeia-wide association studies (PharmWAS) that limits false positive medication associations through high-dimensional confounding adjustment and set enrichment. We aimed to assess the transportability and generalizability of the PharmWAS framework by using medical claims data to reproduce known medication associations with Clostridioides difficile infection (CDI) or gastrointestinal bleeding (GIB). METHODS: We conducted case-control studies using Optum's de-identified Clinformatics Data Mart Database of individuals enrolled in large commercial and Medicare Advantage health plans in the United States. Individuals with CDI (from 2010 to 2015) or GIB (from 2010 to 2021) were matched to controls by age and sex. We identified all medications utilized prior to diagnosis and analysed the association of each with CDI or GIB using conditional logistic regression adjusted for risk factors for the outcome and a high-dimensional propensity score. FINDINGS: For the CDI study, we identified 55,137 cases, 220,543 controls, and 290 medications to analyse. Antibiotics with Gram-negative spectrum, including ciprofloxacin (aOR 2.83), ceftriaxone (aOR 2.65), and levofloxacin (aOR 1.60), were strongly associated. For the GIB study, we identified 450,315 cases, 1,801,260 controls, and 354 medications to analyse. Antiplatelets, anticoagulants, and non-steroidal anti-inflammatory drugs, including ticagrelor (aOR 2.81), naproxen (aOR 1.87), and rivaroxaban (aOR 1.31), were strongly associated. INTERPRETATION: These studies demonstrate the generalizability and transportability of the PharmWAS pharmacovigilance framework. With additional validation, PharmWAS could complement traditional passive surveillance systems to identify medications that unexpectedly provoke or prevent high-impact conditions. FUNDING: U.S. National Institute of Diabetes and Digestive and Kidney Diseases.
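A greatly simplified sketch of the adjustment idea on simulated placeholders: estimate a high-dimensional propensity score for one medication and include it as a covariate in the outcome model. The actual PharmWAS analyses use age- and sex-matched sets with conditional logistic regression, so this is only a conceptual stand-in; all variables and effect sizes are assumptions.

```python
# Simplified high-dimensional propensity-score adjustment on simulated data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(12)
n, p = 2000, 200
covariates = rng.normal(size=(n, p))                            # claims-derived covariates
drug = rng.binomial(1, 1 / (1 + np.exp(-covariates[:, 0])))     # exposure of interest
outcome = rng.binomial(1, 1 / (1 + np.exp(-(0.7 * drug + covariates[:, 0]))))  # e.g. CDI

ps = (LogisticRegression(C=0.1, max_iter=2000)
      .fit(covariates, drug).predict_proba(covariates)[:, 1])   # propensity score
adj = LogisticRegression(max_iter=2000).fit(np.column_stack([drug, ps]), outcome)
print("adjusted odds ratio for the medication:", np.exp(adj.coef_[0][0]).round(2))
```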


Assuntos
Clostridioides difficile , Infecções por Clostridium , Hemorragia Gastrointestinal , Farmacovigilância , Humanos , Infecções por Clostridium/epidemiologia , Infecções por Clostridium/etiologia , Infecções por Clostridium/tratamento farmacológico , Estudos de Casos e Controles , Masculino , Hemorragia Gastrointestinal/induzido quimicamente , Hemorragia Gastrointestinal/epidemiologia , Hemorragia Gastrointestinal/etiologia , Feminino , Idoso , Pessoa de Meia-Idade , Antibacterianos/efeitos adversos , Antibacterianos/uso terapêutico , Estados Unidos/epidemiologia , Fatores de Risco , Adulto , Idoso de 80 Anos ou mais
12.
Neurochem Int ; 176: 105743, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38641026

ABSTRACT

Neonatal brain inflammation produced by intraperitoneal (i.p.) injection of lipopolysaccharide (LPS) results in long-lasting brain dopaminergic injury and motor disturbances in adult rats. The goal of the present work is to investigate the effect of dopaminergic injury induced by neonatal systemic LPS exposure (1 or 2 mg/kg, i.p. injection on postnatal day 5, P5, in male rats) on methamphetamine (METH)-induced behavioral sensitization as an indicator of drug addiction. On P70, subjects underwent a treatment schedule of five once-daily subcutaneous (s.c.) administrations of METH (0.5 mg/kg) (P70-P74) to induce behavioral sensitization. Ninety-six hours following the fifth METH treatment (P78), the rats received one dose of 0.5 mg/kg METH (s.c.) to reinstate behavioral sensitization. Hyperlocomotion is a critical index of drug abuse, and METH administration has been shown to produce marked locomotor-enhancing effects. Therefore, a random forest model was used as the detector to extract feature interaction patterns from the collected high-dimensional locomotor data. Our approach identified neonatal systemic LPS exposure dose and METH treatment dates as features significantly associated with METH-induced behavioral sensitization, reinstated behavioral sensitization, and perinatal inflammation in this experimental model of drug addiction. Overall, the analysis suggests that the machine learning strategy is sensitive enough to detect interaction patterns in locomotor activity. Neonatal LPS exposure also enhanced the METH-induced reduction of dopamine transporter expression and [3H]dopamine uptake, reduced mitochondrial complex I activity, and elevated interleukin-1β and cyclooxygenase-2 concentrations in the P78 rat striatum. These results indicate that neonatal systemic LPS exposure produces a persistent dopaminergic lesion leading to a long-lasting change in the brain reward system, as indicated by the enhanced METH-induced behavioral sensitization and reinstated behavioral sensitization later in life. These findings indicate that early-life brain inflammation may enhance susceptibility to the development of drug addiction later in life, which provides new insights for developing potential therapeutic treatments for drug addiction.
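An illustrative sketch (not the authors' code) of using a random forest's impurity-based feature importances to rank experimental factors against a locomotor outcome. The mock features for LPS dose and METH treatment day, and the simulated response, are assumptions.

```python
# Mock random-forest importance ranking for simulated locomotor data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(11)
n = 300
lps_dose = rng.choice([0.0, 1.0, 2.0], n)            # neonatal LPS dose (mg/kg)
meth_day = rng.integers(1, 6, n)                     # METH treatment day (1-5)
noise = rng.normal(size=(n, 10))                     # other recorded covariates
locomotion = 5 + 2 * lps_dose + 1.5 * meth_day + rng.normal(0, 1, n)

X = np.column_stack([lps_dose, meth_day, noise])
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, locomotion)
names = ["lps_dose", "meth_day"] + [f"cov{i}" for i in range(10)]
top = sorted(zip(names, rf.feature_importances_), key=lambda t: -t[1])[:3]
print("most important features:", top)
```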


Assuntos
Animais Recém-Nascidos , Lipopolissacarídeos , Aprendizado de Máquina , Metanfetamina , Animais , Metanfetamina/farmacologia , Metanfetamina/toxicidade , Ratos , Masculino , Lipopolissacarídeos/toxicidade , Comportamento Animal/efeitos dos fármacos , Estimulantes do Sistema Nervoso Central/farmacologia , Encefalite/induzido quimicamente , Encefalite/metabolismo , Doenças Neuroinflamatórias/tratamento farmacológico , Doenças Neuroinflamatórias/induzido quimicamente , Doenças Neuroinflamatórias/metabolismo , Locomoção/efeitos dos fármacos , Locomoção/fisiologia , Feminino , Ratos Sprague-Dawley , Atividade Motora/efeitos dos fármacos
13.
J Multivar Anal ; 202, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38433779

ABSTRACT

Network estimation has been a critical component of high-dimensional data analysis and can provide an understanding of the underlying complex dependence structures. Among existing approaches, Gaussian graphical models have been highly popular. However, they still have limitations due to the homogeneous distribution assumption and the fact that they are only applicable to small-scale data. For example, cancers have various levels of unknown heterogeneity, and biological networks, which include thousands of molecular components, often differ across subgroups while also sharing some commonalities. In this article, we propose a new joint estimation approach for multiple networks with unknown sample heterogeneity, by decomposing the Gaussian graphical model (GGM) into a collection of sparse regression problems. A reparameterization technique and a composite minimax concave penalty are introduced to effectively accommodate the subgroup-specific and common information across the networks of multiple subgroups. This makes the proposed estimator a significant advance over existing heterogeneity network analyses based directly on the regularized GGM likelihood, while enjoying scale invariance, tuning insensitivity, and convexity of the optimization. The proposed analysis can be effectively realized using parallel computing. The estimation and selection consistency properties are rigorously established. The proposed approach allows the theoretical studies to focus on independent network estimation only and has the significant advantage of being both theoretically and computationally applicable to large-scale data. Extensive numerical experiments with simulated data and the TCGA breast cancer data demonstrate the prominent performance of the proposed approach in both subgroup and network identification.
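The decomposition of graph estimation into node-wise sparse regressions can be sketched in its simplest single-group, lasso-penalized form, in the spirit of neighborhood selection; the paper extends this to multiple heterogeneous subgroups with a composite MCP penalty. All simulated data and the penalty level below are assumptions.

```python
# Node-wise sparse regressions as a toy stand-in for GGM edge estimation.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p = 200, 30
X = rng.normal(size=(n, p))
X[:, 1] += 0.8 * X[:, 0]                       # induce one true edge (0, 1)

adj = np.zeros((p, p), bool)
for j in range(p):
    others = np.delete(np.arange(p), j)
    coef = Lasso(alpha=0.1).fit(X[:, others], X[:, j]).coef_
    adj[j, others] = coef != 0                 # estimated neighbors of node j

edges = np.argwhere(np.triu(adj & adj.T, 1))   # AND rule to symmetrize
print("estimated edges:", edges.tolist()[:5])
```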

14.
Biometrics ; 80(1), 2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38465987

ABSTRACT

High-dimensional data sets are often available in genome-enabled prediction. Such data sets include nonlinear relationships with complex dependence structures. For such situations, vine copula-based (quantile) regression is an important tool. However, current vine copula-based regression approaches do not scale up to high and ultra-high dimensions. To perform high-dimensional sparse vine copula-based regression, we propose two methods. First, we show their superiority over existing methods in terms of computational complexity. Second, we define relevant, irrelevant, and redundant explanatory variables for quantile regression. We then show our methods' power in selecting relevant variables and their prediction accuracy in high-dimensional sparse data sets via simulation studies. Next, we apply the proposed methods to high-dimensional real data, aiming at the genomic prediction of maize traits. Some data processing and feature extraction steps for the real data are further discussed. Finally, we show the advantage of our methods over linear models and quantile regression forests in simulation studies and real data applications.
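The vine copula methods themselves are not reproduced here; the sketch below only shows a linear quantile-regression comparator of the kind mentioned in the abstract, fitted to simulated marker-like data as a hypothetical stand-in for the maize traits.

```python
# Linear quantile-regression baseline on simulated data (comparator only).
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(13)
n, p = 300, 40
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] * X[:, 2] + rng.normal(0, 1, n)    # nonlinear dependence

q90 = QuantileRegressor(quantile=0.9, alpha=0.001).fit(X, y)   # 90%-quantile model
print("coefficients kept (non-zero):", int(np.sum(q90.coef_ != 0)))
```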


Subjects
Genome, Genomics, Genomics/methods, Computer Simulation, Linear Models, Phenotype
15.
Metabolomics ; 20(2): 35, 2024 Mar 05.
Article in English | MEDLINE | ID: mdl-38441696

ABSTRACT

INTRODUCTION: Longitudinal biomarkers in patients with community-acquired pneumonia (CAP) may help in monitoring disease progression and treatment response. The metabolic host response could be a potential source of such biomarkers, since it closely reflects the current health status of the patient. OBJECTIVES: In this study we performed longitudinal metabolite profiling in patients with CAP across a comprehensive range of metabolites to identify potential host-response biomarkers. METHODS: Previously collected serum samples from CAP patients with confirmed Streptococcus pneumoniae infection (n = 25) were used. Samples were collected at multiple time points, up to 30 days after admission. A wide range of metabolites was measured, including amines, acylcarnitines, organic acids, and lipids. The associations between metabolites and C-reactive protein (CRP), procalcitonin, the CURB disease severity score at admission, and total length of stay were evaluated. RESULTS: Distinct longitudinal metabolite profiles were identified, including cholesteryl esters, diacyl-phosphatidylethanolamine, diacylglycerols, lysophosphatidylcholines, sphingomyelin, and triglycerides. Positive correlations were found between CRP and phosphatidylcholine (34:1) (cor = 0.63), and negative correlations were found between CRP and nine lysophosphocholines (cor = -0.57 to -0.74). The CURB disease severity score was negatively associated with six metabolites, including acylcarnitines (tau = -0.64 to -0.58). Negative correlations were found between the length of stay and six triglycerides (TGs), especially TGs (60:3) and (58:2) (cor = -0.63 and -0.61). CONCLUSION: The identified metabolites may provide insight into the biological mechanisms underlying disease severity and may be of interest for exploration as potential treatment-response monitoring biomarkers.


Subjects
Pneumonia, Streptococcus pneumoniae, Humans, Metabolomics, C-Reactive Protein, Biomarkers, Triglycerides
16.
BMC Genomics ; 25(1): 152, 2024 Feb 07.
Article in English | MEDLINE | ID: mdl-38326768

ABSTRACT

BACKGROUND: The accurate prediction of genomic breeding values is central to genomic selection in both plant and animal breeding studies. Genomic prediction involves the use of thousands of molecular markers spanning the entire genome and therefore requires methods able to efficiently handle high-dimensional data. Not surprisingly, machine learning methods are becoming widely advocated and used in genomic prediction studies. These methods encompass different groups of supervised and unsupervised learning methods. Although several studies have compared the predictive performances of individual methods, studies comparing the predictive performance of different groups of methods are rare. However, such studies are crucial for (i) identifying groups of methods with superior genomic predictive performance and (ii) assessing the merits and demerits of such groups of methods relative to each other and to the established classical methods. Here, we comparatively evaluate the genomic predictive performance, and informally assess the computational cost, of several groups of supervised machine learning methods, specifically regularized regression methods and deep, ensemble, and instance-based learning algorithms, using one simulated animal breeding dataset and three empirical maize breeding datasets obtained from a commercial breeding program. RESULTS: Our results show that the relative predictive performance and computational expense of the groups of machine learning methods depend on both the data and the target traits, and that for classical regularized methods, increasing model complexity can incur huge computational costs but does not necessarily improve predictive accuracy. Thus, despite their greater complexity and computational burden, neither the adaptive nor the group regularized methods clearly improved upon the results of their simple regularized counterparts. This rules out the selection of a single procedure among machine learning methods for routine use in genomic prediction. The results also show that, because of their competitive predictive performance, computational efficiency, simplicity, and therefore relatively few tuning parameters, the classical linear mixed model and regularized regression methods are likely to remain strong contenders for genomic prediction. CONCLUSIONS: The dependence of predictive performance and computational burden on target datasets and traits calls for increasing investment in enhancing the computational efficiency of machine learning algorithms and computing resources.


Subjects
Deep Learning, Animals, Plant Breeding, Genome, Genomics/methods, Machine Learning
17.
BMC Bioinformatics ; 25(1): 57, 2024 Feb 05.
Article in English | MEDLINE | ID: mdl-38317067

ABSTRACT

BACKGROUND: Controlling the False Discovery Rate (FDR) in Multiple Comparison Procedures (MCPs) has widespread applications in many scientific fields. Previous studies show that the correlation structure between test statistics increases the variance and bias of the FDR. The objective of this study is to modify the effect of correlation in MCPs based on information theory. We proposed three modified procedures (M1, M2, and M3), under strong, moderate, and mild assumptions, based on the conditional Fisher information of the consecutive sorted test statistics, for controlling the false discovery rate under arbitrary correlation structures. The performance of the proposed procedures was compared with the Benjamini-Hochberg (BH) and Benjamini-Yekutieli (BY) procedures in a simulation study and in real high-dimensional colorectal cancer gene expression data. In the simulation study, we generated 1000 differential multivariate Gaussian features with different levels of correlation structure and screened the significant features with the FDR-controlling procedures, with strong control of the familywise error rate. RESULTS: When there was no correlation between the 1000 simulated features, the performance of the BH procedure was similar to that of the three proposed procedures. In low to medium correlation structures, the BY procedure is too conservative, the BH procedure is too liberal, and the mean number of features screened by BH remained constant across the different levels of correlation between features. The mean number of features screened by the proposed procedures lay between those of the BY and BH procedures and decreased as the correlations increased. Where the features were highly correlated, the number of features screened by the proposed procedures approached that of the Bonferroni (BF) procedure, as expected. In the real data analysis, the BY, BH, M1, M2, and M3 procedures were applied to screen colorectal cancer gene expression data. To fit a predictive model based on the screened features, the Efficient Bayesian Logistic Regression (EBLR) model was used. The EBLR models fitted on the features screened by the M1 and M2 procedures have minimum entropies and are more efficient than those based on the BY and BH procedures. CONCLUSION: The modified procedures, based on information theory, are much more flexible than the BH and BY procedures with respect to the amount of correlation between test statistics. The modified procedures avoided screening non-informative features, so the number of screened features decreased as the level of correlation increased.
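For orientation, the two standard baselines compared in the paper (BH and BY) can be run on simulated, equicorrelated test statistics as below; the authors' modified procedures M1-M3 are not reproduced, and the simulation settings are assumptions.

```python
# BH vs BY FDR control on simulated correlated test statistics (baselines only).
import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(7)
m, rho = 1000, 0.5
shared = rng.normal()                                                # common factor
z = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.normal(size=m)    # correlated nulls
z[:50] += 3.0                                                        # 50 true signals
pvals = 2 * norm.sf(np.abs(z))

for method in ("fdr_bh", "fdr_by"):
    reject, *_ = multipletests(pvals, alpha=0.05, method=method)
    print(method, "features screened:", int(reject.sum()))
```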


Subjects
Colorectal Neoplasms, Information Theory, Humans, Bayes Theorem, Genomics, Computer Simulation
18.
J Appl Stat ; 51(2): 279-297, 2024.
Article in English | MEDLINE | ID: mdl-38283051

ABSTRACT

Model averaging (MA) is a modelling strategy in which the uncertainty about the configuration of selected variables is taken into account by weight-combining the estimates of the so-called 'candidate models'. Some studies have shown that MA enables better prediction, even in high-dimensional cases. However, little is known about model prediction performance under different types of multicollinearity in high-dimensional data. Motivated by the calibration of near-infrared (NIR) instruments, we focus on MA prediction performance in such data. The weighting schemes that we consider are based on Akaike's information criterion (AIC), Mallows' Cp, and cross-validation. For estimating the model parameters, we consider the standard least squares and ridge regression methods. The results indicate that MA outperforms model selection methods such as LASSO and SCAD in high-correlation data. The use of Mallows' Cp and cross-validation for the weights tends to yield similar results in all correlation structures, although the former is generally preferred. We also find that ridge model averaging outperforms least-squares model averaging. This research therefore suggests ridge model averaging for building a relatively better prediction of the NIR calibration model.
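A minimal sketch of AIC-weighted model averaging over a few arbitrary candidate OLS models; the paper also studies Mallows' Cp and cross-validation weights as well as ridge estimation, none of which are shown here, and the candidate sets below are illustrative assumptions.

```python
# AIC-weighted model averaging over hand-picked candidate OLS models.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 100
X = rng.normal(size=(n, 5))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, n)

candidates = [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3, 4]]    # candidate variable sets
fits = [sm.OLS(y, sm.add_constant(X[:, c])).fit() for c in candidates]

aic = np.array([f.aic for f in fits])
w = np.exp(-0.5 * (aic - aic.min()))
w /= w.sum()                                              # Akaike weights

pred = sum(wi * f.predict(sm.add_constant(X[:, c]))
           for wi, f, c in zip(w, fits, candidates))      # averaged prediction
print("Akaike weights:", w.round(3))
```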

19.
Biostatistics ; 25(2): 486-503, 2024 Apr 15.
Article in English | MEDLINE | ID: mdl-36797830

ABSTRACT

In prospective genomic studies (e.g., DNA methylation, metagenomics, and transcriptomics), it is crucial to estimate the overall fraction of phenotypic variance (OFPV) attributed to the high-dimensional genomic variables, a concept similar to heritability analyses in genome-wide association studies (GWAS). Unlike genetic variants in GWAS, these genomic variables are typically measured with error due to technical limitations and temporal instability. While the existing methods developed for GWAS can be used, ignoring measurement error may severely underestimate the OFPV and mislead the design of future studies. Assuming that measurement error variances are distributed similarly between causal and noncausal variables, we show that the asymptotic attenuation factor equals the average intraclass correlation coefficient of all genomic variables, which can be estimated from a pilot study with repeated measurements. We illustrate the method by estimating the contribution of microbiome taxa to body mass index and multiple allergy traits in the American Gut Project. Finally, we show that measurement error does not cause meaningful bias when estimating the correlation of effect sizes for two traits.
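A hedged numerical sketch of the attenuation argument: with two replicate measurements per variable, per-variable intraclass correlation coefficients (ICCs) can be estimated, and dividing a naive variance-explained estimate by the mean ICC corrects the attenuation. The data-generating values and the naive estimate below are assumptions, not results from the paper.

```python
# Toy attenuation correction: mean ICC from a pilot with two replicates.
import numpy as np

rng = np.random.default_rng(9)
n, p = 200, 100
truth = rng.normal(size=(n, p))                      # error-free genomic variables
rep1 = truth + rng.normal(0, 0.7, size=(n, p))       # replicate 1 (with measurement error)
rep2 = truth + rng.normal(0, 0.7, size=(n, p))       # replicate 2

# Per-variable ICC = Var(true signal) / Var(observed measurement)
between = np.var((rep1 + rep2) / 2, axis=0) - np.var(rep1 - rep2, axis=0) / 4
total = (np.var(rep1, axis=0) + np.var(rep2, axis=0)) / 2
icc = between / total

naive_ofpv = 0.20                                    # hypothetical uncorrected estimate
corrected_ofpv = naive_ofpv / icc.mean()             # divide by the attenuation factor
print("mean ICC:", icc.mean().round(2), "corrected OFPV:", corrected_ofpv.round(2))
```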


Subjects
Genome-Wide Association Study, Genome, Humans, Genome-Wide Association Study/methods, Pilot Projects, Prospective Studies, Phenotype, Single Nucleotide Polymorphism
20.
Biom J ; 66(1): e2200207, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37421205

ABSTRACT

Variable selection methods based on L0 penalties have excellent theoretical properties for selecting sparse models in a high-dimensional setting. There exist modifications of the Bayesian Information Criterion (BIC) which control either the familywise error rate (mBIC) or the false discovery rate (mBIC2) with respect to which regressors are selected to enter a model. However, the minimization of L0 penalties is a mixed-integer problem, which is known to be NP-hard and therefore becomes computationally challenging as the number of regressor variables grows. This is one reason why alternatives like the LASSO, which involve convex optimization problems that are easier to solve, have become so popular. The last few years have seen real progress in developing new algorithms to minimize L0 penalties. The aim of this article is to compare the performance of these algorithms in terms of minimizing L0-based selection criteria. Simulation studies covering a wide range of scenarios inspired by genetic association studies are used to compare the values of the selection criteria obtained with different algorithms. In addition, some statistical characteristics of the selected models and the runtimes of the algorithms are compared. Finally, the performance of the algorithms is illustrated in a real data example concerned with expression quantitative trait loci (eQTL) mapping.
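To make the L0 setting concrete, the sketch below scores small candidate models by brute force with a modified BIC of the form BIC + 2k·log(p/4), one common mBIC parameterisation (an assumption here, not taken from the article); the algorithms compared in the paper are designed precisely to avoid this kind of exhaustive enumeration.

```python
# Brute-force minimisation of an mBIC-style criterion on a tiny simulated problem.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(10)
n, p = 100, 12                                      # small p so enumeration is feasible
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(0, 1, n)

def mbic(support):
    """n*log(RSS/n) + k*log(n) + 2*k*log(p/4): an mBIC-style L0 criterion."""
    k = len(support)
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in support])
    rss = np.sum((y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]) ** 2)
    return n * np.log(rss / n) + k * np.log(n) + 2 * k * np.log(p / 4)

best = min((s for r in range(4) for s in combinations(range(p), r)), key=mbic)
print("selected variables:", best)
```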


Subjects
Algorithms, Quantitative Trait Loci, Bayes Theorem, Computer Simulation