Results 1 - 20 of 121
1.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37225428

ABSTRACT

The prediction of drug-drug interactions (DDIs) is essential for the development and repositioning of new drugs, and DDIs play a vital role in biopharmaceutical development, disease diagnosis and pharmacological treatment. This article proposes a new method, DBGRU-SE, for predicting DDIs. First, FP3 fingerprints, MACCS fingerprints, PubChem fingerprints, and 1D and 2D molecular descriptors are used to extract drug feature information. Second, Group Lasso is used to remove redundant features. Then, SMOTE-ENN is applied to balance the data and obtain the best feature vectors. Finally, the best feature vectors are fed into a classifier that combines BiGRU with a squeeze-and-excitation (SE) attention mechanism to predict DDIs. Under five-fold cross-validation, the ACC values of the DBGRU-SE model on the two datasets are 97.51% and 94.98%, and the AUC values are 99.60% and 98.85%, respectively. The results show that DBGRU-SE has good predictive performance for drug-drug interactions.
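As a rough illustration of the balancing step described in this abstract (not the authors' code), imblearn's SMOTEENN can be applied to an imbalanced feature matrix before classification; the synthetic data and the plain logistic-regression classifier below are placeholders for the fused fingerprint features and the BiGRU + SE network of the paper.

    # Minimal sketch of SMOTE-ENN balancing before classification (illustrative only).
    import numpy as np
    from imblearn.combine import SMOTEENN              # SMOTE oversampling + ENN cleaning
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 50))                    # stand-in for fused drug-pair features
    y = (rng.random(1000) < 0.1).astype(int)           # imbalanced DDI labels (~10% positive)

    X_bal, y_bal = SMOTEENN(random_state=0).fit_resample(X, y)

    # A plain classifier stands in for the BiGRU + SE attention network of the paper.
    clf = LogisticRegression(max_iter=1000)
    print(cross_val_score(clf, X_bal, y_bal, cv=5, scoring="roc_auc").mean())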


Subjects
Computational Biology, Drug Interactions, Computational Biology/methods
2.
BMC Bioinformatics ; 25(1): 283, 2024 Aug 29.
Article in English | MEDLINE | ID: mdl-39210319

ABSTRACT

BACKGROUND: Copy number variants (CNVs) have become increasingly instrumental in understanding the etiology of a wide range of diseases and phenotypes, including Neurocognitive Disorders (NDs). Well-established regions associated with NDs include deletions of a small part of chromosome 16 (16p11.2) and duplications on chromosome 15 (15q3). Various methods have been developed to identify associations between CNVs and diseases of interest, most of them based on statistical inference techniques. However, because of the multi-dimensional nature of CNV features, these methods are still immature. In addition, the regions discovered by different methods are large, while the causative regions may be much smaller. RESULTS: In this study, we propose a regularized deep learning model to select causal regions for the target disease. Using the proximal gradient descent algorithm, the model applies the group LASSO concept and embeds a deep learning model in a sparsity framework. We perform the CNV analysis for 74,811 individuals with three types of brain disorders (autism spectrum disorder (ASD), schizophrenia (SCZ), and developmental delay (DD)) and also perform a cumulative analysis to discover the regions that are common among the NDs. Brain expression of the disease-associated genes is increased by an average of 20 percent, and genes with mouse homologs that cause nervous system phenotypes are increased by 18 percent on average. The DECIPHER database is also used to find other phenotypes connected to the detected regions, alongside gene ontology analysis. The target diseases are correlated with some unexplored regions, such as deletions on 1q21.1 and 1q21.2 (for ASD), deletions on 20q12 (for SCZ), and duplications on 8p23.3 (for DD). Furthermore, our method is compared with other machine learning algorithms. CONCLUSIONS: Our model effectively identifies regions associated with phenotypic traits using regularized deep learning. Rather than attempting to analyze the whole genome, CNVDeep allows us to focus only on the causative regions of disease.
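The core computational ingredient named here, a proximal gradient step with a group-LASSO penalty, amounts to block-wise soft-thresholding of each coefficient group; a minimal numpy sketch (illustrative only, not the CNVDeep implementation) is shown below.

    # Sketch of one proximal gradient step with a group-lasso penalty (illustrative only).
    import numpy as np

    def group_prox(beta, groups, lam, step):
        """Block soft-thresholding: shrink each coefficient group toward zero."""
        out = beta.copy()
        for idx in groups:                       # idx: array of positions forming one group
            norm = np.linalg.norm(beta[idx])
            scale = max(0.0, 1.0 - step * lam / norm) if norm > 0 else 0.0
            out[idx] = scale * beta[idx]
        return out

    def prox_gradient_step(beta, grad, groups, lam, step):
        """Gradient step on the smooth loss followed by the group-lasso prox."""
        return group_prox(beta - step * grad, groups, lam, step)

    # Example: 6 coefficients in two groups of three.
    beta = np.array([0.5, -0.2, 0.1, 2.0, 1.5, -1.0])
    grad = np.zeros_like(beta)
    groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]
    print(prox_gradient_step(beta, grad, groups, lam=1.0, step=0.5))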


Subjects
DNA Copy Number Variations, Deep Learning, Schizophrenia, DNA Copy Number Variations/genetics, Humans, Schizophrenia/genetics, Neurocognitive Disorders/genetics, Autism Spectrum Disorder/genetics, Algorithms, Developmental Disabilities/genetics, Chromosome Deletion, Human Chromosomes Pair 16/genetics, Human Chromosomes Pair 15/genetics
3.
Biom J ; 66(4): e2200334, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38747086

ABSTRACT

Many data sets exhibit a natural group structure due to contextual similarities or high correlations of variables, such as lipid markers that are interrelated based on biochemical principles. Knowledge of such groupings can be used through bi-level selection methods to identify relevant feature groups and highlight their predictive members. One of the best-known approaches of this kind combines the classical Least Absolute Shrinkage and Selection Operator (LASSO) with the Group LASSO, resulting in the Sparse Group LASSO (SGL). We propose the Sparse Group Penalty (SGP) framework, which allows for a flexible combination of different SGL-style shrinkage conditions. Analogous to SGL, we investigated the combination of the Smoothly Clipped Absolute Deviation (SCAD), the Minimax Concave Penalty (MCP) and the Exponential Penalty (EP) with their group versions, resulting in the Sparse Group SCAD, the Sparse Group MCP, and the novel Sparse Group EP (SGE). These shrinkage operators provide refined control of the effect of group formation on the selection process through a tuning parameter. In simulation studies, SGPs were compared with other bi-level selection methods (Group Bridge, composite MCP, and Group Exponential LASSO) for variable and group selection, evaluated with the Matthews correlation coefficient. We demonstrate the advantages of the new SGE in identifying parsimonious models, but also identify scenarios that highlight the limitations of the approach. The performance of the techniques was further investigated in a real-world use case involving the selection of regulated lipids in a randomized clinical trial.
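For reference, the Sparse Group LASSO penalty that the SGP framework generalizes is commonly written as

$$
P_{\lambda,\alpha}(\beta) \;=\; \lambda \Big[ \alpha \, \lVert \beta \rVert_1 \;+\; (1-\alpha) \sum_{g=1}^{G} \sqrt{p_g}\, \lVert \beta_g \rVert_2 \Big],
$$

where $\beta_g$ is the coefficient subvector of group $g$, $p_g$ its size, and $\alpha \in [0,1]$ balances within-group (lasso) and between-group (group lasso) shrinkage; the SGP variants replace the $\ell_1$ and group terms with SCAD, MCP, or exponential-penalty analogues. This is the standard textbook form, not a formula quoted from the article.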


Subjects
Biometry, Biometry/methods, Humans
4.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34086850

ABSTRACT

For high-dimensional expression data, most prognostic models perform feature selection based on individual genes, which usually leads to unstable prognoses, and the identified risk genes are inherently insufficient for revealing complex molecular mechanisms. Since most genes carry out cellular functions by forming protein complexes, the basic representatives of functional modules, identifying risk protein complexes may greatly improve our understanding of disease biology. Coupled with the fact that protein complexes have been shown to have innate resistance to batch effects and are effective predictors of disease phenotypes, constructing prognostic models and selecting features with protein complexes as the basic unit should improve the robustness and biological interpretability of the model. Here, we propose a protein complex-based, group lasso-Cox model (PCLasso) to predict patient prognosis and identify risk protein complexes. Experiments on three cancer types show that PCLasso has better prognostic performance than prognostic models based on individual genes. The resulting risk protein complexes not only contain individual risk genes but also incorporate close partners that synergize with them, which may help reveal molecular mechanisms of cancer progression from a more comprehensive perspective. Furthermore, a pan-cancer prognostic analysis was performed to identify risk protein complexes for 19 cancer types, which may provide novel potential targets for cancer research.
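A generic group-lasso penalized Cox criterion of the kind PCLasso builds on, with each group collecting the genes of one protein complex, takes the standard form (shown here for reference, not quoted from the paper):

$$
\hat\beta \;=\; \arg\min_{\beta}\; -\frac{1}{n}\sum_{i:\,\delta_i = 1}\Big[ x_i^{\top}\beta - \log\!\sum_{j \in R(t_i)} \exp\big(x_j^{\top}\beta\big) \Big] \;+\; \lambda \sum_{g=1}^{G} \sqrt{p_g}\,\lVert \beta_g \rVert_2 ,
$$

where $\delta_i$ is the event indicator and $R(t_i)$ the risk set at time $t_i$; since protein complexes can share genes, an overlapping-group extension of this penalty may be needed in practice.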


Subjects
Algorithms, Biomarkers, Computational Biology/methods, Multiprotein Complexes/metabolism, Proportional Hazards Models, Tumor Biomarkers, Genetic Databases, Neoplastic Gene Expression Regulation, Humans, Neoplasms/diagnosis, Neoplasms/etiology, Neoplasms/metabolism, Neoplasms/mortality, Prognosis, Reproducibility of Results, Risk Assessment, Survival Analysis
5.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34254998

ABSTRACT

Statistical analysis of ultrahigh-dimensional omics-scale data has long depended on univariate hypothesis testing. With growing numbers of data features and samples, the obvious next step is to establish multivariable association analysis as a routine method for describing genotype-phenotype associations. Here we present ParProx, a state-of-the-art implementation for optimizing overlapping and non-overlapping group lasso regression models for time-to-event and classification analysis, with selection of variables grouped by biological priors. ParProx enables multivariable model fitting for ultrahigh-dimensional data within an architecture for parallel or distributed computing via latent variable group representation. It thereby aims to produce interpretable regression models consistent with known biological relationships among independent variables, a property often explored post hoc but not during model estimation. Simulation studies clearly demonstrate the scalability of ParProx with graphics processing units in comparison to existing implementations. We illustrate the tool using three different omics data sets featuring moderate to large numbers of variables, where we use genomic regions and biological pathways as variable groups, rendering the selected independent variables directly interpretable with respect to those groups. ParProx is applicable to a wide range of studies using ultrahigh-dimensional omics data, from genome-wide association analysis to multi-omics studies where model estimation is computationally intractable with existing implementations.
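The latent variable group representation mentioned here is commonly implemented by duplicating shared columns so that overlapping groups become disjoint; a minimal sketch of that trick (not ParProx code) follows. After fitting a disjoint group lasso on the expanded design, the coefficient of a shared column is recovered by summing its latent copies.

    # Sketch of the latent (duplication) trick for overlapping group lasso, illustrative only.
    import numpy as np

    def expand_overlapping_groups(X, groups):
        """Duplicate columns so each latent group is disjoint.

        groups: list of index arrays; a column appearing in several groups is copied
        once per group. Returns the expanded matrix and the disjoint latent groups.
        """
        cols, latent_groups, start = [], [], 0
        for idx in groups:
            cols.append(X[:, idx])
            latent_groups.append(np.arange(start, start + len(idx)))
            start += len(idx)
        return np.hstack(cols), latent_groups

    X = np.random.default_rng(1).normal(size=(5, 4))
    groups = [np.array([0, 1, 2]), np.array([2, 3])]   # column 2 belongs to both groups
    X_lat, latent = expand_overlapping_groups(X, groups)
    print(X_lat.shape, [g.tolist() for g in latent])   # (5, 5), disjoint latent groups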


Subjects
Algorithms, Computational Biology/methods, Genomics/methods, Regression Analysis, Software, Biomarkers, Disease Susceptibility, Gene Expression Profiling, Humans, Mutation, Prognosis, Proportional Hazards Models, Protein Interaction Mapping
6.
Stat Med ; 42(10): 1625-1639, 2023 05 10.
Article in English | MEDLINE | ID: mdl-36822218

ABSTRACT

We focus on identifying genomic risk factors for higher body mass index (BMI) by incorporating a priori information, such as biological pathways. However, the commonly used methods for incorporating prior information model the mean function of the outcome and rely on assumptions that are often unmet. To address these concerns, we propose a method for nonparametric additive quantile regression with network regularization that incorporates the information encoded by known networks. To account for nonlinear associations, we approximate the unknown additive functional effect of each predictor with an expansion in a B-spline basis. We implement the group Lasso penalty to obtain a sparse model. We define the network-constrained penalty as the total $\ell_2$ norm of the difference between the effect functions of any two linked genes in the known network. We further propose an efficient computational procedure to solve the optimization problem that arises in our model. Simulation studies show that the proposed method identifies more truly associated genes and fewer falsely associated genes than alternative approaches. We apply the proposed method to the microarray gene-expression dataset from the Framingham Heart Study and identify several genes associated with the 75th percentile of BMI. In conclusion, our proposed approach efficiently identifies outcome-associated variables in a nonparametric additive quantile regression framework by leveraging known network information.
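One plausible form of the penalized objective described in this abstract, assembled from its description rather than quoted from the paper, with $f_j(x) = B(x)^{\top}\gamma_j$ a B-spline expansion and $E$ the edge set of the known network, is

$$
\min_{\gamma_1,\dots,\gamma_p}\; \sum_{i=1}^{n} \rho_{\tau}\Big( y_i - \sum_{j=1}^{p} B(x_{ij})^{\top}\gamma_j \Big)
\;+\; \lambda_1 \sum_{j=1}^{p} \lVert \gamma_j \rVert_2
\;+\; \lambda_2 \sum_{(j,k)\in E} \big\lVert f_j - f_k \big\rVert_2 ,
$$

where $\rho_\tau(u) = u\,(\tau - \mathbf{1}\{u < 0\})$ is the quantile check loss; the group lasso term induces gene-level sparsity, and the network term encourages linked genes to have similar effect functions.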


Subjects
Genomics, Humans, Body Mass Index, Computer Simulation
7.
BMC Med Res Methodol ; 23(1): 254, 2023 10 28.
Article in English | MEDLINE | ID: mdl-37898791

ABSTRACT

BACKGROUND: A substantial body of clinical research involving individuals infected with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has evaluated the association between in-hospital biomarkers and severe SARS-CoV-2 outcomes, including intubation and death. However, most existing studies considered each of multiple biomarkers independently and focused analysis on baseline or peak values. METHODS: We propose a two-stage analytic strategy combining functional principal component analysis (FPCA) and sparse-group LASSO (SGL) to characterize associations between biomarkers and 30-day mortality rates. Unlike prior reports, our proposed approach leverages: 1) time-varying biomarker trajectories, 2) multiple biomarkers simultaneously, and 3) the pathophysiological grouping of these biomarkers. We apply this method to a retrospective cohort of 12,941 patients hospitalized at Massachusetts General Hospital or Brigham and Women's Hospital and conduct simulation studies to assess performance. RESULTS: Renal, inflammatory, and cardio-thrombotic biomarkers were associated with 30-day mortality rates among hospitalized SARS-CoV-2 patients. Sex-stratified analysis revealed that hematological biomarkers were associated with higher mortality in men, while this association was not identified in women. In simulation studies, our proposed method maintained high true positive rates and, with respect to false positive rates, outperformed alternative approaches that use baseline or peak values only. CONCLUSIONS: The proposed two-stage approach is a robust strategy for identifying biomarkers associated with disease severity among SARS-CoV-2-infected individuals. By leveraging information on the longitudinal trajectories of multiple, grouped biomarkers, our method offers an important first step in unraveling disease etiology and defining meaningful risk strata.
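Stage one of such a pipeline can be sketched, for trajectories observed on a common time grid, as an SVD of the centered trajectory matrix (a simplification of FPCA; not the authors' code). The resulting per-patient scores would then enter a sparse-group LASSO with biomarkers grouped by pathophysiology.

    # Simplified FPCA via SVD for trajectories sampled on a common time grid (illustrative).
    import numpy as np

    rng = np.random.default_rng(2)
    n_patients, n_times = 200, 14
    traj = rng.normal(size=(n_patients, n_times))     # one biomarker's daily values

    centered = traj - traj.mean(axis=0)               # remove the mean curve
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)

    n_comp = 3                                        # keep the leading eigenfunctions
    scores = U[:, :n_comp] * s[:n_comp]               # per-patient FPC scores
    eigenfunctions = Vt[:n_comp]                      # discretized eigenfunctions
    explained = (s[:n_comp] ** 2) / (s ** 2).sum()
    print(scores.shape, explained.round(3))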


Subjects
COVID-19, SARS-CoV-2, Male, Humans, Female, Retrospective Studies, Principal Component Analysis, Hospitalization, Biomarkers
8.
BMC Health Serv Res ; 23(1): 1419, 2023 Dec 15.
Article in English | MEDLINE | ID: mdl-38102614

ABSTRACT

BACKGROUND: Risk-adjustment (RA) models are used to account for severity of illness when comparing patient outcomes across hospitals. Researchers specify covariates as main effects, but they often ignore interactions or use stratification to account for effect modification, despite limitations due to rare events and sparse data. Three Agency for Healthcare Research and Quality (AHRQ) hospital-level Quality Indicators currently use stratified models, but their variable performance and limited interpretability motivated the design of better models. METHODS: We analysed de-identified patient discharge data from 14 State Inpatient Databases, AHRQ Healthcare Cost and Utilization Project, California Department of Health Care Access and Information, and New York State Department of Health. We used hierarchical group lasso regularisation (HGLR) to identify first-order interactions in three AHRQ Quality Indicators: IQI 09 (Pancreatic Resection Mortality Rate), IQI 11 (Abdominal Aortic Aneurysm Repair Mortality Rate), and Patient Safety Indicator (PSI) 14 (Postoperative Wound Dehiscence Rate). These models were compared with stratum-specific and composite main-effects models with covariates selected by the least absolute shrinkage and selection operator (LASSO). RESULTS: HGLR identified clinically meaningful interactions for all models. Synergistic IQI 11 interactions, such as between hypertension and respiratory failure, suggest patients who merit special attention in perioperative care. Antagonistic IQI 11 interactions, such as between shock and chronic comorbidities, illustrate that naïve main-effects models overestimate risk in key subpopulations. Interactions for PSI 14 suggest key subpopulations for whom the risk of wound dehiscence is similar between open and laparoscopic approaches, whereas the laparoscopic approach is safer for other groups. Model performance was similar or superior for composite models with HGLR-selected features compared to those with LASSO-selected features. CONCLUSIONS: In this application to high-profile, high-stakes risk-adjustment models, HGLR selected interactions that maintained or improved model performance in populations with heterogeneous risk, while identifying clinically important interactions. The HGLR package is scalable to a large number of covariates and their interactions and can be customised to use multiple CPU cores to reduce analysis time. The HGLR method will allow scholars to avoid creating stratified models on sparse data, improve model calibration, and reduce bias. Future work involves testing other combinations of risk factors, such as vital signs and laboratory values. Our study addresses a real-world problem of considerable importance to hospitals and policy-makers who must use RA models for statutorily mandated public reporting and payment programmes.
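To make the interaction-selection setup concrete, the sketch below (an illustration under assumed conventions, not the HGLR package) builds a first-order interaction design and ties each interaction column to its two main effects in one group; a group-lasso solver that tolerates overlapping groups, or the latent-duplication trick, would then be applied.

    # Sketch of a first-order interaction design with hierarchical groups (illustrative).
    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(3)
    X = rng.integers(0, 2, size=(500, 6)).astype(float)   # binary risk-factor indicators

    poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
    X_int = poly.fit_transform(X)                          # main effects, then pairwise products

    p = X.shape[1]
    groups, col = [], p
    for j in range(p):
        for k in range(j + 1, p):
            groups.append(np.array([j, k, col]))           # hierarchy: {main j, main k, j*k}
            col += 1
    print(X_int.shape, len(groups))                        # (500, 21), 15 interaction groups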


Subjects
Hospitals, Hypertension, Humans, Risk Adjustment, Risk Factors, New York
9.
Biom J ; 65(2): e2100334, 2023 02.
Article in English | MEDLINE | ID: mdl-36124712

ABSTRACT

In cardiovascular disease studies, a large number of risk factors are measured, but it often remains unknown whether all of them are relevant variables and whether their impact changes with time or remains constant. In addition, more than one kind of cardiovascular disease event can be observed in the same patient, and events of different types may be correlated. Different kinds of events are expected to be associated with different covariates, and the forms of the covariate effects may also vary between event types. To tackle these problems, we propose a multistate modeling framework for the joint analysis of multitype recurrent events and a terminal event. Model structure selection is performed to identify covariates with time-varying coefficients, time-independent coefficients, and null effects. This helps in understanding the disease process, as it detects relevant covariates and identifies the temporal dynamics of the covariate effects. It also provides a more parsimonious model that achieves better risk prediction. The performance of the proposed model and selection method is evaluated in numerical studies and illustrated on a real dataset from the Atherosclerosis Risk in Communities study.


Subjects
Cardiovascular Diseases, Statistical Models, Humans, Computer Simulation, Cardiovascular Diseases/epidemiology
10.
J Comput Chem ; 43(20): 1342-1354, 2022 07 30.
Article in English | MEDLINE | ID: mdl-35656889

ABSTRACT

Machine learning methods have helped to advance a wide range of scientific and technological fields in recent years, including computational chemistry. Because chemical systems can become complex and high-dimensional, feature selection is critical but challenging for developing reliable machine-learning-based prediction models, especially for proteins as biological macromolecules. In this study, we applied the sparse group lasso (SGL) method as a general feature selection method to develop a classification model for an allosteric protein in different functional states. This results in a much improved model with comparable accuracy (Acc) and only 28 selected features, compared with 289 selected features from a previous study. The Acc reaches 91.50% with 1936 selected features, which is far higher than that of baseline methods. In addition, grouping protein amino acids into secondary structures provides additional interpretability of the selected features. The selected features are verified as being associated with key allosteric residues through comparison with both experimental and computational studies of the model protein, demonstrating the effectiveness and necessity of applying rigorous feature selection and evaluation methods to complex chemical systems.


Subjects
Machine Learning, Proteins, Algorithms, Proteins/chemistry
11.
Stat Med ; 41(19): 3679-3695, 2022 08 30.
Article in English | MEDLINE | ID: mdl-35603639

ABSTRACT

Imbalanced classification has drawn considerable attention in the statistics and machine learning literature. Traditional classification methods often perform poorly when a severely skewed class distribution is observed, not to mention under a high-dimensional longitudinal data structure. Given the ubiquity of big data in modern health research, imbalanced classification in disease diagnosis is expected to encounter an additional level of difficulty imposed by such a complex data structure. In this article, we propose a nonparametric classification approach for imbalanced data in longitudinal and high-dimensional settings. Technically, functional principal component analysis is first applied for feature extraction under the longitudinal structure. The univariate exponential loss function coupled with a group LASSO penalty is then adopted in the classification procedure for high-dimensional settings. Along with a good improvement in imbalanced classification, our approach provides meaningful feature selection for interpretation while enjoying remarkably lower computational complexity. The proposed method is illustrated on a real data application to early detection of Alzheimer's disease, and its empirical finite-sample performance is extensively evaluated by simulations.


Subjects
Machine Learning, Research Design, Algorithms, Early Diagnosis, Humans
12.
IEEE Trans Inf Theory ; 68(9): 5975-6002, 2022 Sep.
Article in English | MEDLINE | ID: mdl-36865503

ABSTRACT

We study the sparse group Lasso for high-dimensional double sparse linear regression, where the parameter of interest is simultaneously element-wise and group-wise sparse. This problem is an important instance of the simultaneously structured model, an actively studied topic in statistics and machine learning. In the noiseless case, matching upper and lower bounds on sample complexity are established for the exact recovery of sparse vectors and for stable estimation of approximately sparse vectors, respectively. In the noisy case, upper and matching minimax lower bounds for estimation error are obtained. We also consider the debiased sparse group Lasso and investigate its asymptotic properties for the purpose of statistical inference. Finally, numerical studies are provided to support the theoretical results.

13.
Sensors (Basel) ; 22(8)2022 Apr 13.
Article in English | MEDLINE | ID: mdl-35458963

ABSTRACT

Sparsity-based methods have recently come to the foreground of damage detection applications, posing a robust and efficient alternative to traditional approaches. At the same time, low-frequency inspection is known to enable global monitoring with waves propagating over large distances. In this paper, a single-sensor complex Group Lasso methodology for structural defect localization by means of compressive sensing and complex low-frequency response functions is presented. The complex Group Lasso methodology is evaluated on composite plates with induced scatterers. An adaptive setting of the methodology is also proposed to further enhance resolution. Results from both approaches are compared with a full-array, super-resolution MUSIC technique based on the same signal model. Both algorithms are shown to demonstrate high and competitive performance.


Subjects
Algorithms
14.
Sensors (Basel) ; 23(1)2022 Dec 21.
Article in English | MEDLINE | ID: mdl-36616640

ABSTRACT

Accurate prediction of aviation safety levels is significant for the efficient early warning and prevention of incidents. However, the causal mechanisms and temporal character of aviation accidents are complex and not fully understood, which increases the cost of accurate aviation safety prediction. This paper adopts an innovative statistical method involving the least absolute shrinkage and selection operator (LASSO) and long short-term memory (LSTM). We compiled and calculated 138 monthly counts of unsafe aviation events collected from the Aviation Safety Reporting System (ASRS) and took minor accidents as the predictor. First, group variables and a weight matrix are introduced into LASSO to realize adaptive variable selection. The selected variables are then fed into a multistep stacked LSTM (MSSLSTM) to predict monthly accidents in 2020. Finally, the proposed method is compared with multiple existing variable selection and prediction methods. The results demonstrate that the root mean square error (RMSE) of the MSSLSTM is reduced by 41.98% compared with the original model; in addition, the key variables selected by the adaptive sparse group lasso (ADSGL) reduce the elapsed time by 42.67% (13 s). This shows that aviation safety prediction based on ADSGL and MSSLSTM can improve the prediction efficiency of the model while keeping excellent generalization ability and robustness.
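A minimal multistep stacked LSTM of the kind referred to as MSSLSTM might look as follows in Keras; the window length, layer sizes, and forecast horizon below are assumptions for illustration, not the paper's configuration.

    # Minimal multistep stacked LSTM sketch (assumed architecture, not the paper's MSSLSTM).
    import numpy as np
    import tensorflow as tf

    window, n_features, horizon = 12, 5, 3    # 12 past months, 5 selected variables, 3-month forecast

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(window, n_features)),
        tf.keras.layers.LSTM(64, return_sequences=True),   # stacked LSTM layers
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(horizon),                     # multistep output
    ])
    model.compile(optimizer="adam", loss="mse")

    X = np.random.default_rng(4).normal(size=(100, window, n_features)).astype("float32")
    y = np.random.default_rng(5).normal(size=(100, horizon)).astype("float32")
    model.fit(X, y, epochs=2, batch_size=16, verbose=0)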


Subjects
Aviation Accidents, Aviation, Accidents, Aviation Accidents/prevention & control
15.
Sensors (Basel) ; 22(15)2022 Aug 03.
Article in English | MEDLINE | ID: mdl-35957349

ABSTRACT

To date, many machine learning models have been used for peach maturity prediction using non-destructive data, but no performance comparison of the models on these datasets has been conducted. In this study, eight machine learning models were trained on a dataset containing data from 180 'Suncrest' peaches. Before the models were trained, the dataset was subjected to dimensionality reduction using the least absolute shrinkage and selection operator (LASSO) regularization, and 8 input variables (out of 29) were chosen. At the same time, a subgroup consisting of the peach ground color measurements was singled out by dividing the set of variables into three subgroups and by using group LASSO regularization. This type of variable subgroup selection provided valuable information on the contribution of specific groups of peach traits to the maturity prediction. The area under the receiver operating characteristic curve (AUC) values of the selected models were compared, and the artificial neural network (ANN) model achieved the best performance, with an average AUC of 0.782. The second-best machine learning model was linear discriminant analysis with an AUC of 0.766, followed by logistic regression, gradient boosting machine, random forest, support vector machines, a classification and regression trees model, and k-nearest neighbors. Although the primary parameter used to determine the performance of the model was AUC, accuracy, F1 score, and kappa served as control parameters and ultimately confirmed the obtained results. By outperforming other models, ANN proved to be the most accurate model for peach maturity prediction on the given dataset.


Subjects
Prunus persica, Logistic Models, Machine Learning, Computer Neural Networks, Support Vector Machine
16.
BMC Bioinformatics ; 22(1): 79, 2021 Feb 19.
Article in English | MEDLINE | ID: mdl-33607943

ABSTRACT

BACKGROUND: Linkage and linkage disequilibrium (LD) between genome regions cause dependencies among genomic markers. Due to family stratification in populations with non-random mating in livestock or crops, standard measures of population LD may be biased. Grouping of markers according to their interdependence needs to account for the actual population structure in order to allow proper inference in genome-based evaluations. RESULTS: Given a matrix reflecting the strength of association between markers, groups are built successively using a greedy algorithm; the largest groups are built first. As an option, a representative marker is selected for each group. We provide an implementation of the grouping approach as a new function in the R package hscovar. This package enables the calculation of the theoretical covariance between biallelic markers for half- or full-sib families and the derivation of representative markers. In case studies, we show that the number of groups comprising dependent markers was smaller and that representative SNPs were spread more uniformly over the investigated chromosome region when family stratification was respected, compared with a population-LD approach. In a simulation study, we observed that the sensitivity and specificity of a genome-based association study improved when the selection of representative markers took family structure into account. CONCLUSIONS: Chromosome segments that frequently recombine in the underlying population can be identified from the matrix of pairwise dependence between markers. Representative markers can be exploited, for instance, for dimension reduction prior to a genome-based association study, or the grouping structure itself can be employed in a grouped penalization approach.
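A generic version of the greedy grouping idea can be sketched as follows (an illustration only, not the hscovar implementation): repeatedly seed a group with the unassigned marker that has the most strong associations, collect its strongly associated unassigned neighbours, and keep the seed as the representative.

    # Greedy grouping of markers from a pairwise association matrix (illustrative sketch).
    import numpy as np

    def greedy_groups(assoc, threshold):
        """Build groups largest-first: each seed collects all unassigned markers whose
        association with it exceeds the threshold; the seed serves as the representative."""
        p = assoc.shape[0]
        unassigned = set(range(p))
        groups, representatives = [], []
        while unassigned:
            # seed: unassigned marker with the most strong links to other unassigned markers
            seed = max(unassigned,
                       key=lambda j: sum(assoc[j, k] >= threshold for k in unassigned if k != j))
            members = [k for k in unassigned if k == seed or assoc[seed, k] >= threshold]
            groups.append(members)
            representatives.append(seed)
            unassigned -= set(members)
        return groups, representatives

    rng = np.random.default_rng(6)
    A = np.abs(rng.normal(size=(8, 8)))
    A = (A + A.T) / 2
    np.fill_diagonal(A, 1.0)
    print(greedy_groups(A, threshold=0.8))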


Subjects
Genome, Genetic Linkage, Genomics, Humans, Linkage Disequilibrium, Single Nucleotide Polymorphism
17.
J Proteome Res ; 20(6): 3204-3213, 2021 06 04.
Article in English | MEDLINE | ID: mdl-34002606

ABSTRACT

Metabolite set enrichment analysis (MSEA) has gained increasing research interest for the identification of perturbed metabolic pathways in metabolomics. The method incorporates predefined metabolic pathway information in the analysis, where metabolite sets are typically assumed to be mutually exclusive. However, metabolic pathways are known to share common metabolites and intermediates. This situation, along with limitations in metabolite detection or coverage, leads to overlapping, incomplete metabolite sets in pathway analysis. For overlapping metabolite sets, MSEA tends to produce high false-positive rates due to improper weights allocated to the overlapping metabolites. Here, we propose an extended partial least squares (PLS) model with a new sparse scheme for overlapping metabolite set enrichment analysis, named overlapping group PLS (ogPLS) analysis. The weight vector of the ogPLS model is decomposed into pathway-specific subvectors, and a group lasso penalty is then imposed on these subvectors to achieve a proper weight allocation for the overlapping metabolites. Two strategies are adopted in the proposed ogPLS model to identify perturbed metabolic pathways. The first is debiasing regularization, used to reduce inequalities among the predefined metabolic pathways. The second is stable selection, used to rank pathways while avoiding the nuisance problems of model parameter optimization. Both simulated and real-world metabolomic datasets were used to evaluate the proposed method and compare it with two other MSEA methods, the Global-test and the multiblock PLS (MB-PLS)-based pathway importance in projection (PIP) method. On a simulated dataset with known perturbed pathways, the average true discovery rate of the ogPLS method was higher than that of the Global-test and MB-PLS-based PIP methods. Analysis of a real-world metabolomics dataset also indicated that the developed method was less prone to selecting pathways with highly overlapping detected metabolite sets. Compared with the two other methods, the proposed method features higher accuracy and a lower false-positive rate, and it is more robust when applied to overlapping metabolite set analysis. The developed ogPLS method may serve as an alternative MSEA method to facilitate the biological interpretation of metabolomics data with overlapping metabolite sets.


Subjects
Metabolic Networks and Pathways, Metabolomics, Least-Squares Analysis
18.
Genet Epidemiol ; 44(5): 408-424, 2020 07.
Article in English | MEDLINE | ID: mdl-32342572

ABSTRACT

Mediation analysis attempts to determine whether the relationship between an independent variable (e.g., exposure) and an outcome variable can be explained, at least partially, by an intermediate variable, called a mediator. Most methods for mediation analysis focus on one mediator at a time, although multiple mediators can be jointly analyzed by structural equation models (SEMs) that account for correlations among the mediators. We extend the use of SEMs for the analysis of multiple mediators by creating a sparse group lasso penalized model whose penalty accounts for the natural groupings of parameters that determine mediation and encourages sparseness of the model parameters. This provides a way to simultaneously evaluate many mediators and select those that have the most impact, a feature of modern penalized models. Simulations are used to illustrate the benefits and limitations of our approach, and application to a study of DNA methylation and reactive cortisol stress following childhood trauma discovered two novel methylation loci that mediate the association of childhood trauma scores with reactive cortisol stress levels. Our new methods are incorporated into R software called regmed.


Subjects
DNA Methylation, Genetic Models, Statistical Models, Software, Child, Computational Biology, Computer Simulation, Humans, Hydrocortisone/metabolism, Wounds and Injuries/metabolism
19.
Biometrics ; 77(4): 1445-1455, 2021 12.
Article in English | MEDLINE | ID: mdl-32914442

ABSTRACT

It is increasingly common clinically for cancer specimens to be examined using techniques that identify somatic mutations. In principle, these mutational profiles can be used to diagnose the tissue of origin, a critical task for the 3% to 5% of tumors that have an unknown primary site. Diagnosis of the primary site is also critical for screening tests that employ circulating DNA. However, most mutations observed in any new tumor are very rarely occurring mutations, and indeed the preponderance of these may never have been observed in any previously recorded tumor. To create a viable diagnostic tool, we need to harness the information content in this "hidden genome" of variants for which no direct information is available. To accomplish this, we propose a multilevel meta-feature regression that extracts the critical information from rare variants in the training data in a way that also permits us to extract diagnostic information from any previously unobserved variants in a new tumor sample. A scalable implementation of the model is obtained by combining high-dimensional feature screening with a group-lasso penalized maximum likelihood approach based on an equivalent mixed-effect representation of the multilevel model. We apply the method to The Cancer Genome Atlas whole-exome sequencing data set comprising 3702 tumor samples across seven common cancer sites. Results show that our multilevel approach can harness substantial diagnostic information from the hidden genome.


Subjects
Neoplasms, Humans, Likelihood Functions, Mutation, Neoplasms/diagnosis, Neoplasms/genetics, Exome Sequencing/methods
20.
Stat Med ; 40(20): 4473-4491, 2021 09 10.
Article in English | MEDLINE | ID: mdl-34031919

ABSTRACT

This article concerns robust modeling of the survival time of cancer patients. Accurate prediction of patient survival time is crucial to the development of effective therapeutic strategies. To this end, we propose a unified Expectation-Maximization approach combined with the L1-norm penalty to perform variable selection and parameter estimation simultaneously in the accelerated failure time model with right-censored survival data of moderate size. Our approach accommodates general loss functions and reduces to the well-known Buckley-James method when the squared-error loss is used without regularization. To mitigate the effects of outliers and heavy-tailed noise in real applications, we recommend the use of robust loss functions under the general framework. Furthermore, our approach can be extended to incorporate group structure among covariates. We conduct extensive simulation studies to assess the performance of the proposed methods with different loss functions and apply them to an ovarian carcinoma study as an illustration.
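In outline, using standard notation rather than the article's own, the accelerated failure time model and the penalized criterion iterated by such an EM scheme take the form

$$
\log T_i = x_i^{\top}\beta + \varepsilon_i, \qquad
\hat\beta = \arg\min_{\beta}\; \sum_{i=1}^{n} L\big(\hat y_i^{*} - x_i^{\top}\beta\big) + \lambda \lVert \beta \rVert_1 ,
$$

where only $y_i = \min(\log T_i, \log C_i)$ and the censoring indicator $\delta_i$ are observed, $\hat y_i^{*}$ is the E-step imputation of the censored responses, and $L$ is a squared-error or robust loss; with squared-error loss and $\lambda = 0$ this reduces to the Buckley-James estimator, and replacing $\lVert \beta \rVert_1$ with a group penalty accommodates grouped covariates.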


Subjects
Computer Simulation, Neoplasms/mortality, Humans, Survival Analysis