1.
BMC Bioinformatics ; 24(1): 399, 2023 Oct 26.
Article in English | MEDLINE | ID: mdl-37884874

ABSTRACT

BACKGROUND: We consider two key problems in genomics involving multiple traits: multi-trait genome-wide association studies (GWAS), where the goal is to detect genetic variants associated with the traits; and multi-trait genomic selection (GS), where the emphasis is on accurately predicting trait values. Multi-trait linear mixed models build on the linear mixed model to jointly model multiple traits. Existing estimation methods, however, are limited to the joint analysis of a small number of genotypes; in fact, most approaches consider one SNP at a time. Estimating multi-dimensional genetic and environment effects also imposes a considerable computational burden. Efficient approaches that incorporate regularization into multi-trait linear models (no random effects) have recently been proposed to identify genomic loci associated with multiple traits (Yu et al. in Multitask learning using task clustering with applications to predictive modeling and GWAS of plant varieties. arXiv:1710.01788, 2017; Yu et al. in Front Big Data 2:27, 2019), but these ignore population structure and familial relatedness (Yu et al. in Nat Genet 38:203-208, 2006). RESULTS: This work addresses this gap by proposing a novel class of regularized multi-trait linear mixed models along with scalable approaches for estimation in the presence of high-dimensional genotypes and a large number of traits. We evaluate the effectiveness of the proposed methods using datasets in maize and sorghum diversity panels, and demonstrate benefits both in achieving high prediction accuracy in GS and in identifying relevant marker-trait associations. CONCLUSIONS: The proposed regularized multivariate linear mixed models are relevant for both GWAS and GS. We hope that they will facilitate agronomy-related research in plant biology and crop breeding endeavors.


Subject(s)
Genome-Wide Association Study , Plant Breeding , Genome-Wide Association Study/methods , Linear Models , Phenotype , Genomics/methods , Agricultural Crops , Single Nucleotide Polymorphism , Genetic Models
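
For orientation, a generic multi-trait linear mixed model with a sparsity penalty can be written as below. This is a textbook-style sketch and not the formulation from the paper: the symbols (Y, X, B, K, V_g, V_e, lambda) and the choice of an L1 penalty are assumptions made for illustration only.

```latex
% n samples, p markers, t traits; all symbols are illustrative assumptions
% Y: n x t trait matrix, X: n x p genotype matrix, B: p x t marker effects
% K: n x n kinship matrix, V_g and V_e: t x t genetic and residual covariances
Y = X B + G + E, \qquad
\operatorname{vec}(G) \sim \mathcal{N}\left(0,\; V_g \otimes K\right), \qquad
\operatorname{vec}(E) \sim \mathcal{N}\left(0,\; V_e \otimes I_n\right)

% penalized estimate of the marker effects (L1 penalty assumed)
\hat{B} = \arg\min_{B}\; -\log L\left(B, V_g, V_e \mid Y, X, K\right) + \lambda \lVert B \rVert_1
```

In this form the random term G carries population structure and relatedness through K, while the penalty lets all markers enter the model jointly rather than one SNP at a time.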
2.
Bioinformatics ; 25(12): i110-8, 2009 Jun 15.
Article in English | MEDLINE | ID: mdl-19477976

ABSTRACT

We consider the problem of discovering gene regulatory networks from time-series microarray data. Recently, graphical Granger modeling has gained considerable attention as a promising direction for addressing this problem. These methods apply graphical modeling methods to time-series data and invoke the notion of 'Granger causality' to make assertions about causality through inference on time-lagged effects. Existing algorithms, however, have neglected an important aspect of the problem: the group structure among the lagged temporal variables naturally imposed by the time series they belong to. Specifically, existing methods in computational biology share this shortcoming, as well as additional computational limitations, which prevent their effective application to large datasets with many genes and many data points. In the present article, we propose a novel methodology which we term 'grouped graphical Granger modeling method', which overcomes the limitations mentioned above by applying a regression method suited for high-dimensional and large data, and by leveraging the group structure among the lagged temporal variables according to the time series they belong to. We demonstrate the effectiveness of the proposed methodology on both simulated and actual gene expression data, specifically the human cancer cell (HeLa S3) cycle data. The simulation results show that the proposed methodology generally exhibits higher accuracy in recovering the underlying causal structure. The results on the gene expression data demonstrate that it leads to improved accuracy with respect to prediction of known links, and also uncovers additional causal relationships not captured by earlier work.


Subject(s)
Computational Biology/methods , Gene Regulatory Networks , Gene Expression Profiling/methods , HeLa Cells , Humans
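
The group structure described in the abstract above can be illustrated with a minimal sketch. It builds lagged predictors in which all lags of one gene form one group, then applies a group-wise shrinkage step that keeps or drops a gene's lags together. The gene names, the function names, and the use of group-lasso-style shrinkage are assumptions for illustration; the abstract does not name the regression method.

```python
import math


def lagged_groups(series, max_lag):
    """Build lagged predictors grouped by their source time series.

    series: dict mapping a (hypothetical) gene name to a list of values.
    Returns (rows, groups). rows[i] is the feature vector for time
    t = max_lag + i; groups[name] lists the column indices that hold
    the lags of that gene.
    """
    names = sorted(series)
    length = len(series[names[0]])
    groups = {}
    col = 0
    for name in names:
        groups[name] = list(range(col, col + max_lag))
        col += max_lag
    rows = []
    for t in range(max_lag, length):
        row = []
        for name in names:
            # Lag 1 first, then lag 2, and so on, for each gene in turn.
            row.extend(series[name][t - lag] for lag in range(1, max_lag + 1))
        rows.append(row)
    return rows, groups


def group_soft_threshold(coefs, threshold):
    """Group-lasso proximal step (assumed method): shrink one group jointly."""
    norm = math.sqrt(sum(c * c for c in coefs))
    if norm <= threshold:
        # The whole group is removed, i.e. no Granger edge from this gene.
        return [0.0] * len(coefs)
    scale = 1.0 - threshold / norm
    return [scale * c for c in coefs]


# Toy data: two hypothetical genes observed at five time points.
rows, groups = lagged_groups(
    {"geneA": [1, 2, 3, 4, 5], "geneB": [5, 4, 3, 2, 1]}, max_lag=2
)
```

Because the threshold acts on the norm of a whole group, a candidate regulator is either retained with all its lags or excluded entirely, which is what makes the selected groups readable as edges in a causal graph.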
3.
Front Big Data ; 2: 27, 2019.
Article in English | MEDLINE | ID: mdl-33693350

ABSTRACT

We consider multi-response and multi-task regression models, where the parameter matrix to be estimated is expected to have an unknown grouping structure. The groupings can be along tasks, or features, or both, the last one indicating a bi-cluster or "checkerboard" structure. Discovering this grouping structure along with parameter inference makes sense in several applications, such as multi-response Genome-Wide Association Studies (GWAS). By inferring this additional structure we can obtain valuable information on the underlying data mechanisms (e.g., relationships among genotypes and phenotypes in GWAS). In this paper, we propose two formulations to simultaneously learn the parameter matrix and its group structures, based on convex regularization penalties. We present optimization approaches to solve the resulting problems and provide numerical convergence guarantees. Extensive experiments demonstrate much better clustering quality compared to other methods, and our approaches are also validated on real datasets concerning phenotypes and genotypes of plant varieties.

4.
Nat Commun ; 9(1): 542, 2018 02 07.
Article in English | MEDLINE | ID: mdl-29416042

ABSTRACT

The metazoan genome is compartmentalized in areas of highly interacting chromatin known as topologically associating domains (TADs). TADs are demarcated by boundaries mostly conserved across cell types and even across species. However, a genome-wide characterization of TAD boundary strength in mammals is still lacking. In this study, we first use fused two-dimensional lasso as a machine learning method to improve Hi-C contact matrix reproducibility, and, subsequently, we categorize TAD boundaries based on their insulation score. We demonstrate that higher TAD boundary insulation scores are associated with elevated CTCF levels and that they may differ across cell types. Intriguingly, we observe that super-enhancers are preferentially insulated by strong boundaries. Furthermore, we demonstrate that strong TAD boundaries and super-enhancer elements are frequently co-duplicated in cancer patients. Taken together, our findings suggest that super-enhancers insulated by strong TAD boundaries may be exploited, as a functional unit, by cancer cells to promote oncogenesis.


Subject(s)
Genetic Enhancer Elements/genetics , Neoplastic Gene Expression Regulation/genetics , Insulator Elements/genetics , Neoplasms/genetics , Animals , CCCTC-Binding Factor , Chromatin , Epigenomics , Humans , Machine Learning
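
As background for the abstract above, the standard two-dimensional fused lasso objective is sketched below. The notation is an assumption: y_ij stands for an observed contact-matrix entry and theta_ij for its smoothed estimate. How the study tunes lambda or preprocesses the Hi-C matrix is not stated in the abstract.

```latex
% y_{ij}: observed contact between bins i and j (assumed notation)
% \theta_{ij}: smoothed estimate; \lambda \ge 0: fusion strength
\hat{\theta} = \arg\min_{\theta}\;
\frac{1}{2} \sum_{i,j} \left( y_{ij} - \theta_{ij} \right)^2
+ \lambda \sum_{i,j} \left(
    \left\lvert \theta_{i+1,j} - \theta_{i,j} \right\rvert
  + \left\lvert \theta_{i,j+1} - \theta_{i,j} \right\rvert
\right)
```

Larger values of lambda pull neighbouring entries toward a common value, which yields piecewise-constant blocks in the matrix.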
5.
J Comput Biol ; 23(5): 372-89, 2016 05.
Article in English | MEDLINE | ID: mdl-27159633

ABSTRACT

Genome-wide association studies have revealed individual genetic variants associated with phenotypic traits such as disease risk and gene expression. However, detecting pairwise interaction effects of genetic variants on traits remains a challenge due to the large number of combinations of variants (~10^11 SNP pairs in the human genome) and relatively small sample sizes (typically <10^4). Despite recent breakthroughs in detecting interaction effects, there are still several open problems, including: (1) how to quickly process a large number of SNP pairs, (2) how to distinguish between true signals and SNPs/SNP pairs merely correlated with true signals, (3) how to detect nonlinear associations between SNP pairs and traits given small sample sizes, and (4) how to control false positives. In this article, we present a unified framework, called SPHINX, which addresses the aforementioned challenges. We first propose a piecewise linear model for interaction detection, because it is simple enough to estimate model parameters given small sample sizes but complex enough to capture nonlinear interaction effects. Then, based on the piecewise linear model, we introduce randomized group lasso under stability selection, and a screening algorithm to address the statistical and computational challenges mentioned above. In our experiments, we first demonstrate that SPHINX achieves better power than existing methods for interaction detection under false positive control. We further applied SPHINX to a late-onset Alzheimer's disease dataset, and report 16 SNPs and 17 SNP pairs associated with gene traits. We also present a highly scalable implementation of our screening algorithm, which can screen ~118 billion candidate associations on a 60-node cluster in <5.5 hours.


Subject(s)
Alzheimer Disease/genetics , Genome-Wide Association Study/methods , Single Nucleotide Polymorphism , Algorithms , Humans , Nonlinear Dynamics , Regression Analysis
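
The scale figures quoted in the abstract above can be checked with a few lines of arithmetic. The panel size of 500,000 SNPs is an assumed, hypothetical value chosen only to show how a pair count on the order of 10^11 arises. The throughput treats 5.5 hours as an upper bound on the run time, so the rates are lower bounds.

```python
from math import comb


def snp_pairs(num_snps):
    """Number of unordered SNP pairs among num_snps variants."""
    return comb(num_snps, 2)


def screening_rate(candidates, hours, nodes):
    """Return (cluster-wide, per-node) candidates screened per second."""
    seconds = hours * 3600
    total = candidates / seconds
    return total, total / nodes


# 500,000 SNPs is an assumption, not a number taken from the abstract.
pairs = snp_pairs(500_000)

# 118 billion candidates, 60 nodes, 5.5 hours: figures from the abstract.
total_rate, per_node_rate = screening_rate(118e9, 5.5, 60)

print(f"{pairs:.3e} SNP pairs")
print(f"{total_rate:,.0f} candidates per second across the cluster")
print(f"{per_node_rate:,.0f} candidates per second per node")
```

Under these assumptions the pair count is about 1.25 x 10^11, and the reported run corresponds to roughly 6 million candidates per second across the cluster, or about 100,000 per second per node.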
6.
PLoS One ; 10(10): e0138903, 2015.
Article in English | MEDLINE | ID: mdl-26439851

ABSTRACT

Accurate prediction of complex traits based on whole-genome data is a computational problem of paramount importance, particularly to plant and animal breeders. However, the number of genetic markers is typically orders of magnitude larger than the number of samples (p >> n), amongst other challenges. We assessed the effectiveness of a diverse set of state-of-the-art methods on publicly accessible real data. The most surprising finding was that approaches with feature selection performed better than others on average, in contrast to the expectation in the community that variable selection is mostly ineffective, i.e. that it does not improve accuracy of prediction, in spite of p >> n. We observed superior performance despite a somewhat simplistic approach to variable selection, possibly suggesting an inherent robustness. This bodes well in general since variable selection methods usually improve interpretability without loss of prediction power. Apart from identifying a set of benchmark data sets (including one simulated data set), we also discuss the performance analysis for each data set in terms of the input characteristics.


Subject(s)
Genetic Markers/genetics , Genetic Models , Quantitative Trait Loci/genetics , Algorithms , Animals , Genome/genetics , Swine , Zea mays/genetics
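
One simple form of variable selection in the p >> n setting is marginal screening: rank markers by the absolute correlation with the trait and keep the top k. The sketch below is an illustrative assumption and not the selection procedure used in the study, which the abstract does not specify. The function names and toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def top_k_markers(genotypes, phenotype, k):
    """Return indices of the k markers most correlated with the trait.

    genotypes: list of samples, each a list of marker values.
    Ties are broken by the lower marker index.
    """
    num_markers = len(genotypes[0])
    scores = []
    for j in range(num_markers):
        column = [row[j] for row in genotypes]
        scores.append((abs(pearson(column, phenotype)), j))
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


# Toy data (hypothetical): 4 samples, 3 markers, marker 0 tracks the trait.
genotypes = [[0, 2, 1], [1, 2, 0], [2, 2, 1], [1, 2, 0]]
phenotype = [1.0, 2.0, 3.0, 2.0]
selected = top_k_markers(genotypes, phenotype, k=1)
```

When such a filter is combined with cross-validation, the ranking should be recomputed inside each training fold; selecting markers on the full data set first leaks information and inflates the estimated prediction accuracy.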
7.
J Bioinform Comput Biol ; 9(2): 231-50, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21523930

ABSTRACT

Many genes and biological processes function in similar ways across different species. Cross-species gene expression analysis, as a powerful tool to characterize the dynamical properties of the cell, has found a number of applications, such as identifying a conserved core set of cell cycle genes. However, to the best of our knowledge, there has been limited effort to develop appropriate techniques for capturing causal relations between genes from time-series microarray data across species. In this paper, we present hidden Markov random field regression with an L1 penalty to uncover the regulatory network structure for different species. The algorithm provides a framework for sharing information across species via hidden component graphs and is able to incorporate domain knowledge across species easily. We demonstrate our method on two synthetic datasets and apply it to discover causal graphs from innate immune response data.


Subject(s)
Gene Regulatory Networks , Genetic Models , Algorithms , Animals , Computational Biology , Computer Graphics , Computer Simulation , Humans , Immunogenetic Phenomena , Markov Chains , Mice , Microarray Analysis/statistics & numerical data , Regression Analysis , Species Specificity