Search | BVS CLAP/SMR-OPAS/OMS

The Benefits of Permutation-Based Genome-Wide Association Studies.

John, Maura; Korte, Arthur; Grimm, Dominik G.

J Exp Bot ; 2024 Jul 02.

Article in English | MEDLINE | ID: mdl-38954539

ABSTRACT

Linear mixed models (LMMs) are commonly used in genome-wide association studies (GWAS), which aim to detect associations between genetic markers and phenotypic measurements in a population of individuals while accounting for population structure and cryptic relatedness. In a standard GWAS, hundreds of thousands to millions of statistical tests are performed, requiring control for multiple hypothesis testing. Typically, static corrections that penalize the number of tests performed are used to control the family-wise error rate, which is the probability of making at least one false-positive discovery. However, it has been shown that in practice this threshold is too conservative for normally distributed phenotypes and not stringent enough for non-normally distributed phenotypes. Therefore, permutation-based LMM approaches have recently been proposed to provide a more realistic threshold that takes phenotypic distributions into account. In this work, we discuss the advantages of permutation-based GWAS approaches, including new simulations and results from a re-analysis of all publicly available Arabidopsis thaliana phenotypes from the AraPheno database.
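The contrast between a static correction and a permutation-based threshold can be illustrated with a small sketch. It is not the authors' LMM approach: it assumes a plain per-marker correlation test with no correction for population structure, a Bonferroni cutoff mapped to the |r| scale through the Fisher z approximation, and a maximum-statistic permutation scheme. All function names and the simulated data are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker or constant phenotype
    return abs(sxy) / math.sqrt(sxx * syy)


def bonferroni_threshold(n, m, alpha=0.05):
    """Static cutoff: alpha / m, expressed on the |r| scale.

    Assumption: uses the Fisher z approximation rather than an exact t-test.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / (2 * m))
    return math.tanh(z_crit / math.sqrt(n - 3))


def permutation_threshold(markers, phenotype, n_perm=200, alpha=0.05, seed=0):
    """Empirical cutoff: (1 - alpha) quantile of the per-permutation max |r|."""
    rng = random.Random(seed)
    y = list(phenotype)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(y)  # shuffling breaks any marker-phenotype association
        maxima.append(max(abs_corr(g, y) for g in markers))
    maxima.sort()
    idx = min(n_perm - 1, math.ceil((1 - alpha) * n_perm) - 1)
    return maxima[idx]


# Simulated toy data (hypothetical): 40 markers, 60 individuals, skewed trait.
rng = random.Random(1)
n, m = 60, 40
markers = [[rng.choice((0, 1, 2)) for _ in range(n)] for _ in range(m)]
skewed = [rng.expovariate(1.0) for _ in range(n)]

print("Bonferroni |r| cutoff:  ", bonferroni_threshold(n, m))
print("Permutation |r| cutoff: ", permutation_threshold(markers, skewed))
```

The static cutoff depends only on the number of tests and the sample size, whereas the permutation cutoff is estimated from the observed phenotype values and the correlation structure among markers, which is why it can adapt to the phenotypic distribution.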

Efficient permutation-based genome-wide association studies for normal and skewed phenotypic distributions.

John, Maura; Ankenbrand, Markus J; Artmann, Carolin; Freudenthal, Jan A; Korte, Arthur; Grimm, Dominik G.

Bioinformatics ; 38(Suppl_2): ii5-ii12, 2022 Sep 16.

Article in English | MEDLINE | ID: mdl-36124808

ABSTRACT

MOTIVATION: Genome-wide association studies (GWAS) are an integral tool for studying the architecture of complex genotype and phenotype relationships. Linear mixed models (LMMs) are commonly used to detect associations between genetic markers and a trait of interest while accounting for population structure and cryptic relatedness. Assumptions of LMMs include a normal distribution of the residuals and that the genetic markers are independent and identically distributed; both assumptions are often violated in real data. Permutation-based methods can help to overcome some of these limitations and provide more realistic thresholds for the discovery of true associations. Still, in practice, they are rarely implemented due to the high computational complexity.

RESULTS: We propose permGWAS, an efficient LMM reformulation based on 4D tensors that can provide permutation-based significance thresholds. We show that our method outperforms current state-of-the-art LMMs with respect to runtime and that permutation-based thresholds have lower false discovery rates for skewed phenotypes compared to the commonly used Bonferroni threshold. Furthermore, using permGWAS we re-analyzed more than 500 Arabidopsis thaliana phenotypes with 100 permutations each in less than 8 days on a single GPU. Our re-analyses suggest that applying a permutation-based threshold can improve and refine the interpretation of GWAS results.

AVAILABILITY AND IMPLEMENTATION: permGWAS is open-source and publicly available on GitHub for download: https://github.com/grimmlab/permGWAS.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Genome-Wide Association Study , Genetic Markers , Genome-Wide Association Study/methods , Genotype , Linear Models , Phenotype

Predicting Gene Regulatory Interactions Using Natural Genetic Variation.

John, Maura; Grimm, Dominik; Korte, Arthur.

Methods Mol Biol ; 2698: 301-322, 2023.

Article in English | MEDLINE | ID: mdl-37682482

ABSTRACT

Genome-wide association studies (GWAS) are a powerful tool to elucidate the genotype-phenotype map. Although GWAS are usually used to assess simple univariate associations between genetic markers and traits of interest, it is also possible to infer the underlying genetic architecture and to predict gene regulatory interactions. In this chapter, we describe the latest methods and tools to perform GWAS by calculating permutation-based significance thresholds. For this purpose, we first provide guidelines on univariate GWAS analyses that are extended in the second part of this chapter to more complex models that enable the inference of gene regulatory networks and how these networks vary.

Subjects

Epistasis, Genetic , Genome-Wide Association Study , Gene Regulatory Networks , Phenotype , Genetic Variation

easyPheno: An easy-to-use and easy-to-extend Python framework for phenotype prediction using Bayesian optimization.

Haselbeck, Florian; John, Maura; Grimm, Dominik G.

Bioinform Adv ; 3(1): vbad035, 2023.

Article in English | MEDLINE | ID: mdl-37066135

ABSTRACT

Summary: Predicting complex traits from genotypic information is a major challenge in various biological domains. With easyPheno, we present a comprehensive Python framework enabling the rigorous training, comparison and analysis of phenotype predictions for a variety of different models, ranging from common genomic selection approaches to classical machine learning and modern deep learning-based techniques. Our framework is easy to use, even for non-programmers, and includes an automatic hyperparameter search using state-of-the-art Bayesian optimization. Moreover, easyPheno provides various benefits for bioinformaticians developing new prediction models. It makes it possible to quickly integrate novel models and functionality into a reliable framework and to benchmark them against the integrated prediction models in a comparable setup. In addition, the framework allows the assessment of newly developed prediction models under pre-defined settings using simulated data. We provide detailed documentation with various hands-on tutorials and videos explaining the usage of easyPheno to novice users.

Availability and implementation: easyPheno is publicly available at https://github.com/grimmlab/easyPheno and can be easily installed as a Python package via https://pypi.org/project/easypheno/ or using Docker. Comprehensive documentation, including various tutorials complemented with videos, can be found at https://easypheno.readthedocs.io/.

Supplementary information: Supplementary data are available at Bioinformatics Advances online.

Superior protein thermophilicity prediction with protein language model embeddings.

Haselbeck, Florian; John, Maura; Zhang, Yuqi; Pirnay, Jonathan; Fuenzalida-Werner, Juan Pablo; Costa, Rubén D; Grimm, Dominik G.

NAR Genom Bioinform ; 5(4): lqad087, 2023 Dec.

Article in English | MEDLINE | ID: mdl-37829176

ABSTRACT

Protein thermostability is important in many areas of biotechnology, including enzyme engineering and protein-hybrid optoelectronics. Ever-growing protein databases and information on stability at different temperatures allow the training of machine learning models to predict whether proteins are thermophilic. In silico predictions could reduce costs and accelerate the development process by guiding researchers to more promising candidates. Existing models for predicting protein thermophilicity rely mainly on features derived from physicochemical properties. Recently, modern protein language models that directly use sequence information have demonstrated superior performance in several tasks. In this study, we evaluate the usefulness of protein language model embeddings for thermophilicity prediction with ProLaTherm, a Protein Language model-based Thermophilicity predictor. ProLaTherm significantly outperforms all feature-, sequence- and literature-based comparison partners on multiple evaluation metrics. In terms of the Matthews correlation coefficient, ProLaTherm outperforms the second-best competitor by 18.1% in a nested cross-validation setup. Using proteins from species not overlapping with species from the training data, ProLaTherm outperforms all competitors by at least 9.7%. On these data, it misclassified only one nonthermophilic protein as thermophilic. Furthermore, it correctly identified 97.4% of all thermophilic proteins in our test set with an optimal growth temperature above 70°C.
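For readers unfamiliar with the Matthews correlation coefficient used as the headline metric above, the following minimal sketch computes it from binary labels. The function name and the example labels are hypothetical and are not taken from the ProLaTherm study; returning 0.0 when the denominator is zero is a common convention assumed here.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = thermophilic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical labels: 3 thermophilic and 5 non-thermophilic proteins.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(matthews_corrcoef(y_true, y_pred))
```

Unlike plain accuracy, the coefficient uses all four cells of the confusion matrix, so it stays informative when the two classes are imbalanced.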

A comparison of classical and machine learning-based phenotype prediction methods on simulated data and three plant species.

John, Maura; Haselbeck, Florian; Dass, Rupashree; Malisi, Christoph; Ricca, Patrizia; Dreischer, Christian; Schultheiss, Sebastian J; Grimm, Dominik G.

Front Plant Sci ; 13: 932512, 2022.

Article in English | MEDLINE | ID: mdl-36407627

ABSTRACT

Genomic selection is an integral tool for breeders to accurately select plants directly from genotype data, leading to faster and more resource-efficient breeding programs. Several prediction methods have been established in the last few years. These range from classical linear mixed models to complex non-linear machine learning approaches, such as Support Vector Regression, and modern deep learning-based architectures. Many of these methods have been extensively evaluated on different crop species with varying outcomes. In this work, our aim is to systematically compare 12 different phenotype prediction models, ranging from basic genomic selection methods to more advanced deep learning-based techniques. More importantly, we assess the performance of these models on simulated phenotype data as well as on real-world data from Arabidopsis thaliana and two breeding datasets from soy and corn. The synthetic phenotypic data allow us to analyze all prediction models and especially the selected markers under controlled and predefined settings. We show that Bayes B and linear regression models with sparsity constraints perform best under different simulation settings with respect to explained variance. Further, we can confirm results from other studies that there is no superiority of more complex neural network-based architectures for phenotype prediction compared to well-established methods. However, on real-world data, for which several prediction models yield comparable results with slight advantages for Elastic Net, this picture is less clear, suggesting that there is a lot of room for future research.
