Results 1 - 20 of 100
1.
Am J Hum Genet ; 111(7): 1431-1447, 2024 07 11.
Article in English | MEDLINE | ID: mdl-38908374

ABSTRACT

Methods of estimating polygenic scores (PGSs) from genome-wide association studies are increasingly utilized. However, independent method evaluation is lacking, and method comparisons are often limited. Here, we evaluate polygenic scores derived via seven methods in five biobank studies (totaling about 1.2 million participants) across 16 diseases and quantitative traits, building on a reference-standardized framework. We conducted meta-analyses to quantify the effects of method choice, hyperparameter tuning, method ensembling, and the target biobank on PGS performance. We found that no single method consistently outperformed all others. PGS effect sizes were more variable between biobanks than between methods within biobanks when methods were well tuned. Differences between methods were largest for the two investigated autoimmune diseases, seropositive rheumatoid arthritis and type 1 diabetes. For most methods, cross-validation was more reliable for tuning hyperparameters than automatic tuning (without the use of target data). For a given target phenotype, elastic net models combining PGS across methods (ensemble PGS) tuned in the UK Biobank provided consistent, high, and cross-biobank transferable performance, increasing PGS effect sizes (β coefficients) by a median of 5.0% relative to LDpred2 and MegaPRS (the two best-performing single methods when tuned with cross-validation). Our interactively browsable online results and open-source workflow prspipe provide a rich resource and reference for the analysis of polygenic scoring methods across biobanks.


Subjects
Biological Specimen Banks , Genome-Wide Association Study , Multifactorial Inheritance , Humans , Multifactorial Inheritance/genetics , Phenotype , Diabetes Mellitus, Type 1/genetics , Polymorphism, Single Nucleotide , Machine Learning
2.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38706320

ABSTRACT

The advent of rapid whole-genome sequencing has created new opportunities for computational prediction of antimicrobial resistance (AMR) phenotypes from genomic data. Both rule-based and machine learning (ML) approaches have been explored for this task, but systematic benchmarking is still needed. Here, we evaluated four state-of-the-art ML methods (Kover, PhenotypeSeeker, Seq2Geno2Pheno and Aytan-Aktug), an ML baseline and the rule-based ResFinder by training and testing each of them across 78 species-antibiotic datasets, using a rigorous benchmarking workflow that integrates three evaluation approaches, each paired with three distinct sample splitting methods. Our analysis revealed considerable variation in the performance across techniques and datasets. Whereas ML methods generally excelled for closely related strains, ResFinder excelled at handling divergent genomes. Overall, Kover most frequently ranked top among the ML approaches, followed by PhenotypeSeeker and Seq2Geno2Pheno. AMR phenotypes for antibiotic classes such as macrolides and sulfonamides were predicted with the highest accuracies. The quality of predictions varied substantially across species-antibiotic combinations, particularly for beta-lactams; across species, resistance phenotyping of the beta-lactam compounds aztreonam, amoxicillin/clavulanic acid, cefoxitin, ceftazidime and piperacillin/tazobactam, alongside tetracyclines, demonstrated more variable performance than the other benchmarked antibiotics. By organism, Campylobacter jejuni and Enterococcus faecium phenotypes were more robustly predicted than those of Escherichia coli, Staphylococcus aureus, Salmonella enterica, Neisseria gonorrhoeae, Klebsiella pneumoniae, Pseudomonas aeruginosa, Acinetobacter baumannii, Streptococcus pneumoniae and Mycobacterium tuberculosis. In addition, our study provides software recommendations for each species-antibiotic combination. It furthermore highlights the need for optimization for robust clinical applications, particularly for strains that diverge substantially from those used for training.


Subjects
Anti-Bacterial Agents , Phenotype , Anti-Bacterial Agents/pharmacology , Machine Learning , Drug Resistance, Bacterial/genetics , Computational Biology/methods , Genome, Bacterial , Genome, Microbial , Humans , Bacteria/genetics , Bacteria/drug effects
3.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38349060

ABSTRACT

The recent development of deep learning methods has undoubtedly led to great improvement in various machine learning tasks, especially in prediction tasks. These methods have also been adapted to address various problems in bioinformatics, including automatic genome annotation, artificial genome generation or phenotype prediction. In particular, a specific type of deep learning method, the graph neural network (GNN), has repeatedly been reported as a good candidate to predict phenotypes from gene expression because of its ability to embed information on gene regulation or co-expression through the use of a gene network. However, to date, no complete and reproducible benchmark has been performed to analyze the trade-off between cost and benefit of this approach compared to more standard (and simpler) machine learning methods. In this article, we provide such a benchmark, based on clear and comparable policies to evaluate the different methods on several datasets. Our conclusion is that GNN rarely provides a real improvement in prediction performance, especially when compared to the computation effort required by the methods. Our findings on a limited but controlled simulated dataset show that this could be explained by the limited quality or predictive power of the input biological gene network itself.


Subjects
Gene Expression Profiling , Transcriptome , Benchmarking , Computational Biology , Neural Networks, Computer
4.
Brief Bioinform ; 25(5)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39126426

ABSTRACT

Navigating the complex landscape of high-dimensional omics data with machine learning models presents a significant challenge. The integration of biological domain knowledge into these models has shown promise in creating more meaningful stratifications of predictor variables, leading to algorithms that are both more accurate and generalizable. However, the wider availability of machine learning tools capable of incorporating such biological knowledge remains limited. Addressing this gap, we introduce BioM2, a novel R package designed for biologically informed multistage machine learning. BioM2 uniquely leverages biological information to effectively stratify and aggregate high-dimensional biological data in the context of machine learning. Demonstrating its utility with genome-wide DNA methylation and transcriptome-wide gene expression data, BioM2 has been shown to enhance predictive performance, surpassing traditional machine learning models that operate without the integration of biological knowledge. A key feature of BioM2 is its ability to rank predictor variables within biological categories, specifically Gene Ontology pathways. This functionality not only aids in the interpretability of the results but also enables a subsequent modular network analysis of these variables, shedding light on the intricate systems-level biology underpinning the predictive outcome. We have proposed a biologically informed multistage machine learning framework termed BioM2 for phenotype prediction based on omics data. BioM2 has been incorporated into the BioM2 CRAN package (https://cran.r-project.org/web/packages/BioM2/index.html).


Subjects
Machine Learning , Phenotype , Humans , DNA Methylation , Algorithms , Computational Biology/methods , Software , Transcriptome , Genomics/methods
5.
Annu Rev Genomics Hum Genet ; 23: 591-612, 2022 08 31.
Article in English | MEDLINE | ID: mdl-35440148

ABSTRACT

Ancient DNA provides a powerful window into the biology of extant and extinct species, including humans' closest relatives: Denisovans and Neanderthals. Here, we review what is known about archaic hominin phenotypes from genomic data and how those inferences have been made. We contend that understanding the influence of variants on lower-level molecular phenotypes, such as gene expression and protein function, is a promising approach to using ancient DNA to learn about archaic hominin traits. Molecular phenotypes have simpler genetic architectures than organism-level complex phenotypes, and this approach enables moving beyond association studies by proposing hypotheses about the effects of archaic variants that are testable in model systems. The major challenge to understanding archaic hominin phenotypes is broadening our ability to accurately map genotypes to phenotypes, but ongoing advances ensure that there will be much more to learn about archaic hominin phenotypes from their genomes.


Subjects
Hominidae , Neanderthals , Animals , DNA, Ancient , Genome, Human , Genomics , Hominidae/genetics , Humans , Neanderthals/genetics , Phenotype
6.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34676389

ABSTRACT

The employment of doubled-haploid (DH) technology in maize has vastly accelerated the efficiency of developing inbred lines. The selection of superior lines has to rely on genotypes with a genomic selection (GS) model, rather than phenotypes, due to the high expense of field phenotyping. In this work, we implemented 'genome optimization via virtual simulation (GOVS)' using the genotype and phenotype data of 1404 maize lines and their F1 progeny. GOVS simulates a virtual genome encompassing the most abundant 'optimal genotypes' or 'advantageous alleles' in a genetic pool. Such a virtually optimized genome, although it can never be developed in reality, may help plot the optimal route to direct breeding decisions. GOVS assists in the selection of superior lines based on the genomic fragments that a line contributes to the simulated genome. The assumption is that the more fragments of optimal genotypes a line contributes to the assembly, the higher the likelihood that the line is favored in the F1 phenotype, e.g. grain yield. Compared to the traditional GS method, GOVS-assisted selection may avoid using an arbitrary threshold for the predicted F1 yield to assist selection. Additionally, the selected lines contributed complementary sets of advantageous alleles to the virtual genome. This feature facilitates plotting the optimal route for DH production, whereby the fewest lines and F1 combinations are needed to pyramid a maximum number of advantageous alleles in the new DH lines. In summary, incorporation of DH production, GS and genome optimization will ultimately improve genomically designed breeding in maize. Short abstract: Doubled-haploid (DH) technology has been widely applied in the maize breeding industry, as it greatly shortens the period of developing homozygous inbred lines by bypassing several rounds of self-crossing. The current challenge is how to efficiently screen the large volume of inbred lines based on genotypes. We present the toolbox of genome optimization via virtual simulation (GOVS), which complements the traditional genomic selection model. GOVS simulates a virtual genome encompassing the most abundant 'optimal genotypes' in a breeding population, and then assists in selection of superior lines based on the genomic fragments that a line contributes to the simulated genome. Availability of GOVS (https://govs-pack.github.io/) to the public may ultimately facilitate genomically designed breeding in maize.
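
The idea of counting how many fragments each line contributes to a virtual genome can be sketched as follows. This is only an illustration under stated assumptions, not the GOVS implementation: the abstract does not say how the 'optimal genotype' of a fragment is chosen, so the sketch assumes each line already has a hypothetical per-fragment effect estimate and that the line with the largest estimate supplies that fragment. The function name and the line labels are invented for the example.

```python
from collections import Counter

def count_contributions(fragment_effects):
    """Count how many fragments of the virtual genome each line supplies.

    fragment_effects maps a line name to a list of per-fragment effect
    estimates (hypothetical inputs). For every fragment, the line with the
    largest estimate is treated as the donor of the 'optimal genotype'.
    Ties go to the alphabetically first line.
    """
    lines = sorted(fragment_effects)
    n_fragments = len(fragment_effects[lines[0]])
    contributions = Counter({line: 0 for line in lines})
    for k in range(n_fragments):
        donor = max(lines, key=lambda line: fragment_effects[line][k])
        contributions[donor] += 1
    return contributions

# Hypothetical effect estimates for three lines across three fragments.
effects = {
    "line1": [0.3, 0.1, 0.5],
    "line2": [0.2, 0.4, 0.1],
    "line3": [0.1, 0.2, 0.2],
}
contributions = count_contributions(effects)
print(contributions.most_common())
```

Under this assumption, lines with higher counts would be the candidates for selection, and lines whose donated fragments do not overlap would be complementary.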


Subjects
Plant Breeding , Zea mays , Genotype , Haploidy , Phenotype , Plant Breeding/methods , Zea mays/genetics
7.
Brief Bioinform ; 23(3)2022 05 13.
Article in English | MEDLINE | ID: mdl-35325021

ABSTRACT

Prediction of antimicrobial resistance based on whole-genome sequencing data has attracted greater attention due to its rapidity and convenience. Numerous machine learning-based studies have used genetic variants to predict drug resistance in Mycobacterium tuberculosis (MTB), assuming that variants are homogeneous. However, most of these studies have ignored the essential correlation between variants and corresponding genes when encoding variants, and have used a limited number of variants as prediction input. In this study, taking advantage of genome-wide variants for drug-resistance prediction and inspired by natural language processing, we recast drug resistance prediction as document classification, in which variants are considered as words, mutated genes in an isolate as sentences, and an isolate as a document. We propose a novel hierarchical attentive neural network model (HANN) that helps discover drug resistance-related genes and variants and acquire more interpretable biological results. It captures the interaction among variants in a mutated gene as well as among mutated genes in an isolate. Our results show that for the four first-line drugs of isoniazid (INH), rifampicin (RIF), ethambutol (EMB) and pyrazinamide (PZA), the HANN achieves the optimal area under the ROC curve of 97.90, 99.05, 96.44 and 95.14% and the optimal sensitivity of 94.63, 96.31, 92.56 and 87.05%, respectively. In addition, without any domain knowledge, the model identifies drug resistance-related genes and variants consistent with those confirmed by previous studies, and more importantly, it discovers one more potential drug-resistance-related gene.
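
The words/sentences/document analogy can be pictured as a nested data structure. The sketch below only illustrates that input encoding and is not the authors' HANN code; the helper `build_document`, the gene names and the variant labels are all hypothetical.

```python
from collections import defaultdict

def build_document(variant_calls):
    """Group (gene, variant) calls from one isolate into gene-level 'sentences'.

    Variants play the role of words, each mutated gene is a sentence, and the
    returned mapping for the whole isolate is the document.
    """
    sentences = defaultdict(list)
    for gene, variant in variant_calls:
        sentences[gene].append(variant)
    # Sort variants within each gene so the encoding is deterministic.
    return {gene: sorted(variants) for gene, variants in sentences.items()}

# Hypothetical variant calls for a single isolate (labels are illustrative).
calls = [("geneA", "v2"), ("geneB", "v1"), ("geneA", "v1")]
document = build_document(calls)
print(document)
```

A hierarchical model would then attend first over the variants inside each gene and then over the genes inside the isolate, which is what makes gene-level and variant-level importance scores available.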


Subjects
Mycobacterium tuberculosis , Antitubercular Agents/pharmacology , Antitubercular Agents/therapeutic use , Drug Resistance , Microbial Sensitivity Tests , Mutation , Neural Networks, Computer
8.
Arch Biochem Biophys ; 755: 109979, 2024 May.
Article in English | MEDLINE | ID: mdl-38583654

ABSTRACT

Although protein sequences encode the information for folding and function, understanding their link is not an easy task. Unfortunately, the prediction of how specific amino acids contribute to these features is still considerably impaired. Here, we developed a simple algorithm that finds positions in a protein sequence with potential to modulate the studied quantitative phenotypes. From a few hundred protein sequences, we perform multiple sequence alignments, obtain the per-position pairwise differences for both the sequence and the observed phenotypes, and calculate the correlation between these last two quantities. We tested our methodology with four cases: archaeal Adenylate Kinases and the organisms' optimal growth temperatures, microbial rhodopsins and their maximal absorption wavelengths, mammalian myoglobins and their muscular concentration, and inhibition of HIV protease clinical isolates by two different molecules. We found from 3 to 10 positions tightly associated with those phenotypes, depending on the studied case. We showed that these correlations appear using individual positions but an improvement is achieved when the most correlated positions are jointly analyzed. Notably, we performed phenotype predictions using a simple linear model that links per-position divergences and differences in the observed phenotypes. Predictions are comparable to the state-of-the-art methodologies which, in most of the cases, are far more complex. All of the calculations are obtained at a very low information cost since the only input needed is a multiple sequence alignment of protein sequences with their associated quantitative phenotypes. The diversity of the explored systems makes our work a valuable tool to find sequence determinants of biological activity modulation and to predict various functional features for uncharacterized members of a protein family.
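
The per-position procedure described in this abstract can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' code: sequence difference at a position is taken as a 0/1 mismatch between two aligned residues, phenotype difference as the absolute difference, and the correlation as Pearson's r. The toy alignment and phenotype values are invented.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def position_scores(alignment, phenotypes):
    """Correlate per-position pairwise mismatches with phenotype differences."""
    pairs = list(combinations(range(len(alignment)), 2))
    pheno_diff = [abs(phenotypes[i] - phenotypes[j]) for i, j in pairs]
    scores = []
    for pos in range(len(alignment[0])):
        seq_diff = [0.0 if alignment[i][pos] == alignment[j][pos] else 1.0
                    for i, j in pairs]
        scores.append(pearson(seq_diff, pheno_diff))
    return scores

# Toy aligned sequences with one quantitative phenotype per sequence.
alignment = ["AKL", "AKV", "GKL", "GKV"]
phenotypes = [10.0, 11.0, 30.0, 31.0]
scores = position_scores(alignment, phenotypes)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores)
```

In the toy data the first column separates the low-phenotype from the high-phenotype sequences, so it receives the highest score, while the fully conserved second column scores zero.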

9.
Hum Genomics ; 17(1): 34, 2023 03 31.
Article in English | MEDLINE | ID: mdl-37004080

ABSTRACT

BACKGROUND: Phenylketonuria (PKU) is caused by mutations in the phenylalanine hydroxylase (PAH) gene. Our study aimed to predict the phenotype using the allelic genotype. METHODS: A total of 1291 PKU patients with 623 various variants were used as the training dataset for predicting allelic phenotypes. We designed a common machine learning framework to predict allelic genotypes associated with the phenotype. RESULTS: We identified 235 different mutations and 623 various allelic genotypes. The features extracted from the structure of mutations and graph properties of the PKU network to predict the phenotype of PKU were named PPML (PKU phenotype predicted by machine learning). The phenotype of PKU was classified into three different categories: classical PKU (cPKU), mild PKU (mPKU) and mild hyperphenylalaninemia (MHP). Three hub nodes (c.728G>A for cPKU, c.721 for mPKU and c.158G>A for HPA) were used as each classification center, and 5 node attributes were extracted from the network graph for machine learning training features. The area under the ROC curve was AUC = 0.832 for cPKU, AUC = 0.678 for mPKU and AUC = 0.874 for MHP. This suggests that PPML is a powerful method to predict allelic phenotypes in PKU and can be used for genetic counseling of PKU families. CONCLUSIONS: The web version of PPML predicts PKU allele classification supported by applicable real cases and prediction results. It is an online database that can be used for PKU phenotype prediction http://www.bioinfogenetics.info/PPML/ .


Subjects
Phenylalanine Hydroxylase , Phenylketonurias , Humans , Alleles , Phenylketonurias/diagnosis , Phenylketonurias/genetics , Phenotype , Phenylalanine Hydroxylase/genetics , Genotype , Mutation
10.
Int J Legal Med ; 138(2): 627-637, 2024 Mar.
Article in English | MEDLINE | ID: mdl-37934208

ABSTRACT

Forensic entomological evidence is employed to estimate minimum postmortem interval (PMImin), location, and identification of fly samples or human remains. Traditional forensic DNA analysis (i.e., STR, mitochondrial DNA) has been used for human identification from the larval gut contents. Forensic DNA phenotyping (FDP), predicting human appearance from DNA-based crime scene evidence, has become an established approach in forensic genetics in recent years. In this study, we aimed to recover human DNA from Lucilia sericata (Meigen 1826) (Diptera: Calliphoridae) gut contents and predict the eye and hair color of individuals using the HIrisPlex system. Lucilia sericata larvae and reference blood samples were collected from 30 human volunteers who were under maggot debridement therapy. The human DNA was extracted from the crop contents and quantified. HIrisPlex multiplex analysis was performed using the SNaPshot minisequencing procedure. The HIrisPlex online tool was used to assess the prediction of the eye and hair color of the larval and reference samples. We successfully genotyped 25 out of 30 larval samples, and most SNP genotypes (87.13%) matched those of reference samples, though some alleles dropped out, producing partial profiles. The prediction of the eye colors was accurate in 17 out of 25 larval samples, and only one sample was misclassified. Fourteen out of 25 larval samples were correctly predicted for hair color, and eight were misclassified. This study shows that SNP analysis of L. sericata gut contents can be used to predict eye and hair color of a corpse.


Subjects
Diptera , Hair Color , Animals , Humans , Larva/genetics , Diptera/genetics , Genotype , DNA, Mitochondrial/genetics , Eye Color/genetics
11.
Article in English | MEDLINE | ID: mdl-39180381

ABSTRACT

This study investigated the patterns of change in fecal microorganisms induced by dietary fiber in Landrace × Large White × Duroc (DLY) and Diqing Tibetan pigs (TP), and further explored the buffering effect of different intestinal flora structures on dietary stress. DLY (n = 15) and TP (n = 15) were divided into two treatments. Then, a diet with 20% neutral detergent fiber (NDF) was supplemented for 9 days. Our results showed that the feed conversion efficiency of TP was significantly higher (p < 0.05) than that of DLY. The fecal microorganisms shared by the two groups gradually increased with the feeding cycle. In addition, the dispersion of the Shannon, Simpson, ACE and Chao indices of TP decreased. Also, we found that the fecal microorganisms of TP (R2 = 0.2089, p < 0.01) and DLY (R2 = 0.3982, p < 0.01) showed significant differences in different feeding cycles. With the prolongation of the feeding cycle, the similarity of fecal microbial composition between DLY and TP increased. Our study strongly suggests that the complex environment and diet structure have shaped the unique gut microbiota of TP, which plays a vital role in the buffering effect of high-fiber diets.

12.
Am J Hum Genet ; 107(1): 46-59, 2020 07 02.
Article in English | MEDLINE | ID: mdl-32470373

ABSTRACT

In complex trait genetics, the ability to predict phenotype from genotype is the ultimate measure of our understanding of genetic architecture underlying the heritability of a trait. A complete understanding of the genetic basis of a trait should allow for predictive methods with accuracies approaching the trait's heritability. The highly polygenic nature of quantitative traits and most common phenotypes has motivated the development of statistical strategies focused on combining myriad individually non-significant genetic effects. Now that predictive accuracies are improving, there is a growing interest in the practical utility of such methods for predicting risk of common diseases responsive to early therapeutic intervention. However, existing methods require individual-level genotypes or depend on accurately specifying the genetic architecture underlying each disease to be predicted. Here, we propose a polygenic risk prediction method that does not require explicitly modeling any underlying genetic architecture. We start with summary statistics in the form of SNP effect sizes from a large GWAS cohort. We then remove the correlation structure across summary statistics arising due to linkage disequilibrium and apply a piecewise linear interpolation on conditional mean effects. In both simulated and real datasets, this new non-parametric shrinkage (NPS) method can reliably allow for linkage disequilibrium in summary statistics of 5 million dense genome-wide markers and consistently improves prediction accuracy. We show that NPS improves the identification of groups at high risk for breast cancer, type 2 diabetes, inflammatory bowel disease, and coronary heart disease, all of which have available early intervention or prevention treatments.
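
For readers unfamiliar with how summary-statistic effect sizes become a prediction, the basic polygenic score is a weighted sum of allele dosages. The sketch below shows only that final step and deliberately omits the decorrelation and non-parametric shrinkage that define NPS; the effect sizes and dosages are made-up values and `polygenic_score` is a hypothetical helper name.

```python
def polygenic_score(effect_sizes, dosages):
    """Weighted sum of allele dosages (0, 1 or 2) by per-SNP effect sizes."""
    if len(effect_sizes) != len(dosages):
        raise ValueError("effect_sizes and dosages must have the same length")
    return sum(beta * g for beta, g in zip(effect_sizes, dosages))

# Made-up effect sizes for three SNPs and one individual's allele dosages.
effect_sizes = [0.5, -0.25, 0.125]
dosages = [2, 1, 0]
score = polygenic_score(effect_sizes, dosages)
print(score)
```

A method such as NPS would replace the raw `effect_sizes` with adjusted weights before this sum is taken.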


Subjects
Multifactorial Inheritance/genetics , Aged , Cohort Studies , Diabetes Mellitus, Type 2/genetics , Female , Genome-Wide Association Study/methods , Genotype , Humans , Linkage Disequilibrium/genetics , Male , Middle Aged , Models, Genetic , Phenotype , Polymorphism, Single Nucleotide/genetics , Quantitative Trait Loci/genetics
13.
Brief Bioinform ; 22(1): 3-19, 2021 01 18.
Article in English | MEDLINE | ID: mdl-31813950

ABSTRACT

The increasing ease with which massive genetic information can be obtained from patients or healthy individuals has stimulated the development of interpretive bioinformatics tools as aids in clinical practice. Most such tools analyze evolutionary information and simple physical-chemical properties to predict whether replacement of one amino acid residue with another will be tolerated or cause disease. Those approaches achieve up to 80-85% accuracy as binary classifiers (neutral/pathogenic). As such accuracy is insufficient to base medical decisions on, and it does not appear to be increasing, more precise methods, such as full-atom molecular dynamics (MD) simulations in explicit solvent, are also discussed. Then, to describe the goal of interpreting human genetic variations at large scale through MD simulations, we restrictively refer to all possible protein variants carrying single-amino-acid substitutions arising from single-nucleotide variations as the human variome. We calculate its size and develop a simple model that allows calculating the simulation time needed to have a 0.99 probability of observing unfolding events of any unstable variant. The knowledge of that time enables performing a binary classification of the variants (stable-potentially neutral/unstable-pathogenic). Our model indicates that the human variome cannot be simulated with present computing capabilities. However, if they continue to increase as per Moore's law, it could be simulated (at 65°C) spending only 3 years in the task if we started in 2031. The simulation of individual protein variomes is achievable in short times starting at present. International coordination seems appropriate to embark upon massive MD simulations of protein variants.
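
The abstract does not spell out the model behind the 0.99 figure. One common simplifying assumption, used here purely for illustration, is first-order (exponential) unfolding kinetics: if an unstable variant has mean unfolding time tau, the probability of seeing it unfold within simulated time t is 1 - exp(-t / tau), so t = -tau * ln(1 - p). The function name and the unit-free example are hypothetical.

```python
import math

def time_to_observe(mean_unfolding_time, probability=0.99):
    """Simulation time needed to observe unfolding with the given probability.

    Assumes exponential waiting times (first-order kinetics), which is an
    illustrative assumption and not necessarily the model used in the paper.
    """
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return -mean_unfolding_time * math.log(1.0 - probability)

# With tau expressed in arbitrary time units, the required time is tau * ln(100).
t = time_to_observe(1.0)
print(round(t, 3))
```

Under this assumption the required simulation length is about 4.6 times the mean unfolding time, whatever the time unit.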


Subjects
Genomics/methods , Molecular Dynamics Simulation , Protein Interaction Mapping/methods , Sequence Analysis, Protein/methods , Amino Acid Substitution , Animals , Genomics/trends , Humans
14.
Methods ; 205: 11-17, 2022 09.
Article in English | MEDLINE | ID: mdl-35636652

ABSTRACT

Microorganisms play important roles in our lives, especially in metabolism and disease. Determining the probability of a person suffering from specific diseases, and the severity of the disease, based on microbial genes is crucial for understanding the relationship between microbes and diseases. Previous work could extract the topological information of phylogenetic trees and integrate it into metagenomic datasets, enabling classifiers to learn more information from limited datasets and thus improving the performance of the models. In this paper, we proposed a GNPI model to better learn the structure of phylogenetic trees. GNPI maintained the original vector format of metagenomic datasets, while previous research had to change the input form to matrices. The vector-like form of the input data can be easily adopted in the baseline machine learning models and is available for deep learning models. The datasets processed with GNPI help enhance the accuracy of machine learning and deep learning models in three different datasets. GNPI is an interpretable data processing method for host phenotype prediction and other bioinformatics tasks.


Subjects
Metagenome , Metagenomics , Humans , Machine Learning , Metagenomics/methods , Phenotype , Phylogeny
15.
BMC Bioinformatics ; 23(1): 125, 2022 Apr 09.
Article in English | MEDLINE | ID: mdl-35397517

ABSTRACT

BACKGROUND: The accurate prediction of biological features from genomic data is paramount for precision medicine and sustainable agriculture. For decades, neural network models have been widely popular in fields like computer vision, astrophysics and targeted marketing given their prediction accuracy and their robust performance under big data settings. Yet neural network models have not made a successful transition into the medical and biological world due to the ubiquitous characteristics of biological data such as modest sample sizes, sparsity, and extreme heterogeneity. RESULTS: Here, we investigate the robustness, generalization potential and prediction accuracy of widely used convolutional neural network and natural language processing models with a variety of heterogeneous genomic datasets. Mainly, recurrent neural network models outperform convolutional neural network models in terms of prediction accuracy, overfitting and transferability across the datasets under study. CONCLUSIONS: While the prospect of a robust out-of-the-box neural network model is out of reach, we identify certain model characteristics that translate well across datasets and could serve as a baseline model for translational researchers.


Subjects
Big Data , Neural Networks, Computer , Genomics , Natural Language Processing
16.
BMC Bioinformatics ; 23(1): 262, 2022 Jul 03.
Article in English | MEDLINE | ID: mdl-35786378

ABSTRACT

BACKGROUND: Machine learning is now a standard tool for cancer prediction based on gene expression data. However, deep learning is still new for this task, and there is no clear consensus about its performance and utility. Few experimental works have evaluated deep neural networks and compared them with state-of-the-art machine learning. Moreover, their conclusions are not consistent. RESULTS: We extensively evaluate the deep learning approach on 22 cancer prediction tasks based on gene expression data. We measure the impact of the main hyper-parameters and compare the performances of neural networks with the state-of-the-art. We also investigate the effectiveness of several transfer learning schemes in different experimental setups. CONCLUSION: Based on our experimentations, we provide several recommendations to optimize the construction and training of a neural network model. We show that neural networks outperform the state-of-the-art methods only for very large training set size. For a small training set, we show that transfer learning is possible and may strongly improve the model performance in some cases.


Subjects
Deep Learning , Neoplasms , Gene Expression , Humans , Machine Learning , Neoplasms/genetics , Neural Networks, Computer
17.
BMC Plant Biol ; 22(1): 275, 2022 Jun 06.
Article in English | MEDLINE | ID: mdl-35658831

ABSTRACT

BACKGROUND: Predicting the phenotype from the genotype is one of the major contemporary challenges in biology. This challenge is greater in plants because their development occurs mostly post-embryonically under diurnal and seasonal environmental fluctuations. Most current crop simulation models are physiology-based models capable of capturing environmental fluctuations but cannot adequately capture genotypic effects because they were not constructed within a genetics framework. RESULTS: We describe the construction of a mixed-effects dynamic model to predict time-to-flowering in the common bean (Phaseolus vulgaris L.). This prediction model applies the developmental approach used by traditional crop simulation models, uses direct observational data, and captures the Genotype, Environment, and Genotype-by-Environment effects to predict progress towards time-to-flowering in real time. Comparisons to a traditional crop simulation model and to a previously developed static model show the advantages of the new dynamic model. CONCLUSIONS: The dynamic model can be applied to other species and to different plant processes. These types of models can, in modular form, gradually replace plant processes in existing crop models as has been implemented in BeanGro, a crop simulation model within the DSSAT Cropping Systems Model. Gene-based dynamic models can accelerate precision breeding of diverse crop species, particularly with the prospects of climate change. Finally, a gene-based simulation model can assist policy decision makers in matters pertaining to prediction of food supplies.


Subjects
Phaseolus , Plant Breeding , Computer Simulation , Genotype , Phaseolus/genetics , Phenotype
18.
IUBMB Life ; 74(12): 1273-1287, 2022 12.
Article in English | MEDLINE | ID: mdl-36345613

ABSTRACT

Predicting phenotypes and complex traits from genomic variations has always been a big challenge in molecular biology, at least in part because the task is often complicated by the influences of external stimuli and the environment on regulation of gene expression. With today's abundance of omic data and advances in high-throughput computing and machine learning (ML), we now have an unprecedented opportunity to uncover the missing links and molecular mechanisms that control gene expression and phenotypes. To empower molecular biologists and researchers in related fields to start using ML for in-depth analyses of their large-scale data, here we provide a summary of fundamental concepts of machine learning, and describe a wide range of research questions and scenarios in molecular biology where ML has been implemented. Due to the abundance of data, reproducibility, and genome-wide coverage, we focus on transcriptomics, and two ML tasks involving it: (a) predicting of transcriptomic profiles or transcription levels from genomic variations in DNA, and (b) predicting phenotypes of interest from transcriptomic profiles or transcription levels. Similar approaches can also be applied to more complex data such as those in multi-omic studies. We envisage that the concepts and examples described here will raise awareness and promote the application of ML among molecular biologists, and eventually help improve a framework for systematic design and predictions of gene expression and phenotypes for synthetic biology applications.


Subjects
Genomics , Machine Learning , Reproducibility of Results , Phenotype , Genome
19.
Appl Microbiol Biotechnol ; 106(13-16): 4907-4920, 2022 Aug.
Article in English | MEDLINE | ID: mdl-35829788

ABSTRACT

Over the last two decades, thousands of genome-scale metabolic network models (GSMMs) have been constructed. These GSMMs have been widely applied in various fields, ranging from network interaction analysis to cell phenotype prediction. However, due to the lack of constraints, the prediction accuracy of first-generation GSMMs was limited. To overcome these limitations, next-generation GSMMs were developed by integrating omics data, adding constraint conditions, integrating different biological models, and constructing whole-cell models. Here, we review recent advances in GSMMs from the first generation to the next generation. Then, we discuss the major applications of GSMMs in industrial biotechnology, such as predicting phenotypes and guiding metabolic engineering. In addition, human health applications, including understanding biological mechanisms and discovering biomarkers and drug targets, are also summarized. Finally, we address the challenges and propose new trends for GSMMs. KEY POINTS: • This mini-review updates the literature on almost all published GSMMs since 1999. • Detailed insights into the development of the first- and next-generation GSMMs. • The application of GSMMs is summarized, and the prospects of integrating machine learning are emphasized.


Subjects
Metabolic Engineering , Metabolic Networks and Pathways , Genome , Humans , Machine Learning , Metabolic Networks and Pathways/genetics , Models, Biological
20.
Int J Mol Sci ; 23(21)2022 Oct 26.
Article in English | MEDLINE | ID: mdl-36361765

ABSTRACT

Noise is a basic ingredient in data, since observed data are always contaminated by unwanted deviations, i.e., noise, which, in the case of overdetermined systems (with more data than model parameters), cause the corresponding linear system of equations to have an imperfect solution. In addition, in the case of highly underdetermined parameterization, noise can be absorbed by the model, generating spurious solutions. This is a very undesirable situation that might lead to incorrect conclusions. We present a mathematical formalism based on inverse problem theory combined with artificial intelligence methodologies to perform an enhanced sampling of noisy biomedical data and improve the search for meaningful solutions. Random sampling methods fail for high-dimensional biomedical problems. Sampling methods such as smart model parameterizations, forward surrogates, and parallel computing are better suited for such problems. We applied these methods to several important biomedical problems, such as phenotype prediction and a problem related to predicting the effects of protein mutations, i.e., whether a given single-residue mutation is neutral or deleterious, causing a disease. We also applied these methods to de novo drug discovery and drug repositioning (repurposing) through the enhanced exploration of huge chemical space. The purpose of these novel methods that address the problem of noise and uncertainty in biomedical data is to find new therapeutic solutions, perform drug repurposing, and accelerate and optimize drug discovery, thus reestablishing homeostasis. Finding the right target, the right compound, and the right patient are the three bottlenecks to running successful clinical trials from the correct analysis of preclinical models. Artificial intelligence can provide a solution to these problems, considering that the character of the data restricts the quality of the prediction, as in any modeling procedure in data analysis. The use of simple and plain methodologies is crucial to tackling these important and challenging problems, particularly drug repositioning/repurposing in rare diseases.
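
The contrast drawn in this abstract between overdetermined and underdetermined systems can be made concrete with a toy straight-line model y = a + b*x. The sketch below is only an illustration under assumed values (the true parameters, noise level, and sample points are invented, and the minimum-norm solution is one arbitrary way of picking a solution when the data do not determine the model); it is not the sampling methodology described by the authors.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x (closed-form solution)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def residual_norm(xs, ys, a, b):
    """Euclidean norm of the misfit between data and the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) ** 0.5


def min_norm_line(x1, y1):
    """Smallest-norm (a, b) that satisfies the single equation a + b*x1 = y1."""
    scale = y1 / (1.0 + x1 ** 2)
    return scale, scale * x1


rng = random.Random(0)          # fixed seed so the run is repeatable
true_a, true_b = 5.0, 0.5       # assumed "true" model
noise_sd = 0.5                  # assumed noise level

# Overdetermined case: 20 noisy observations, 2 parameters.
xs = [i / 2 for i in range(20)]
ys = [true_a + true_b * x + rng.gauss(0.0, noise_sd) for x in xs]
a_hat, b_hat = fit_line(xs, ys)
over_residual = residual_norm(xs, ys, a_hat, b_hat)

# Underdetermined case: 1 noisy observation, 2 parameters.
x1 = 3.0
y1 = true_a + true_b * x1 + rng.gauss(0.0, noise_sd)
a_mn, b_mn = min_norm_line(x1, y1)
under_residual = abs(y1 - (a_mn + b_mn * x1))

print("overdetermined fit:", a_hat, b_hat, "residual:", over_residual)
print("underdetermined fit:", a_mn, b_mn, "residual:", under_residual)
```

In the overdetermined case no line passes through all the noisy points, so the residual stays above zero (the "imperfect solution"). In the underdetermined case the chosen parameters reproduce the noisy observation essentially exactly, yet they can lie far from the values that generated it, which is how noise ends up absorbed into a spurious model.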


Subjects
Artificial Intelligence , Drug Repositioning , Uncertainty , Drug Repositioning/methods , Drug Discovery/methods , Phenotype