1.
Bioinformatics ; 40(2)2024 02 01.
Article in English | MEDLINE | ID: mdl-38273672

ABSTRACT

MOTIVATION: Proteomic profiles reflect the functional readout of the physiological state of an organism. An increased understanding of what controls and defines protein abundances is of high scientific interest. Saccharomyces cerevisiae is a well-studied model organism, and there is a large amount of structured knowledge on yeast systems biology in databases such as the Saccharomyces Genome Database, and highly curated genome-scale metabolic models like Yeast8. These datasets, the result of decades of experiments, are abundant in information, and adhere to semantically meaningful ontologies. RESULTS: By representing this knowledge in an expressive Datalog database, we generated data descriptors using relational learning that, when combined with supervised machine learning, enable us to predict protein abundances in an explainable manner. We learnt predictive relationships between protein abundances, function and phenotype, such as α-amino acid accumulations and deviations in chronological lifespan. We further demonstrate the power of this methodology on the proteins His4 and Ilv2, connecting qualitative biological concepts to quantified abundances. AVAILABILITY AND IMPLEMENTATION: All data and processing scripts are available at the following GitHub repository: https://github.com/DanielBrunnsaker/ProtPredict.
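The idea of turning relational knowledge into data descriptors can be illustrated with a minimal sketch. It is not the authors' pipeline: the predicate names (`deleted_in`, `annotated_with`), the strains, the genes and the single rule are all hypothetical, and a real system would learn many such rules rather than hard-code one. The sketch only shows how a Datalog-style conjunctive rule evaluates to a Boolean feature that a supervised learner could consume.

```python
from typing import Dict, Set, Tuple

# Hypothetical facts, stored as relations (sets of tuples), as in a Datalog database.
facts: Dict[str, Set[Tuple[str, str]]] = {
    "deleted_in": {("strain1", "geneA"), ("strain2", "geneB")},
    "annotated_with": {("geneA", "amino_acid_biosynthesis"), ("geneB", "transport")},
}


def descriptor(strain: str, term: str) -> bool:
    """Evaluate the rule: d(S, T) :- deleted_in(S, G), annotated_with(G, T)."""
    return any(
        (gene, term) in facts["annotated_with"]
        for s, gene in facts["deleted_in"]
        if s == strain
    )


# One Boolean feature per strain; a table of such features is what the learner sees.
features = {s: descriptor(s, "amino_acid_biosynthesis") for s in ("strain1", "strain2")}
print(features)
```

Because each feature is the truth value of a readable rule, a model built on such features can be inspected rule by rule, which is one plausible reading of the "explainable manner" mentioned in the abstract.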


Subjects
Saccharomyces cerevisiae Proteins , Saccharomyces cerevisiae , Saccharomyces cerevisiae/genetics , Proteomics , Saccharomyces cerevisiae Proteins/genetics , Systems Biology/methods , Phenotype
3.
Bioinformatics ; 39(8)2023 08 01.
Article in English | MEDLINE | ID: mdl-37572302

ABSTRACT

MOTIVATION: Molecular docking is a commonly used approach for estimating binding conformations and their resultant binding affinities. Machine learning has been successfully deployed to enhance such affinity estimations. Many methods of varying complexity have been developed, making use of some or all of the spatial and categorical information available in these structures. The evaluation of such methods has mainly been carried out using datasets from PDBbind, particularly the Comparative Assessment of Scoring Functions (CASF) 2007, 2013, and 2016 datasets with dedicated test sets. This work demonstrates that only a small number of simple descriptors is necessary to efficiently estimate binding affinity for these complexes without the need to know the exact binding conformation of a ligand. RESULTS: The developed approach of using a small number of ligand and protein descriptors in conjunction with gradient boosting trees demonstrates high performance on the CASF datasets. This includes the commonly used benchmark CASF2016, where it appears to perform better than any other approach. This methodology is also useful for datasets where the spatial relationship between the ligand and protein is unknown, as demonstrated using a large ChEMBL-derived dataset. AVAILABILITY AND IMPLEMENTATION: Code and data uploaded to https://github.com/abbiAR/PLBAffinity.


Subjects
Machine Learning , Proteins , Molecular Docking Simulation , Ligands , Protein Binding , Proteins/chemistry
4.
Proc Natl Acad Sci U S A ; 118(49)2021 12 07.
Article in English | MEDLINE | ID: mdl-34845013

ABSTRACT

Almost all machine learning (ML) is based on representing examples using intrinsic features. When there are multiple related ML problems (tasks), it is possible to transform these features into extrinsic features by first training ML models on other tasks and letting them each make predictions for each example of the new task, yielding a novel representation. We call this transformational ML (TML). TML is very closely related to, and synergistic with, transfer learning, multitask learning, and stacking. TML is applicable to improving any nonlinear ML method. We tested TML using the most important classes of nonlinear ML: random forests, gradient boosting machines, support vector machines, k-nearest neighbors, and neural networks. To ensure the generality and robustness of the evaluation, we utilized thousands of ML problems from three scientific domains: drug design, predicting gene expression, and ML algorithm selection. We found that TML significantly improved the predictive performance of all the ML methods in all the domains (4 to 50% average improvements) and that TML features generally outperformed intrinsic features. Use of TML also enhances scientific understanding through explainable ML. In drug design, we found that TML provided insight into drug target specificity, the relationships between drugs, and the relationships between target proteins. TML leads to an ecosystem-based approach to ML, where new tasks, examples, predictions, and so on synergistically interact to improve performance. To contribute to this ecosystem, all our data, code, and our ∼50,000 ML models have been fully annotated with metadata, linked, and openly published using Findability, Accessibility, Interoperability, and Reusability principles (∼100 Gbytes).
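The feature transformation described in this abstract can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the published implementation: the 1-nearest-neighbour learner stands in for any nonlinear method, and the two prior tasks, their data, and the function names `fit_1nn` and `extrinsic_features` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Example = Sequence[float]
Model = Callable[[Example], float]


def fit_1nn(train: List[Tuple[Example, float]]) -> Model:
    """Return a 1-nearest-neighbour regressor, a stand-in for any nonlinear learner."""
    def predict(x: Example) -> float:
        # Pick the training example with the smallest squared Euclidean distance to x.
        nearest = min(train, key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)))
        return nearest[1]
    return predict


def extrinsic_features(x: Example, prior_models: List[Model]) -> List[float]:
    """Re-represent x as the predictions that models trained on other tasks make for it."""
    return [model(x) for model in prior_models]


# Two hypothetical prior tasks that share the same intrinsic feature space.
task_a = [((0.0, 0.0), 1.0), ((1.0, 1.0), 5.0)]
task_b = [((0.0, 1.0), -2.0), ((1.0, 0.0), 3.0)]
prior_models = [fit_1nn(task_a), fit_1nn(task_b)]

# An example from the new task, first described by its intrinsic features.
new_example = (0.9, 0.2)
print(extrinsic_features(new_example, prior_models))  # [5.0, 3.0]
```

A learner for the new task would then be trained on these prediction vectors instead of, or alongside, the intrinsic features; the length of the new representation equals the number of prior tasks.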

5.
BMC Bioinformatics ; 23(1): 323, 2022 Aug 06.
Article in English | MEDLINE | ID: mdl-35933367

ABSTRACT

BACKGROUND: A key problem in bioinformatics is that of predicting gene expression levels. There are two broad approaches: use of mechanistic models that aim to directly simulate the underlying biology, and use of machine learning (ML) to empirically predict expression levels from descriptors of the experiments. There are advantages and disadvantages to both approaches: mechanistic models more directly reflect the underlying biological causation, but do not directly utilize the available empirical data, while ML methods do not fully utilize existing biological knowledge. RESULTS: Here, we investigate overcoming these disadvantages by integrating mechanistic cell signalling models with ML. Our approach to integration is to augment ML with similarity features (attributes) computed from cell signalling models. Seven different sets of similarity features were generated using graph theory. Each set of features was in turn used to learn multi-target regression models. All the feature sets significantly improved accuracy over the baseline model, which lacks the similarity features. Finally, the seven multi-target regression models were stacked together to form an overall prediction model that was significantly better than the baseline on 95% of genes on an independent test set. The similarity features enable this stacking model to provide interpretable knowledge about cancer, e.g. the role of ERBB3 in the MCF7 breast cancer cell line. CONCLUSION: Integrating mechanistic models as graphs helps both to improve the predictive results of machine learning models and to provide biological knowledge about genes that can help in building state-of-the-art mechanistic models.
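As a rough illustration of a graph-derived similarity feature, the sketch below scores how much two nodes of a signalling graph share neighbours. The abstract does not say which seven graph measures were used, so the Jaccard index here is an assumption chosen for simplicity, and the graph, node names and function names are hypothetical.

```python
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]  # node -> set of directly connected nodes


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_features(graph: Graph, gene: str, references: List[str]) -> List[float]:
    """Describe a gene by its neighbourhood overlap with each reference gene."""
    neighbours = graph.get(gene, set())
    return [jaccard(neighbours, graph.get(ref, set())) for ref in references]


# Hypothetical toy graph; node names are placeholders, not real pathway edges.
toy_graph: Graph = {
    "g1": {"a", "b"},
    "g2": {"b", "c"},
    "g3": {"a"},
}

print(similarity_features(toy_graph, "g1", ["g2", "g3"]))
```

Each value is bounded between 0 and 1 and traces back to named nodes in the mechanistic model, which is what lets a model trained on such features remain interpretable.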


Subjects
Machine Learning , Neoplasms , Computational Biology/methods , Gene Expression , Humans
6.
Proc Natl Acad Sci U S A ; 116(36): 18142-18147, 2019 09 03.
Article in English | MEDLINE | ID: mdl-31420515

ABSTRACT

One of the most challenging tasks in modern science is the development of systems biology models: Existing models are often very complex but generally have low predictive performance. The construction of high-fidelity models will require hundreds/thousands of cycles of model improvement, yet few current systems biology research studies complete even a single cycle. We combined multiple software tools with integrated laboratory robotics to execute three cycles of model improvement of the prototypical eukaryotic cellular transformation, the yeast (Saccharomyces cerevisiae) diauxic shift. In the first cycle, a model outperforming the best previous diauxic shift model was developed using bioinformatic and systems biology tools. In the second cycle, the model was further improved using automatically planned experiments. In the third cycle, hypothesis-led experiments improved the model to a greater extent than achieved using high-throughput experiments. All of the experiments were formalized and communicated to a cloud laboratory automation system (Eve) for automatic execution, and the results stored on the semantic web for reuse. The final model adds a substantial amount of knowledge about the yeast diauxic shift: 92 genes (+45%), and 1,048 interactions (+147%). This knowledge is also relevant to understanding cancer, the immune system, and aging. We conclude that systems biology software tools can be combined and integrated with laboratory robots in closed-loop cycles.


Subjects
Computational Biology , Gene Expression Regulation, Fungal , Robotics , Saccharomyces cerevisiae , Software , Systems Biology , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism
8.
BMC Bioinformatics ; 15 Suppl 14: S5, 2014.
Article in English | MEDLINE | ID: mdl-25472549

ABSTRACT

BACKGROUND: The reliability and reproducibility of experimental procedures is a cornerstone of scientific practice. There is a pressing technological need for the better representation of biomedical protocols to enable other agents (human or machine) to better reproduce results. A framework that ensures that all information required for the replication of experimental protocols is captured is essential to achieve reproducibility. To construct EXACT2 we manually inspected hundreds of published and commercial biomedical protocols from several areas of biomedicine. After establishing a clear pattern for extracting the required information we utilized text-mining tools to translate the protocols into a machine amenable format. We have verified the utility of EXACT2 through the successful processing of previously 'unseen' (not used for the construction of EXACT2) protocols. METHODS: We have developed the ontology EXACT2 (EXperimental ACTions) that is designed to capture the full semantics of biomedical protocols required for their reproducibility. RESULTS: The paper reports on a fundamentally new version of EXACT2 that supports the semantically-defined representation of biomedical protocols. The ability of EXACT2 to capture the semantics of biomedical procedures was verified through a text mining use case. In this use case, EXACT2 is used as a reference model for text mining tools to identify terms pertinent to experimental actions, and their properties, in biomedical protocols expressed in natural language. An EXACT2-based framework for the translation of biomedical protocols to a machine amenable format is proposed. CONCLUSIONS: The EXACT2 ontology is sufficient to record, in a machine processable form, the essential information about biomedical protocols. EXACT2 defines explicit semantics of experimental actions, and can be used by various computer applications. It can serve as a reference model for the translation of biomedical protocols in natural language into a semantically-defined format.


Subjects
Biological Ontologies , Data Mining , Software , Electronic Data Processing , Language , Reproducibility of Results , Semantics
9.
Microbiol Spectr ; : e0124924, 2024 Aug 20.
Article in English | MEDLINE | ID: mdl-39162260

ABSTRACT

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus continues to cause severe disease and deaths in many parts of the world, despite massive vaccination efforts. Antiviral drugs to curb an ongoing infection remain a priority. The virus-encoded 3C-like main protease (MPro; nsp5) is seen as a promising target. Here, with a positive selection genetic system engineered in Saccharomyces cerevisiae using cleavage and release of MazF toxin as an indicator, we screened in a robotized setup small molecule libraries comprising ~2,500 compounds for MPro inhibitors. We detected eight compounds as effective against MPro expressed in yeast, five of which are characterized proteasome inhibitors. Molecular docking indicates that most of these bind covalently to the MPro catalytically active cysteine. Compounds were confirmed as MPro inhibitors in an in vitro enzymatic assay. Among those were three previously only predicted in silico: the boron-containing proteasome inhibitors bortezomib, delanzomib, and ixazomib. Importantly, we establish reaction conditions in vitro preserving the MPro-inhibitory activity of the boron-containing drugs. These differ from the standard conditions, which may explain why boron compounds have gone undetected in screens based on enzymatic in vitro assays. Our screening system is robust and can find inhibitors of a specific protease that are biostable, able to penetrate a cell membrane, and are not generally toxic. As a cellular assay, it can detect inhibitors that fail in a screen based on an in vitro enzymatic assay using standardized conditions, and now gives support for boron compounds as MPro inhibitors. This method can also be adapted for other viral proteases. IMPORTANCE The coronavirus disease 2019 (COVID-19) pandemic triggered the realization that we need flexible approaches to find treatments for emerging viral threats.
We implemented a genetically engineered platform in yeast to detect inhibitors of the virus's main protease (MPro), a promising target to curb severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infections. Screening molecule libraries, we identified candidate inhibitors and verified them in a biochemical assay. Moreover, the system detected boron-containing molecules as MPro inhibitors. Those were previously predicted computationally but never shown effective in a biochemical assay. Here, we demonstrate that they require a non-standard reaction buffer to function as MPro inhibitors. Hence, our cell-based method detects protease inhibitors missed by other approaches and provides support for the boron-containing molecules. We have thus demonstrated that our platform can screen large numbers of chemicals to find potential inhibitors of a viral protease. Importantly, the platform can be modified to detect protease targets from other emerging viruses.

10.
J Am Soc Mass Spectrom ; 35(3): 542-550, 2024 Mar 06.
Article in English | MEDLINE | ID: mdl-38310603

ABSTRACT

Automation is dramatically changing the nature of laboratory life science. Robotic lab hardware that can perform manual operations with greater speed, endurance, and reproducibility opens an avenue for faster scientific discovery with less time spent on laborious repetitive tasks. A major bottleneck remains in integrating cutting-edge laboratory equipment into automated workflows, notably specialized analytical equipment, which is designed for human usage. Here we present AutonoMS, a platform for automatically running, processing, and analyzing high-throughput mass spectrometry experiments. AutonoMS is currently written around an ion mobility mass spectrometry (IM-MS) platform and can be adapted to additional analytical instruments and data processing flows. AutonoMS enables automated software agent-controlled end-to-end measurement and analysis runs from experimental specification files that can be produced by human users or upstream software processes. We demonstrate the use and abilities of AutonoMS in a high-throughput flow-injection ion mobility configuration with 5 s sample analysis time, processing robotically prepared chemical standards and cultured yeast samples in targeted and untargeted metabolomics applications. The platform exhibited consistency, reliability, and ease of use while eliminating the need for human intervention in the process of sample injection, data processing, and analysis. The platform paves the way toward a more fully automated mass spectrometry analysis and ultimately closed-loop laboratory workflows involving automated experimentation and analysis coupled to AI-driven experimentation utilizing cutting-edge analytical instrumentation. AutonoMS documentation is available at https://autonoms.readthedocs.io.


Subjects
Metabolomics , Software , Humans , Reproducibility of Results , Mass Spectrometry , Automation
11.
Bioinform Adv ; 3(1): vbad102, 2023.
Article in English | MEDLINE | ID: mdl-37600845

ABSTRACT

Summary: Artificial intelligence (AI)-driven laboratory automation, which combines robotic labware and autonomous software agents, is a powerful trend in modern biology. We developed Genesis-DB, a database system designed to support AI-driven autonomous laboratories by providing software agents access to large quantities of structured domain information. In addition, we present a new ontology for modeling data and metadata from autonomously performed yeast microchemostat cultivations in the framework of the Genesis robot scientist system. We show an example of how Genesis-DB enables the research life cycle by modeling yeast gene regulation, guiding future hypotheses generation and design of experiments. Genesis-DB supports AI-driven discovery through automated reasoning and its design is portable, generic, and easily extensible to other AI-driven molecular biology laboratory data and beyond. Availability and implementation: Genesis-DB code and installation instructions are available at the GitHub repository https://github.com/TW-Genesis/genesis-database-system.git. The database use case demo code and data are also available through GitHub (https://github.com/TW-Genesis/genesis-database-demo.git). The ontology can be downloaded here: https://github.com/TW-Genesis/genesis-ontology/releases/download/v0.0.23/genesis.owl. The ontology term descriptions (including mappings to existing ontologies) and maintenance standard operating procedures can be found at: https://github.com/TW-Genesis/genesis-ontology.

12.
NPJ Syst Biol Appl ; 9(1): 11, 2023 04 07.
Article in English | MEDLINE | ID: mdl-37029131

ABSTRACT

Saccharomyces cerevisiae is a very well studied organism, yet ∼20% of its proteins remain poorly characterized. Moreover, recent studies seem to indicate that the pace of functional discovery is slow. Previous work has implied that the most probable path forward is via not only automation but fully autonomous systems in which active learning is applied to guide high-throughput experimentation. Development of tools and methods for these types of systems is of paramount importance. In this study we use constrained dynamical flux balance analysis (dFBA) to select ten regulatory deletant strains that are likely to have previously unexplored connections to the diauxic shift. We then analyzed these deletant strains using untargeted metabolomics, generating profiles which were subsequently investigated to better understand the consequences of the gene deletions in the metabolic reconfiguration of the diauxic shift. We show that metabolic profiles can be utilised to gain insight not only into cellular transformations such as the diauxic shift, but also into the regulatory roles and biological consequences of regulatory gene deletion. We also conclude that untargeted metabolomics is a useful tool for guidance in high-throughput model improvement, and is a fast, sensitive and informative approach appropriate for future large-scale functional analyses of genes. Moreover, it is well-suited for automated approaches due to the relative simplicity of processing and the potential to make it massively high-throughput.


Subjects
Saccharomyces cerevisiae Proteins , Saccharomyces cerevisiae , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae Proteins/metabolism , Metabolomics/methods
13.
R Soc Open Sci ; 9(5): 211745, 2022 May.
Article in English | MEDLINE | ID: mdl-35573039

ABSTRACT

The representation of the protein-ligand complexes used in building machine learning models plays an important role in the accuracy of binding affinity prediction. The Extended Connectivity Interaction Features (ECIF) is one such representation. We report that (i) including the discretized distances between protein-ligand atom pairs in the ECIF scheme improves predictive accuracy, and (ii) in an evaluation using gradient boosted trees, we found that the resampling method used in selecting the best hyperparameters has a strong effect on predictive performance, especially for benchmarking purposes.
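Point (i) can be pictured as counting protein-ligand atom pairs per atom-type pair and per distance bin. The sketch below is a simplified illustration, not the ECIF code: real ECIF atom types are richer than the single-letter labels used here, and the bin edges, coordinates and the function name `binned_pair_counts` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

# (atom type label, xyz coordinates in angstroms); labels are simplified placeholders.
Atom = Tuple[str, Tuple[float, float, float]]


def binned_pair_counts(
    protein: List[Atom], ligand: List[Atom], bin_edges: Sequence[float]
) -> Counter:
    """Count (protein type, ligand type, distance bin) triples over all atom pairs."""
    counts: Counter = Counter()
    for p_type, p_xyz in protein:
        for l_type, l_xyz in ligand:
            d = math.dist(p_xyz, l_xyz)
            # Bin i covers the half-open interval [bin_edges[i], bin_edges[i + 1]).
            for i in range(len(bin_edges) - 1):
                if bin_edges[i] <= d < bin_edges[i + 1]:
                    counts[(p_type, l_type, i)] += 1
                    break
    return counts


protein = [("C", (0.0, 0.0, 0.0)), ("N", (3.0, 0.0, 0.0))]
ligand = [("O", (1.0, 0.0, 0.0))]
print(binned_pair_counts(protein, ligand, bin_edges=[0.0, 1.5, 3.0, 6.0]))
```

Pairs farther apart than the last bin edge are ignored, which plays the role of a distance cutoff; with a single bin the scheme reduces to plain pair counts within that cutoff.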

14.
J R Soc Interface ; 19(189): 20210821, 2022 04.
Article in English | MEDLINE | ID: mdl-35382578

ABSTRACT

Scientific results should not just be 'repeatable' (replicable in the same laboratory under identical conditions), but also 'reproducible' (replicable in other laboratories under similar conditions). Results should also, if possible, be 'robust' (replicable under a wide range of conditions). The reproducibility and robustness of only a small fraction of published biomedical results has been tested; furthermore, when reproducibility is tested, it is often not found. This situation is termed 'the reproducibility crisis', and it is one of the most important issues facing biomedicine. This crisis would be solved if it were possible to automate reproducibility testing. Here, we describe the semi-automated testing for reproducibility and robustness of simple statements (propositions) about cancer cell biology automatically extracted from the literature. From 12 260 papers, we automatically extracted statements predicted to describe experimental results regarding a change of gene expression in response to drug treatment in breast cancer; from these we selected 74 statements of high biomedical interest. To test the reproducibility of these statements, two different teams used the laboratory automation system Eve and two breast cancer cell lines (MCF7 and MDA-MB-231). Statistically significant evidence for repeatability was found for 43 statements, and significant evidence for reproducibility/robustness in 22 statements. In two cases, the automation made serendipitous discoveries. The reproduced/robust knowledge provides significant insight into cancer. We conclude that semi-automated reproducibility testing is currently achievable, that it could be scaled up to generate a substantive source of reliable knowledge and that automation has the potential to mitigate the reproducibility crisis.


Subjects
Breast Neoplasms , Robotics , Automation , Biology , Female , Humans , Reproducibility of Results
15.
Plant Physiol ; 153(4): 1506-20, 2010 Aug.
Article in English | MEDLINE | ID: mdl-20566707

ABSTRACT

Metabolite fingerprinting of Arabidopsis (Arabidopsis thaliana) mutants with known or predicted metabolic lesions was performed by (1)H-nuclear magnetic resonance, Fourier transform infrared, and flow injection electrospray-mass spectrometry. Fingerprinting enabled processing of five times more plants than conventional chromatographic profiling and was competitive for discriminating mutants, other than those affected in only low-abundance metabolites. Despite their rapidity and complexity, fingerprints yielded metabolomic insights (e.g. that effects of single lesions were usually not confined to individual pathways). Among fingerprint techniques, (1)H-nuclear magnetic resonance discriminated the most mutant phenotypes from the wild type and Fourier transform infrared discriminated the fewest. To maximize information from fingerprints, data analysis was crucial. One-third of distinctive phenotypes might have been overlooked had data models been confined to principal component analysis score plots. Among several methods tested, machine learning (ML) algorithms, namely support vector machine or random forest (RF) classifiers, were unsurpassed for phenotype discrimination. Support vector machines were often the best performing classifiers, but RFs yielded some particularly informative measures. First, RFs estimated margins between mutant phenotypes, whose relations could then be visualized by Sammon mapping or hierarchical clustering. Second, RFs provided importance scores for the features within fingerprints that discriminated mutants. These scores correlated with analysis of variance F values (as did Kruskal-Wallis tests, true- and false-positive measures, mutual information, and the Relief feature selection algorithm). ML classifiers, as models trained on one data set to predict another, were ideal for focused metabolomic queries, such as the distinctiveness and consistency of mutant phenotypes. Accessible software for use of ML in plant physiology is highlighted.


Subjects
Arabidopsis/metabolism , Artificial Intelligence , Metabolomics , Algorithms , Cluster Analysis , Magnetic Resonance Spectroscopy , Mass Spectrometry , Phenotype , Principal Component Analysis , Spectroscopy, Fourier Transform Infrared
16.
mSystems ; 6(6): e0108721, 2021 Dec 21.
Article in English | MEDLINE | ID: mdl-34812651

ABSTRACT

The ongoing COVID-19 pandemic urges searches for antiviral agents that can block infection or ameliorate its symptoms. Using dissimilar search strategies for new antivirals will improve our overall chances of finding effective treatments. Here, we have established an experimental platform for screening of small molecule inhibitors of the SARS-CoV-2 main protease in Saccharomyces cerevisiae cells, genetically engineered to enhance cellular uptake of small molecules in the environment. The system consists of a fusion of the Escherichia coli toxin MazF and its antitoxin MazE, with insertion of a protease cleavage site in the linker peptide connecting the MazE and MazF moieties. Expression of the viral protease confers cleavage of the MazEF fusion, releasing the MazF toxin from its antitoxin, resulting in growth inhibition. In the presence of a small molecule inhibiting the protease, cleavage is blocked and the MazF toxin remains inhibited, promoting growth. The system thus allows positive selection for inhibitors. The engineered yeast strain is tagged with a fluorescent marker protein, allowing precise monitoring of its growth in the presence or absence of inhibitor. We detect an established main protease inhibitor by a robust growth increase, discernible down to 1 µM. The system is suitable for robotized large-scale screens. It allows in vivo evaluation of drug candidates and is rapidly adaptable for new variants of the protease with deviant site specificities. IMPORTANCE The COVID-19 pandemic may continue for several years before vaccination campaigns can put an end to it globally. Thus, the need for discovery of new antiviral drug candidates will remain. We have engineered a system in yeast cells for the detection of small molecule inhibitors of one attractive drug target of SARS-CoV-2, its main protease, which is required for viral replication. 
The ability to detect inhibitors in live cells brings the advantage that only compounds capable of entering the cell and remaining stable there will score in the system. Moreover, because of its design in yeast cells, the system is rapidly adaptable for tuning the detection level and eventual modification of the protease cleavage site in the case of future mutant variants of the SARS-CoV-2 main protease or even for other proteases.

17.
Bioinformatics ; 25(16): 2020-7, 2009 Aug 15.
Article in English | MEDLINE | ID: mdl-19535531

ABSTRACT

MOTIVATION: Distribution analysis is one of the most basic forms of statistical analysis. Thanks to improved analytical methods, accurate and extensive quantitative measurements can now be made of the mRNA, protein and metabolite from biological systems. Here, we report a large-scale analysis of the population abundance distributions of the transcriptomes, proteomes and metabolomes from varied biological systems. RESULTS: We compared the observed empirical distributions with a number of distributions: power law, lognormal, loglogistic, loggamma, right Pareto-lognormal (PLN) and double PLN (dPLN). The best-fit for mRNA, protein and metabolite population abundance distributions was found to be the dPLN. This distribution behaves like a lognormal distribution around the centre, and like a power law distribution in the tails. To better understand the cause of this observed distribution, we explored a simple stochastic model based on geometric Brownian motion. The distribution indicates that multiplicative effects are causally dominant in biological systems. We speculate that these effects arise from chemical reactions: the central-limit theorem then explains the central lognormal, and a number of possible mechanisms could explain the long tails: positive feedback, network topology, etc. Many of the components in the central lognormal parts of the empirical distributions are unidentified and/or have unknown function. This indicates that much more biology awaits discovery.
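The link between multiplicative effects and a lognormal centre can be illustrated with geometric Brownian motion sampled at a fixed time. This is a minimal sketch and not the authors' model: the parameter values and the function name `gbm_endpoint` are assumptions, and the power-law tails of the dPLN need additional mechanisms that are not simulated here.

```python
import math
import random


def gbm_endpoint(x0: float, mu: float, sigma: float, t: float, rng: random.Random) -> float:
    """Draw X_t for dX = mu*X dt + sigma*X dW, using the exact closed-form solution."""
    z = rng.gauss(0.0, 1.0)
    return x0 * math.exp((mu - 0.5 * sigma ** 2) * t + sigma * math.sqrt(t) * z)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [gbm_endpoint(1.0, 0.1, 0.5, 2.0, rng) for _ in range(10_000)]

# log(X_t) is normal with mean log(x0) + (mu - sigma^2 / 2) * t, here -0.05,
# so the abundances themselves follow a lognormal distribution.
log_mean = sum(math.log(s) for s in samples) / len(samples)
print(round(log_mean, 2))
```

Every sample stays strictly positive and the log-values are symmetric around their mean, which is the lognormal behaviour the abstract reports around the centre of the observed distributions.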


Subjects
Metabolome , Proteins/metabolism , RNA, Messenger/metabolism , Computer Simulation , Models, Biological , Proteins/chemistry
18.
Nature ; 427(6971): 247-52, 2004 Jan 15.
Article in English | MEDLINE | ID: mdl-14724639

ABSTRACT

The question of whether it is possible to automate the scientific process is of both great theoretical interest and increasing practical importance because, in many scientific areas, data are being generated much faster than they can be effectively analysed. We describe a physically implemented robotic system that applies techniques from artificial intelligence to carry out cycles of scientific experimentation. The system automatically originates hypotheses to explain observations, devises experiments to test these hypotheses, physically runs the experiments using a laboratory robot, interprets the results to falsify hypotheses inconsistent with the data, and then repeats the cycle. Here we apply the system to the determination of gene function using deletion mutants of yeast (Saccharomyces cerevisiae) and auxotrophic growth experiments. We built and tested a detailed logical model (involving genes, proteins and metabolites) of the aromatic amino acid synthesis pathway. In biological experiments that automatically reconstruct parts of this model, we show that an intelligent experiment selection strategy is competitive with human performance and significantly outperforms, with a cost decrease of 3-fold and 100-fold (respectively), both cheapest and random-experiment selection.
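The cycle described in this abstract (hypothesise, select an experiment, run it, falsify, repeat) can be sketched as a small loop. This is a toy under stated assumptions, not the published system: hypotheses are reduced to tables of predicted outcomes, the selection rule ignores experiment cost (which the paper does take into account), and `choose_experiment`, `run_cycles` and the simulated laboratory are hypothetical names.

```python
from typing import Callable, Dict, List

# A hypothesis maps each experiment to its predicted outcome (e.g. growth or no growth).
Hypothesis = Dict[str, bool]


def choose_experiment(hypotheses: List[Hypothesis], experiments: List[str]) -> str:
    """Pick the experiment whose worst-case outcome leaves the fewest hypotheses standing."""
    def worst_case(exp: str) -> int:
        predicted_true = sum(h[exp] for h in hypotheses)
        return max(predicted_true, len(hypotheses) - predicted_true)
    return min(experiments, key=worst_case)


def run_cycles(
    hypotheses: List[Hypothesis],
    experiments: List[str],
    run_experiment: Callable[[str], bool],
) -> List[Hypothesis]:
    """Repeat select, run, falsify until one hypothesis or no experiments remain."""
    remaining = list(hypotheses)
    untried = list(experiments)
    while len(remaining) > 1 and untried:
        exp = choose_experiment(remaining, untried)
        untried.remove(exp)
        outcome = run_experiment(exp)  # stands in for the physical robot run
        remaining = [h for h in remaining if h[exp] == outcome]  # falsification step
    return remaining


candidates: List[Hypothesis] = [
    {"e1": True, "e2": True, "e3": False},
    {"e1": True, "e2": False, "e3": False},
    {"e1": False, "e2": True, "e3": False},
    {"e1": False, "e2": False, "e3": True},
]
truth = candidates[2]  # the simulated laboratory answers according to this hypothesis
survivors = run_cycles(candidates, ["e1", "e2", "e3"], lambda exp: truth[exp])
print(survivors)
```

In this toy run two experiments suffice to isolate the true hypothesis among four, because each chosen experiment splits the surviving hypotheses evenly; a random or cheapest-first choice could need all three.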


Subjects
Genomics/instrumentation , Genomics/methods , Models, Biological , Research Design , Research Personnel/statistics & numerical data , Research/instrumentation , Robotics/methods , Algorithms , Amino Acids/biosynthesis , Computational Biology , Computer Simulation , Cost-Benefit Analysis , Efficiency , Gene Deletion , Genes, Fungal/genetics , Humans , Learning , Open Reading Frames , Phenotype , Probability , Research Personnel/standards , Robotics/instrumentation , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae Proteins/metabolism , Software , Time Factors , Workforce
19.
Mach Learn ; 109(2): 251-277, 2020.
Article in English | MEDLINE | ID: mdl-32174648

ABSTRACT

In phenotype prediction the physical characteristics of an organism are predicted from knowledge of its genotype and environment. Such studies, often called genome-wide association studies, are of the highest societal importance, as they are of central importance to medicine, crop-breeding, etc. We investigated three phenotype prediction problems: one simple and clean (yeast), and the other two complex and real-world (rice and wheat). We compared standard machine learning methods; elastic net, ridge regression, lasso regression, random forest, gradient boosting machines (GBM), and support vector machines (SVM), with two state-of-the-art classical statistical genetics methods; genomic BLUP and a two-step sequential method based on linear regression. Additionally, using the clean yeast data, we investigated how performance varied with the complexity of the biological mechanism, the amount of observational noise, the number of examples, the amount of missing data, and the use of different data representations. We found that for almost all the phenotypes considered, standard machine learning methods outperformed the methods from classical statistical genetics. On the yeast problem, the most successful method was GBM, followed by lasso regression, and the two statistical genetics methods; with greater mechanistic complexity GBM was best, while in simpler cases lasso was superior. In the wheat and rice studies the best two methods were SVM and BLUP. The most robust method in the presence of noise, missing data, etc. was random forests. The classical statistical genetics method of genomic BLUP was found to perform well on problems where there was population structure. This suggests that standard machine learning methods need to be refined to include population structure information when this is present. 
We conclude that the application of machine learning methods to phenotype prediction problems holds great promise, but that determining which method is likely to perform well on any given problem is elusive and non-trivial.

20.
Bioinformatics ; 24(13): i295-303, 2008 Jul 01.
Article in English | MEDLINE | ID: mdl-18586727

ABSTRACT

MOTIVATION: Many published manuscripts contain experiment protocols which are poorly described or deficient in information. This means that the published results are very hard or impossible to repeat. This problem is being made worse by the increasing complexity of high-throughput/automated methods. There is therefore a growing need to represent experiment protocols in an efficient and unambiguous way. RESULTS: We have developed the Experiment ACTions (EXACT) ontology as the basis of a method of representing biological laboratory protocols. We provide example protocols that have been formalized using EXACT, and demonstrate the advantages and opportunities created by using this formalization. We argue that the use of EXACT will result in the publication of protocols with increased clarity and usefulness to the scientific community. AVAILABILITY: The ontology, examples and code can be downloaded from http://www.aber.ac.uk/compsci/Research/bio/dss/EXACT/.


Subjects
Database Management Systems , Databases, Factual , Documentation/methods , Information Storage and Retrieval/methods , Internet , Research/classification , Research/standards