Results 1 - 20 of 57
1.
Bioinformatics ; 40(2)2024 02 01.
Article in English | MEDLINE | ID: mdl-38273672

ABSTRACT

MOTIVATION: Proteomic profiles reflect the functional readout of the physiological state of an organism. An increased understanding of what controls and defines protein abundances is of high scientific interest. Saccharomyces cerevisiae is a well-studied model organism, and there is a large amount of structured knowledge on yeast systems biology in databases such as the Saccharomyces Genome Database, and highly curated genome-scale metabolic models like Yeast8. These datasets, the result of decades of experiments, are abundant in information, and adhere to semantically meaningful ontologies. RESULTS: By representing this knowledge in an expressive Datalog database we generated data descriptors using relational learning that, when combined with supervised machine learning, enable us to predict protein abundances in an explainable manner. We learnt predictive relationships between protein abundances, function and phenotype, such as α-amino acid accumulations and deviations in chronological lifespan. We further demonstrate the power of this methodology on the proteins His4 and Ilv2, connecting qualitative biological concepts to quantified abundances. AVAILABILITY AND IMPLEMENTATION: All data and processing scripts are available at the following GitHub repository: https://github.com/DanielBrunnsaker/ProtPredict.


Subject(s)
Saccharomyces cerevisiae Proteins, Saccharomyces cerevisiae, Saccharomyces cerevisiae/genetics, Proteomics, Saccharomyces cerevisiae Proteins/genetics, Systems Biology/methods, Phenotype
3.
Bioinformatics ; 39(8)2023 08 01.
Article in English | MEDLINE | ID: mdl-37572302

ABSTRACT

MOTIVATION: Molecular docking is a commonly used approach for estimating binding conformations and their resultant binding affinities. Machine learning has been successfully deployed to enhance such affinity estimations. Many methods of varying complexity have been developed making use of some or all of the spatial and categorical information available in these structures. The evaluation of such methods has mainly been carried out using datasets from PDBbind, particularly the Comparative Assessment of Scoring Functions (CASF) 2007, 2013, and 2016 datasets with dedicated test sets. This work demonstrates that only a small number of simple descriptors is necessary to efficiently estimate binding affinity for these complexes without the need to know the exact binding conformation of a ligand. RESULTS: The developed approach of using a small number of ligand and protein descriptors in conjunction with gradient boosting trees demonstrates high performance on the CASF datasets. This includes the commonly used benchmark CASF2016, where it appears to perform better than any other approach. This methodology is also useful for datasets where the spatial relationship between the ligand and protein is unknown, as demonstrated using a large ChEMBL-derived dataset. AVAILABILITY AND IMPLEMENTATION: Code and data uploaded to https://github.com/abbiAR/PLBAffinity.


Subject(s)
Machine Learning, Proteins, Molecular Docking Simulation, Ligands, Protein Binding, Proteins/chemistry
4.
Proc Natl Acad Sci U S A ; 118(49)2021 12 07.
Article in English | MEDLINE | ID: mdl-34845013

ABSTRACT

Almost all machine learning (ML) is based on representing examples using intrinsic features. When there are multiple related ML problems (tasks), it is possible to transform these features into extrinsic features by first training ML models on other tasks and letting them each make predictions for each example of the new task, yielding a novel representation. We call this transformational ML (TML). TML is very closely related to, and synergistic with, transfer learning, multitask learning, and stacking. TML is applicable to improving any nonlinear ML method. We tested TML using the most important classes of nonlinear ML: random forests, gradient boosting machines, support vector machines, k-nearest neighbors, and neural networks. To ensure the generality and robustness of the evaluation, we utilized thousands of ML problems from three scientific domains: drug design, predicting gene expression, and ML algorithm selection. We found that TML significantly improved the predictive performance of all the ML methods in all the domains (4 to 50% average improvements) and that TML features generally outperformed intrinsic features. Use of TML also enhances scientific understanding through explainable ML. In drug design, we found that TML provided insight into drug target specificity, the relationships between drugs, and the relationships between target proteins. TML leads to an ecosystem-based approach to ML, where new tasks, examples, predictions, and so on synergistically interact to improve performance. To contribute to this ecosystem, all our data, code, and our ∼50,000 ML models have been fully annotated with metadata, linked, and openly published using Findability, Accessibility, Interoperability, and Reusability principles (∼100 Gbytes).
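
The feature transformation described in this abstract can be illustrated with a minimal sketch. It assumes a single intrinsic feature, two made-up prior tasks, and a simple least-squares line as the per-task model; the names `fit_line`, `prior_tasks` and `to_extrinsic` are hypothetical and are not taken from the paper, which used nonlinear learners on real datasets.

```python
# Toy sketch of transformational ML (TML). All data and helper names are
# hypothetical; a least-squares line stands in for the per-task learner.

def fit_line(xs, ys):
    """Least-squares fit y = a*x + b for a single intrinsic feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b

# Each prior task maps the same intrinsic feature to its own target.
prior_tasks = {
    "task_a": ([0.0, 1.0, 2.0, 3.0], [0.0, 2.0, 4.0, 6.0]),    # y = 2x
    "task_b": ([0.0, 1.0, 2.0, 3.0], [1.0, 0.0, -1.0, -2.0]),  # y = 1 - x
}

# Step 1: train one model per prior task.
prior_models = {name: fit_line(xs, ys) for name, (xs, ys) in prior_tasks.items()}

def to_extrinsic(x):
    """Step 2: represent an example by the prior models' predictions."""
    return [prior_models[name](x) for name in sorted(prior_models)]

# Examples of the new task, re-expressed as extrinsic feature vectors.
new_task_inputs = [0.5, 1.5, 2.5]
extrinsic_features = [to_extrinsic(x) for x in new_task_inputs]
print(extrinsic_features)
```

A learner for the new task would then be trained on `extrinsic_features` instead of, or alongside, the intrinsic feature; that final step is omitted here.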

5.
BMC Bioinformatics ; 23(1): 323, 2022 Aug 06.
Article in English | MEDLINE | ID: mdl-35933367

ABSTRACT

BACKGROUND: A key problem in bioinformatics is that of predicting gene expression levels. There are two broad approaches: use of mechanistic models that aim to directly simulate the underlying biology, and use of machine learning (ML) to empirically predict expression levels from descriptors of the experiments. There are advantages and disadvantages to both approaches: mechanistic models more directly reflect the underlying biological causation, but do not directly utilize the available empirical data; while ML methods do not fully utilize existing biological knowledge. RESULTS: Here, we investigate overcoming these disadvantages by integrating mechanistic cell signalling models with ML. Our approach to integration is to augment ML with similarity features (attributes) computed from cell signalling models. Seven sets of different similarity features were generated using graph theory. Each set of features was in turn used to learn multi-target regression models. All the feature sets significantly improved accuracy over the baseline model (the model without the similarity features). Finally, the seven multi-target regression models were stacked together to form an overall prediction model that was significantly better than the baseline on 95% of genes on an independent test set. The similarity features enable this stacking model to provide interpretable knowledge about cancer, e.g. the role of ERBB3 in the MCF7 breast cancer cell line. CONCLUSION: Integrating mechanistic models as graphs helps both to improve the predictive results of machine learning models and to provide biological knowledge about genes that can help in building state-of-the-art mechanistic models.


Subject(s)
Machine Learning, Neoplasms, Computational Biology/methods, Gene Expression, Humans
6.
Proc Natl Acad Sci U S A ; 116(36): 18142-18147, 2019 09 03.
Article in English | MEDLINE | ID: mdl-31420515

ABSTRACT

One of the most challenging tasks in modern science is the development of systems biology models: Existing models are often very complex but generally have low predictive performance. The construction of high-fidelity models will require hundreds/thousands of cycles of model improvement, yet few current systems biology research studies complete even a single cycle. We combined multiple software tools with integrated laboratory robotics to execute three cycles of model improvement of the prototypical eukaryotic cellular transformation, the yeast (Saccharomyces cerevisiae) diauxic shift. In the first cycle, a model outperforming the best previous diauxic shift model was developed using bioinformatic and systems biology tools. In the second cycle, the model was further improved using automatically planned experiments. In the third cycle, hypothesis-led experiments improved the model to a greater extent than achieved using high-throughput experiments. All of the experiments were formalized and communicated to a cloud laboratory automation system (Eve) for automatic execution, and the results stored on the semantic web for reuse. The final model adds a substantial amount of knowledge about the yeast diauxic shift: 92 genes (+45%), and 1,048 interactions (+147%). This knowledge is also relevant to understanding cancer, the immune system, and aging. We conclude that systems biology software tools can be combined and integrated with laboratory robots in closed-loop cycles.


Subject(s)
Computational Biology, Fungal Gene Expression Regulation, Robotics, Saccharomyces cerevisiae, Software, Systems Biology, Saccharomyces cerevisiae/genetics, Saccharomyces cerevisiae/metabolism
8.
BMC Bioinformatics ; 15 Suppl 14: S5, 2014.
Article in English | MEDLINE | ID: mdl-25472549

ABSTRACT

BACKGROUND: The reliability and reproducibility of experimental procedures is a cornerstone of scientific practice. There is a pressing technological need for the better representation of biomedical protocols to enable other agents (human or machine) to better reproduce results. A framework that captures all the information required to replicate experimental protocols is essential to achieve reproducibility. To construct EXACT2 we manually inspected hundreds of published and commercial biomedical protocols from several areas of biomedicine. After establishing a clear pattern for extracting the required information we utilized text-mining tools to translate the protocols into a machine amenable format. We have verified the utility of EXACT2 through the successful processing of previously 'unseen' (not used for the construction of EXACT2) protocols. METHODS: We have developed the ontology EXACT2 (EXperimental ACTions) that is designed to capture the full semantics of biomedical protocols required for their reproducibility. RESULTS: The paper reports on a fundamentally new version, EXACT2, that supports the semantically-defined representation of biomedical protocols. The ability of EXACT2 to capture the semantics of biomedical procedures was verified through a text mining use case. In this use case, EXACT2 is used as a reference model for text mining tools to identify terms pertinent to experimental actions, and their properties, in biomedical protocols expressed in natural language. An EXACT2-based framework for the translation of biomedical protocols to a machine amenable format is proposed. CONCLUSIONS: The EXACT2 ontology is sufficient to record, in a machine processable form, the essential information about biomedical protocols. EXACT2 defines explicit semantics of experimental actions, and can be used by various computer applications. It can serve as a reference model for the translation of biomedical protocols in natural language into a semantically-defined format.


Subject(s)
Biological Ontologies, Data Mining, Software, Electronic Data Processing, Language, Reproducibility of Results, Semantics
9.
J Am Soc Mass Spectrom ; 35(3): 542-550, 2024 Mar 06.
Article in English | MEDLINE | ID: mdl-38310603

ABSTRACT

Automation is dramatically changing the nature of laboratory life science. Robotic lab hardware that can perform manual operations with greater speed, endurance, and reproducibility opens an avenue for faster scientific discovery with less time spent on laborious repetitive tasks. A major bottleneck remains in integrating cutting-edge laboratory equipment into automated workflows, notably specialized analytical equipment, which is designed for human usage. Here we present AutonoMS, a platform for automatically running, processing, and analyzing high-throughput mass spectrometry experiments. AutonoMS is currently written around an ion mobility mass spectrometry (IM-MS) platform and can be adapted to additional analytical instruments and data processing flows. AutonoMS enables automated software agent-controlled end-to-end measurement and analysis runs from experimental specification files that can be produced by human users or upstream software processes. We demonstrate the use and abilities of AutonoMS in a high-throughput flow-injection ion mobility configuration with 5 s sample analysis time, processing robotically prepared chemical standards and cultured yeast samples in targeted and untargeted metabolomics applications. The platform exhibited consistency, reliability, and ease of use while eliminating the need for human intervention in the process of sample injection, data processing, and analysis. The platform paves the way toward a more fully automated mass spectrometry analysis and ultimately closed-loop laboratory workflows involving automated experimentation and analysis coupled to AI-driven experimentation utilizing cutting-edge analytical instrumentation. AutonoMS documentation is available at https://autonoms.readthedocs.io.


Subject(s)
Metabolomics, Software, Humans, Reproducibility of Results, Mass Spectrometry, Automation
10.
Bioinform Adv ; 3(1): vbad102, 2023.
Article in English | MEDLINE | ID: mdl-37600845

ABSTRACT

Summary: Artificial intelligence (AI)-driven laboratory automation, combining robotic labware and autonomous software agents, is a powerful trend in modern biology. We developed Genesis-DB, a database system designed to support AI-driven autonomous laboratories by providing software agents access to large quantities of structured domain information. In addition, we present a new ontology for modeling data and metadata from autonomously performed yeast microchemostat cultivations in the framework of the Genesis robot scientist system. We show an example of how Genesis-DB enables the research life cycle by modeling yeast gene regulation, guiding future hypothesis generation and design of experiments. Genesis-DB supports AI-driven discovery through automated reasoning and its design is portable, generic, and easily extensible to other AI-driven molecular biology laboratory data and beyond. Availability and implementation: Genesis-DB code and installation instructions are available at the GitHub repository https://github.com/TW-Genesis/genesis-database-system.git. The database use case demo code and data are also available through GitHub (https://github.com/TW-Genesis/genesis-database-demo.git). The ontology can be downloaded here: https://github.com/TW-Genesis/genesis-ontology/releases/download/v0.0.23/genesis.owl. The ontology term descriptions (including mappings to existing ontologies) and maintenance standard operating procedures can be found at: https://github.com/TW-Genesis/genesis-ontology.

11.
NPJ Syst Biol Appl ; 9(1): 11, 2023 04 07.
Article in English | MEDLINE | ID: mdl-37029131

ABSTRACT

Saccharomyces cerevisiae is a very well studied organism, yet ∼20% of its proteins remain poorly characterized. Moreover, recent studies seem to indicate that the pace of functional discovery is slow. Previous work has implied that the most probable path forward is via not only automation but fully autonomous systems in which active learning is applied to guide high-throughput experimentation. Development of tools and methods for these types of systems is of paramount importance. In this study we use constrained dynamical flux balance analysis (dFBA) to select ten regulatory deletant strains that are likely to have previously unexplored connections to the diauxic shift. We then analyzed these deletant strains using untargeted metabolomics, generating profiles which were subsequently investigated to better understand the consequences of the gene deletions in the metabolic reconfiguration of the diauxic shift. We show that metabolic profiles can be utilised not only to gain insight into cellular transformations such as the diauxic shift, but also into the regulatory roles and biological consequences of regulatory gene deletion. We also conclude that untargeted metabolomics is a useful tool for guidance in high-throughput model improvement, and is a fast, sensitive and informative approach appropriate for future large-scale functional analyses of genes. Moreover, it is well-suited for automated approaches due to the relative simplicity of processing and the potential to be made massively high-throughput.


Subject(s)
Saccharomyces cerevisiae Proteins, Saccharomyces cerevisiae, Saccharomyces cerevisiae/genetics, Saccharomyces cerevisiae/metabolism, Saccharomyces cerevisiae Proteins/genetics, Saccharomyces cerevisiae Proteins/metabolism, Metabolomics/methods
12.
J R Soc Interface ; 19(189): 20210821, 2022 04.
Article in English | MEDLINE | ID: mdl-35382578

ABSTRACT

Scientific results should not just be 'repeatable' (replicable in the same laboratory under identical conditions), but also 'reproducible' (replicable in other laboratories under similar conditions). Results should also, if possible, be 'robust' (replicable under a wide range of conditions). The reproducibility and robustness of only a small fraction of published biomedical results has been tested; furthermore, when reproducibility is tested, it is often not found. This situation is termed 'the reproducibility crisis', and it is one of the most important issues facing biomedicine. This crisis would be solved if it were possible to automate reproducibility testing. Here, we describe the semi-automated testing for reproducibility and robustness of simple statements (propositions) about cancer cell biology automatically extracted from the literature. From 12 260 papers, we automatically extracted statements predicted to describe experimental results regarding a change of gene expression in response to drug treatment in breast cancer; from these we selected 74 statements of high biomedical interest. To test the reproducibility of these statements, two different teams used the laboratory automation system Eve and two breast cancer cell lines (MCF7 and MDA-MB-231). Statistically significant evidence for repeatability was found for 43 statements, and significant evidence for reproducibility/robustness in 22 statements. In two cases, the automation made serendipitous discoveries. The reproduced/robust knowledge provides significant insight into cancer. We conclude that semi-automated reproducibility testing is currently achievable, that it could be scaled up to generate a substantive source of reliable knowledge and that automation has the potential to mitigate the reproducibility crisis.


Subject(s)
Breast Neoplasms, Robotics, Automation, Biology, Female, Humans, Reproducibility of Results
13.
R Soc Open Sci ; 9(5): 211745, 2022 May.
Article in English | MEDLINE | ID: mdl-35573039

ABSTRACT

The representation of the protein-ligand complexes used in building machine learning models plays an important role in the accuracy of binding affinity prediction. The Extended Connectivity Interaction Features (ECIF) is one such representation. We report that (i) including the discretized distances between protein-ligand atom pairs in the ECIF scheme improves predictive accuracy, and (ii) in an evaluation using gradient boosted trees, we found that the resampling method used in selecting the best hyperparameters has a strong effect on predictive performance, especially for benchmarking purposes.
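
The idea of adding discretized distances to atom-pair counts can be sketched as follows. This is a simplified illustration, not the ECIF implementation: real ECIF atom types encode much more than the element symbol, and the bin edges, coordinates and the function name `binned_pair_counts` are assumptions made for the example.

```python
import math
from collections import Counter

def binned_pair_counts(protein_atoms, ligand_atoms, bin_edges):
    """Count protein-ligand atom-type pairs per distance bin.

    Atoms are (atom_type, (x, y, z)) tuples; bin_edges are ascending upper
    bounds in angstroms. Pairs beyond the last edge are ignored.
    """
    counts = Counter()
    for p_type, p_xyz in protein_atoms:
        for l_type, l_xyz in ligand_atoms:
            d = math.dist(p_xyz, l_xyz)
            for i, edge in enumerate(bin_edges):
                if d <= edge:
                    # The bin index becomes part of the feature key.
                    counts[(p_type, l_type, i)] += 1
                    break
    return counts

# Hypothetical coordinates and bin edges, chosen only for illustration.
protein = [("C", (0.0, 0.0, 0.0)), ("N", (5.5, 0.0, 0.0))]
ligand = [("O", (1.0, 0.0, 0.0)), ("C", (0.0, 3.0, 0.0))]
features = binned_pair_counts(protein, ligand, bin_edges=[2.0, 4.0, 6.0])
print(dict(features))
```

Without the bin index in the key, every pair within the cutoff would fall into a single count per atom-type pair; with it, each pair type contributes one count per distance shell.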

14.
Plant Physiol ; 153(4): 1506-20, 2010 Aug.
Article in English | MEDLINE | ID: mdl-20566707

ABSTRACT

Metabolite fingerprinting of Arabidopsis (Arabidopsis thaliana) mutants with known or predicted metabolic lesions was performed by ¹H-nuclear magnetic resonance, Fourier transform infrared, and flow injection electrospray-mass spectrometry. Fingerprinting enabled processing of five times more plants than conventional chromatographic profiling and was competitive for discriminating mutants, other than those affected in only low-abundance metabolites. Despite their rapidity and complexity, fingerprints yielded metabolomic insights (e.g. that effects of single lesions were usually not confined to individual pathways). Among fingerprint techniques, ¹H-nuclear magnetic resonance discriminated the most mutant phenotypes from the wild type and Fourier transform infrared discriminated the fewest. To maximize information from fingerprints, data analysis was crucial. One-third of distinctive phenotypes might have been overlooked had data models been confined to principal component analysis score plots. Among several methods tested, machine learning (ML) algorithms, namely support vector machine or random forest (RF) classifiers, were unsurpassed for phenotype discrimination. Support vector machines were often the best performing classifiers, but RFs yielded some particularly informative measures. First, RFs estimated margins between mutant phenotypes, whose relations could then be visualized by Sammon mapping or hierarchical clustering. Second, RFs provided importance scores for the features within fingerprints that discriminated mutants. These scores correlated with analysis of variance F values (as did Kruskal-Wallis tests, true- and false-positive measures, mutual information, and the Relief feature selection algorithm). ML classifiers, as models trained on one data set to predict another, were ideal for focused metabolomic queries, such as the distinctiveness and consistency of mutant phenotypes. Accessible software for use of ML in plant physiology is highlighted.
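
The analysis of variance F value mentioned above as a per-feature measure of discrimination can be computed as in this sketch. The intensities are invented for illustration, and `anova_f` is a hypothetical helper, not software from the paper.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for one feature measured in several groups."""
    k = len(groups)                       # number of groups (e.g. genotypes)
    n = sum(len(g) for g in groups)       # total number of samples
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of samples around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical intensities of one fingerprint feature in two genotypes.
wild_type = [1.0, 2.0, 3.0]
mutant = [4.0, 5.0, 6.0]
print(anova_f([wild_type, mutant]))      # large F: feature separates the groups
print(anova_f([wild_type, wild_type]))   # F of 0: feature carries no group signal
```

Ranking features by this statistic gives a simple univariate baseline against which classifier-derived importance scores can be compared.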


Subject(s)
Arabidopsis/metabolism, Artificial Intelligence, Metabolomics, Algorithms, Cluster Analysis, Magnetic Resonance Spectroscopy, Mass Spectrometry, Phenotype, Principal Component Analysis, Fourier Transform Infrared Spectroscopy
15.
mSystems ; 6(6): e0108721, 2021 Dec 21.
Article in English | MEDLINE | ID: mdl-34812651

ABSTRACT

The ongoing COVID-19 pandemic urges searches for antiviral agents that can block infection or ameliorate its symptoms. Using dissimilar search strategies for new antivirals will improve our overall chances of finding effective treatments. Here, we have established an experimental platform for screening of small molecule inhibitors of the SARS-CoV-2 main protease in Saccharomyces cerevisiae cells, genetically engineered to enhance cellular uptake of small molecules in the environment. The system consists of a fusion of the Escherichia coli toxin MazF and its antitoxin MazE, with insertion of a protease cleavage site in the linker peptide connecting the MazE and MazF moieties. Expression of the viral protease confers cleavage of the MazEF fusion, releasing the MazF toxin from its antitoxin, resulting in growth inhibition. In the presence of a small molecule inhibiting the protease, cleavage is blocked and the MazF toxin remains inhibited, promoting growth. The system thus allows positive selection for inhibitors. The engineered yeast strain is tagged with a fluorescent marker protein, allowing precise monitoring of its growth in the presence or absence of inhibitor. We detect an established main protease inhibitor by a robust growth increase, discernible down to 1 µM. The system is suitable for robotized large-scale screens. It allows in vivo evaluation of drug candidates and is rapidly adaptable for new variants of the protease with deviant site specificities. IMPORTANCE The COVID-19 pandemic may continue for several years before vaccination campaigns can put an end to it globally. Thus, the need for discovery of new antiviral drug candidates will remain. We have engineered a system in yeast cells for the detection of small molecule inhibitors of one attractive drug target of SARS-CoV-2, its main protease, which is required for viral replication. 
The ability to detect inhibitors in live cells brings the advantage that only compounds capable of entering the cell and remaining stable there will score in the system. Moreover, because of its design in yeast cells, the system is rapidly adaptable for tuning the detection level and eventual modification of the protease cleavage site in the case of future mutant variants of the SARS-CoV-2 main protease or even for other proteases.
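
The selection logic of the MazEF reporter described in this abstract can be written out as a small truth table. This is a qualitative sketch only: it treats cleavage and inhibition as all-or-nothing, and the function name `yeast_grows` is hypothetical.

```python
def yeast_grows(protease_expressed, inhibitor_present):
    """Qualitative outcome of the MazEF reporter (all-or-nothing sketch)."""
    # The linker is cleaved only by an active, uninhibited protease.
    cleaved = protease_expressed and not inhibitor_present
    # Cleavage separates MazF from its antitoxin MazE, freeing the toxin.
    mazf_free = cleaved
    # Free MazF inhibits growth; otherwise the cells grow.
    return not mazf_free

for protease in (False, True):
    for inhibitor in (False, True):
        print(protease, inhibitor, yeast_grows(protease, inhibitor))
```

Only the combination of expressed protease and no inhibitor suppresses growth, which is why an effective inhibitor shows up as a growth increase (positive selection).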

16.
Bioinformatics ; 25(16): 2020-7, 2009 Aug 15.
Article in English | MEDLINE | ID: mdl-19535531

ABSTRACT

MOTIVATION: Distribution analysis is one of the most basic forms of statistical analysis. Thanks to improved analytical methods, accurate and extensive quantitative measurements can now be made of the mRNA, protein and metabolite from biological systems. Here, we report a large-scale analysis of the population abundance distributions of the transcriptomes, proteomes and metabolomes from varied biological systems. RESULTS: We compared the observed empirical distributions with a number of distributions: power law, lognormal, loglogistic, loggamma, right Pareto-lognormal (PLN) and double PLN (dPLN). The best-fit for mRNA, protein and metabolite population abundance distributions was found to be the dPLN. This distribution behaves like a lognormal distribution around the centre, and like a power law distribution in the tails. To better understand the cause of this observed distribution, we explored a simple stochastic model based on geometric Brownian motion. The distribution indicates that multiplicative effects are causally dominant in biological systems. We speculate that these effects arise from chemical reactions: the central-limit theorem then explains the central lognormal, and a number of possible mechanisms could explain the long tails: positive feedback, network topology, etc. Many of the components in the central lognormal parts of the empirical distributions are unidentified and/or have unknown function. This indicates that much more biology awaits discovery.
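
The link between multiplicative effects and a lognormal centre can be illustrated with a short simulation. It is a sketch under stated assumptions: a discrete-time version of geometric Brownian motion with no drift in log space, a fixed number of steps, and arbitrary values for `n_species`, `n_steps` and `sigma`. It reproduces only the lognormal body; the power-law tails of the dPLN are not modelled here.

```python
import math
import random
import statistics

def simulate_abundances(n_species=2000, n_steps=50, sigma=0.1, seed=0):
    """Grow each abundance by a sequence of random multiplicative shocks."""
    rng = random.Random(seed)
    abundances = []
    for _ in range(n_species):
        log_x = 0.0
        for _ in range(n_steps):
            # A multiplicative shock is an additive step in log space.
            log_x += rng.gauss(0.0, sigma)
        abundances.append(math.exp(log_x))
    return abundances

abundances = simulate_abundances()
logs = [math.log(x) for x in abundances]

# Raw abundances are right-skewed, so the mean exceeds the median.
print(statistics.mean(abundances), statistics.median(abundances))
# Log abundances are roughly normal with variance n_steps * sigma**2 = 0.5.
print(statistics.mean(logs), statistics.pvariance(logs))
```

Because the log of a product is a sum of logs, the central-limit theorem applies to the log abundances, which is the argument the abstract gives for the lognormal centre.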


Subject(s)
Metabolome, Proteins/metabolism, Messenger RNA/metabolism, Computer Simulation, Biological Models, Proteins/chemistry
17.
Nature ; 427(6971): 247-52, 2004 Jan 15.
Article in English | MEDLINE | ID: mdl-14724639

ABSTRACT

The question of whether it is possible to automate the scientific process is of both great theoretical interest and increasing practical importance because, in many scientific areas, data are being generated much faster than they can be effectively analysed. We describe a physically implemented robotic system that applies techniques from artificial intelligence to carry out cycles of scientific experimentation. The system automatically originates hypotheses to explain observations, devises experiments to test these hypotheses, physically runs the experiments using a laboratory robot, interprets the results to falsify hypotheses inconsistent with the data, and then repeats the cycle. Here we apply the system to the determination of gene function using deletion mutants of yeast (Saccharomyces cerevisiae) and auxotrophic growth experiments. We built and tested a detailed logical model (involving genes, proteins and metabolites) of the aromatic amino acid synthesis pathway. In biological experiments that automatically reconstruct parts of this model, we show that an intelligent experiment selection strategy is competitive with human performance and significantly outperforms, with a cost decrease of 3-fold and 100-fold (respectively), both cheapest and random-experiment selection.
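
The cycle of selecting an experiment, running it, and discarding falsified hypotheses can be sketched as a toy loop. Everything here is an assumption made for illustration: the hypotheses and costs are invented, the selection score (hypotheses guaranteed to be eliminated per unit cost) is a simple heuristic rather than the strategy used in the paper, and hypothesis generation is skipped by starting from a fixed candidate set.

```python
def run_cycles(hypotheses, experiments, oracle):
    """Repeat: choose an experiment, observe, discard refuted hypotheses.

    hypotheses: dict name -> dict experiment -> predicted outcome (bool)
    experiments: dict experiment -> cost
    oracle: function experiment -> observed outcome (stands in for the robot)
    """
    alive = dict(hypotheses)
    remaining = dict(experiments)
    total_cost = 0.0
    while len(alive) > 1 and remaining:
        def score(exp):
            yes = sum(1 for h in alive.values() if h[exp])
            # Hypotheses eliminated whatever the outcome, per unit cost.
            return min(yes, len(alive) - yes) / remaining[exp]
        best = max(remaining, key=score)
        if score(best) == 0:
            break  # no remaining experiment can separate the hypotheses
        observed = oracle(best)
        total_cost += remaining.pop(best)
        # Falsification: keep only hypotheses consistent with the observation.
        alive = {n: h for n, h in alive.items() if h[best] == observed}
    return alive, total_cost

# Hypothetical candidate hypotheses and their predicted experiment outcomes.
hypotheses = {
    "h1": {"e1": True, "e2": True, "e3": False},
    "h2": {"e1": True, "e2": False, "e3": False},
    "h3": {"e1": False, "e2": True, "e3": True},
    "h4": {"e1": False, "e2": False, "e3": True},
}
experiments = {"e1": 1.0, "e2": 1.0, "e3": 4.0}
truth = "h3"  # the oracle answers as if h3 were correct
survivors, cost = run_cycles(hypotheses, experiments, lambda e: hypotheses[truth][e])
print(list(survivors), cost)
```

In this toy run the expensive experiment `e3` is never needed, which mirrors the abstract's point that the choice of experiments affects the cost of reaching a conclusion.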


Subject(s)
Genomics/instrumentation, Genomics/methods, Biological Models, Research Design, Research Personnel/statistics & numerical data, Research/instrumentation, Robotics/methods, Algorithms, Amino Acids/biosynthesis, Computational Biology, Computer Simulation, Cost-Benefit Analysis, Efficiency, Gene Deletion, Fungal Genes/genetics, Humans, Learning, Open Reading Frames, Phenotype, Probability, Research Personnel/standards, Robotics/instrumentation, Saccharomyces cerevisiae/genetics, Saccharomyces cerevisiae/metabolism, Saccharomyces cerevisiae Proteins/genetics, Saccharomyces cerevisiae Proteins/metabolism, Software, Time Factors, Workforce
18.
Mach Learn ; 109(2): 251-277, 2020.
Article in English | MEDLINE | ID: mdl-32174648

ABSTRACT

In phenotype prediction the physical characteristics of an organism are predicted from knowledge of its genotype and environment. Such studies, often called genome-wide association studies, are of the highest societal importance, as they are of central importance to medicine, crop-breeding, etc. We investigated three phenotype prediction problems: one simple and clean (yeast), and the other two complex and real-world (rice and wheat). We compared standard machine learning methods; elastic net, ridge regression, lasso regression, random forest, gradient boosting machines (GBM), and support vector machines (SVM), with two state-of-the-art classical statistical genetics methods; genomic BLUP and a two-step sequential method based on linear regression. Additionally, using the clean yeast data, we investigated how performance varied with the complexity of the biological mechanism, the amount of observational noise, the number of examples, the amount of missing data, and the use of different data representations. We found that for almost all the phenotypes considered, standard machine learning methods outperformed the methods from classical statistical genetics. On the yeast problem, the most successful method was GBM, followed by lasso regression, and the two statistical genetics methods; with greater mechanistic complexity GBM was best, while in simpler cases lasso was superior. In the wheat and rice studies the best two methods were SVM and BLUP. The most robust method in the presence of noise, missing data, etc. was random forests. The classical statistical genetics method of genomic BLUP was found to perform well on problems where there was population structure. This suggests that standard machine learning methods need to be refined to include population structure information when this is present. 
We conclude that the application of machine learning methods to phenotype prediction problems holds great promise, but that determining which methods are likely to perform well on any given problem is elusive and non-trivial.

19.
Bioinformatics ; 24(13): i295-303, 2008 Jul 01.
Article in English | MEDLINE | ID: mdl-18586727

ABSTRACT

MOTIVATION: Many published manuscripts contain experiment protocols which are poorly described or deficient in information. This means that the published results are very hard or impossible to repeat. This problem is being made worse by the increasing complexity of high-throughput/automated methods. There is therefore a growing need to represent experiment protocols in an efficient and unambiguous way. RESULTS: We have developed the Experiment ACTions (EXACT) ontology as the basis of a method of representing biological laboratory protocols. We provide example protocols that have been formalized using EXACT, and demonstrate the advantages and opportunities created by using this formalization. We argue that the use of EXACT will result in the publication of protocols with increased clarity and usefulness to the scientific community. AVAILABILITY: The ontology, examples and code can be downloaded from http://www.aber.ac.uk/compsci/Research/bio/dss/EXACT/.


Subject(s)
Database Management Systems, Factual Databases, Documentation/methods, Information Storage and Retrieval/methods, Internet, Research/classification, Research/standards
20.
J Cheminform ; 11(1): 68, 2019 Nov 12.
Article in English | MEDLINE | ID: mdl-33430958

ABSTRACT

The goal of quantitative structure activity relationship (QSAR) learning is to learn a function that, given the structure of a small molecule (a potential drug), outputs the predicted activity of the compound. We employed multi-task learning (MTL) to exploit commonalities in drug targets and assays. We used datasets containing curated records about the activity of specific compounds on drug targets provided by ChEMBL. In total, 1091 assays were analysed. As a baseline, a single task learning approach that trains random forest to predict drug activity for each drug target individually was considered. We then carried out feature-based and instance-based MTL to predict drug activities. We introduced a natural metric of evolutionary distance between drug targets as a measure of task relatedness. Instance-based MTL significantly outperformed both feature-based MTL and the base learner on 741 drug targets out of 1091. Feature-based MTL won on 179 occasions and the base learner performed best on 171 drug targets. We conclude that MTL QSAR is improved by incorporating the evolutionary distance between targets. These results indicate that QSAR learning can be performed effectively, even if little data is available for specific drug targets, by leveraging what is known about similar drug targets.
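
The win counts reported in this abstract can be checked and converted to shares with a few lines of arithmetic. The counts come from the abstract; the dictionary layout and rounding to one decimal place are choices made for this sketch.

```python
# Win counts per method, as reported in the abstract.
wins = {"instance-based MTL": 741, "feature-based MTL": 179, "base learner": 171}

total = sum(wins.values())  # should match the 1091 drug targets analysed
shares = {name: round(100 * count / total, 1) for name, count in wins.items()}
print(total, shares)
```

The three counts sum to 1091, and instance-based MTL accounts for roughly two thirds of the targets.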
