Results 1 - 20 of 127
1.
Bioinformatics ; 40(Suppl 1): i401-i409, 2024 06 28.
Article in English | MEDLINE | ID: mdl-38940168

ABSTRACT

Automated protein function prediction is a crucial and widely studied problem in bioinformatics. Computationally, protein function is a multilabel classification problem where only positive samples are defined and there is a large number of unlabeled annotations. Most existing methods rely on the assumption that the unlabeled set of protein function annotations are negatives, inducing the false negative issue, where potential positive samples are trained as negatives. We introduce a novel approach named PU-GO, wherein we address function prediction as a positive-unlabeled ranking problem. We apply empirical risk minimization, i.e. we minimize the classification risk of a classifier where class priors are obtained from the Gene Ontology hierarchical structure. We show that our approach is more robust than other state-of-the-art methods on similarity-based and time-based benchmark datasets. AVAILABILITY AND IMPLEMENTATION: Data and code are available at https://github.com/bio-ontology-research-group/PU-GO.


Subject(s)
Computational Biology , Gene Ontology , Proteins , Proteins/chemistry , Proteins/metabolism , Computational Biology/methods , Databases, Protein , Algorithms
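
The positive-unlabeled risk minimization mentioned in the abstract above can be illustrated with a minimal sketch. This is the generic non-negative PU risk estimator with a sigmoid surrogate loss, not necessarily the exact loss used by PU-GO; the function names and the flat `prior` argument are assumptions made for illustration (the abstract states that class priors are derived from the Gene Ontology hierarchy).

```python
import math


def sigmoid_loss(score, label):
    """Sigmoid surrogate loss; near 0 when sign(score) agrees with label (+1 or -1)."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values):
    return sum(values) / len(values)


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk for one class (hypothetical helper, not the PU-GO API).

    pos_scores: classifier scores for samples known to be positive
    unl_scores: classifier scores for unlabeled samples
    prior:      assumed fraction of positives among all samples
    """
    pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])
    # Unlabeled data mixes positives and negatives, so the positive share
    # is subtracted; the estimate is clamped at zero to stay a valid risk.
    negative_part = unl_as_neg - prior * pos_as_neg
    return prior * pos_as_pos + max(0.0, negative_part)
```

Treating every unlabeled sample as a negative would correspond to dropping the `prior * pos_as_neg` correction, which is the false negative issue the abstract describes.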
2.
Bioinformatics ; 40(5)2024 05 02.
Article in English | MEDLINE | ID: mdl-38696757

ABSTRACT

MOTIVATION: Whole-exome and genome sequencing have become common tools in diagnosing patients with rare diseases. Despite their success, this approach leaves many patients undiagnosed. A common argument is that more disease variants still await discovery, or the novelty of disease phenotypes results from a combination of variants in multiple disease-related genes. Interpreting the phenotypic consequences of genomic variants relies on information about gene functions, gene expression, physiology, and other genomic features. Phenotype-based methods to identify variants involved in genetic diseases combine molecular features with prior knowledge about the phenotypic consequences of altering gene functions. While phenotype-based methods have been successfully applied to prioritizing variants, such methods are based on known gene-disease or gene-phenotype associations as training data and are applicable to genes that have phenotypes associated, thereby limiting their scope. In addition, phenotypes are not assigned uniformly by different clinicians, and phenotype-based methods need to account for this variability. RESULTS: We developed an Embedding-based Phenotype Variant Predictor (EmbedPVP), a computational method to prioritize variants involved in genetic diseases by combining genomic information and clinical phenotypes. EmbedPVP leverages a large amount of background knowledge from human and model organisms about molecular mechanisms through which abnormal phenotypes may arise. Specifically, EmbedPVP incorporates phenotypes linked to genes, functions of gene products, and the anatomical site of gene expression, and systematically relates them to their phenotypic effects through neuro-symbolic, knowledge-enhanced machine learning. We demonstrate EmbedPVP's efficacy on a large set of synthetic genomes and genomes matched with clinical information. 
AVAILABILITY AND IMPLEMENTATION: EmbedPVP and all evaluation experiments are freely available at https://github.com/bio-ontology-research-group/EmbedPVP.


Subject(s)
Genomics , Humans , Genomics/methods , Phenotype , Genetic Variation , Computational Biology/methods , Machine Learning
3.
Bioinformatics ; 2024 Oct 26.
Article in English | MEDLINE | ID: mdl-39460944

ABSTRACT

MOTIVATION: Identifying causal relations between diseases allows for the study of shared pathways, biological mechanisms, and inter-disease risks. Such causal relations can facilitate the identification of potential disease precursors and candidates for drug re-purposing. However, computational methods often lack access to these causal relations. Few approaches have been developed to automatically extract causal relationships between diseases from unstructured text, but they are often only focused on a small number of diseases, lack validation of the extracted causal relations, or do not make their data available. RESULTS: We automatically mined statements asserting a causal relation between diseases from the scientific literature by leveraging lexical patterns. Following automated mining of causal relations, we mapped the diseases to the International Classification of Diseases (ICD) identifiers to allow the direct application to clinical data. We provide quantitative and qualitative measures to evaluate the mined causal relations and compare to UK Biobank (UKB) diagnosis data as a completely independent data source. The validated causal associations were used to create a directed acyclic graph that can be used by causal inference frameworks. We demonstrate the utility of our causal network by performing causal inference using the do-calculus, using relations within the graph to construct and improve polygenic risk scores, and disentangle the pleiotropic effects of variants. AVAILABILITY AND IMPLEMENTATION: The data is available through https://github.com/bio-ontology-research-group/causal-relations-between-diseases. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
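
Building a directed acyclic graph from mined cause-effect pairs requires checking that the relations contain no cycle. A minimal sketch of such a check follows, using Kahn's algorithm; the function name and the placeholder node labels are assumptions for illustration and do not come from the authors' repository.

```python
from collections import defaultdict, deque


def topological_order(edges):
    """Return the nodes in a cause-before-effect order, or None if a cycle exists.

    edges: iterable of (cause, effect) pairs, e.g. pairs of disease codes.
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for cause, effect in edges:
        successors[cause].append(effect)
        indegree[effect] += 1
        nodes.update((cause, effect))

    # Start from nodes that nothing points to (sorted for a stable result).
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # If a cycle exists, its nodes never reach indegree zero.
    return order if len(order) == len(nodes) else None
```

A return value of `None` signals that at least one relation would have to be dropped or reversed before the graph can be passed to a causal inference framework.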

4.
Hum Genomics ; 18(1): 44, 2024 Apr 29.
Article in English | MEDLINE | ID: mdl-38685113

ABSTRACT

BACKGROUND: A major obstacle faced by families with rare diseases is obtaining a genetic diagnosis. The average "diagnostic odyssey" lasts over five years and causal variants are identified in under 50%, even when capturing variants genome-wide. To aid in the interpretation and prioritization of the vast number of variants detected, computational methods are proliferating. Knowing which tools are most effective remains unclear. To evaluate the performance of computational methods, and to encourage innovation in method development, we designed a Critical Assessment of Genome Interpretation (CAGI) community challenge to place variant prioritization models head-to-head in a real-life clinical diagnostic setting. METHODS: We utilized genome sequencing (GS) data from families sequenced in the Rare Genomes Project (RGP), a direct-to-participant research study on the utility of GS for rare disease diagnosis and gene discovery. Challenge predictors were provided with a dataset of variant calls and phenotype terms from 175 RGP individuals (65 families), including 35 solved training set families with causal variants specified, and 30 unlabeled test set families (14 solved, 16 unsolved). We tasked teams to identify causal variants in as many families as possible. Predictors submitted variant predictions with estimated probability of causal relationship (EPCR) values. Model performance was determined by two metrics, a weighted score based on the rank position of causal variants, and the maximum F-measure, based on precision and recall of causal variants across all EPCR values. RESULTS: Sixteen teams submitted predictions from 52 models, some with manual review incorporated. Top performers recalled causal variants in up to 13 of 14 solved families within the top 5 ranked variants. 
Newly discovered diagnostic variants were returned to two previously unsolved families following confirmatory RNA sequencing, and two novel disease gene candidates were entered into Matchmaker Exchange. In one example, RNA sequencing demonstrated aberrant splicing due to a deep intronic indel in ASNS, identified in trans with a frameshift variant in an unsolved proband with phenotypes consistent with asparagine synthetase deficiency. CONCLUSIONS: Model methodology and performance was highly variable. Models weighing call quality, allele frequency, predicted deleteriousness, segregation, and phenotype were effective in identifying causal variants, and models open to phenotype expansion and non-coding variants were able to capture more difficult diagnoses and discover new diagnoses. Overall, computational models can significantly aid variant prioritization. For use in diagnostics, detailed review and conservative assessment of prioritized variants against established criteria is needed.


Subject(s)
Rare Diseases , Humans , Rare Diseases/genetics , Rare Diseases/diagnosis , Genome, Human/genetics , Genetic Variation/genetics , Computational Biology/methods , Phenotype
5.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36534832

ABSTRACT

MOTIVATION: Ontologies contain formal and structured information about a domain and are widely used in bioinformatics for annotation and integration of data. Several methods use ontologies to provide background knowledge in machine learning tasks, which is of particular importance in bioinformatics. These methods rely on a set of common primitives that are not readily available in a software library; a library providing these primitives would facilitate the use of current machine learning methods with ontologies and the development of novel methods for other ontology-based biomedical applications. RESULTS: We developed mOWL, a Python library for machine learning with ontologies formalized in the Web Ontology Language (OWL). mOWL implements ontology embedding methods that map information contained in formal knowledge bases and ontologies into vector spaces while preserving some of the properties and relations in ontologies, as well as methods to use these embeddings for similarity computation, deductive inference and zero-shot learning. We demonstrate mOWL on the knowledge-based prediction of protein-protein interactions using the gene ontology and gene-disease associations using phenotype ontologies. AVAILABILITY AND IMPLEMENTATION: mOWL is freely available on https://github.com/bio-ontology-research-group/mowl and as a Python package in PyPi. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Biological Ontologies , Gene Ontology , Knowledge Bases , Software , Machine Learning
6.
BMC Bioinformatics ; 24(1): 294, 2023 Jul 21.
Article in English | MEDLINE | ID: mdl-37479972

ABSTRACT

BACKGROUND: Identifying variants associated with diseases is a challenging task in medical genetics research. Current studies that prioritize variants within individual genomes generally rely on known variants, evidence from literature and genomes, and patient symptoms and clinical signs. The functionalities of the existing tools, which rank variants based on given patient symptoms and clinical signs, are restricted to the coverage of ontologies such as the Human Phenotype Ontology (HPO). However, most clinicians do not limit themselves to HPO while describing patient symptoms/signs and their associated variants/genes. There is thus a need for an automated tool that can prioritize variants based on freely expressed patient symptoms and clinical signs. RESULTS: STARVar is a Symptom-based Tool for Automatic Ranking of Variants using evidence from literature and genomes. STARVar uses patient symptoms and clinical signs, either linked to HPO or expressed in free text format. It returns a ranked list of variants based on a combined score from two classifiers utilizing evidence from genomics and literature. STARVar improves over related tools on a set of synthetic patients. In addition, we demonstrated its distinct contribution to the domain on another synthetic dataset covering publicly available clinical genotype-phenotype associations by using symptoms and clinical signs expressed in free text format. CONCLUSIONS: STARVar stands as a unique and efficient tool that has the advantage of ranking variants with flexibly expressed patient symptoms in free-form text. Therefore, STARVar can be easily integrated into bioinformatics workflows designed to analyze disease-associated genomes. AVAILABILITY: STARVar is freely available from https://github.com/bio-ontology-research-group/STARVar .


Subject(s)
Genomics , Software , Humans , Phenotype , Computational Biology , Genetic Association Studies
7.
Brief Bioinform ; 22(4)2021 07 20.
Article in English | MEDLINE | ID: mdl-33049044

ABSTRACT

Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge, and they are employed in almost every major biological database. Recently, ontologies have increasingly been used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview of the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in ontologies and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies.


Subject(s)
Biological Ontologies , Machine Learning , Models, Biological , Semantics
8.
Bioinformatics ; 38(Suppl 1): i238-i245, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758802

ABSTRACT

MOTIVATION: Protein functions are often described using the Gene Ontology (GO) which is an ontology consisting of over 50 000 classes and a large set of formal axioms. Predicting the functions of proteins is one of the key challenges in computational biology and a variety of machine learning methods have been developed for this purpose. However, these methods usually require a significant amount of training data and cannot make predictions for GO classes that have only few or no experimental annotations. RESULTS: We developed DeepGOZero, a machine learning model which improves predictions for functions with no or only a small number of annotations. To achieve this goal, we rely on a model-theoretic approach for learning ontology embeddings and combine it with neural networks for protein function prediction. DeepGOZero can exploit formal axioms in the GO to make zero-shot predictions, i.e., predict protein functions even if not a single protein in the training phase was associated with that function. Furthermore, the zero-shot prediction method employed by DeepGOZero is generic and can be applied whenever associations with ontology classes need to be predicted. AVAILABILITY AND IMPLEMENTATION: http://github.com/bio-ontology-research-group/deepgozero. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology , Proteins , Computational Biology/methods , Gene Ontology , Machine Learning , Proteins/metabolism
9.
Bioinformatics ; 38(6): 1677-1684, 2022 03 04.
Article in English | MEDLINE | ID: mdl-34951628

ABSTRACT

MOTIVATION: Structural genomic variants account for much of human variability and are involved in several diseases. Structural variants are complex and may affect coding regions of multiple genes, or affect the functions of genomic regions in different ways from single nucleotide variants. Interpreting the phenotypic consequences of structural variants relies on information about gene functions, haploinsufficiency or triplosensitivity and other genomic features. Phenotype-based methods for identifying variants that are involved in genetic diseases combine molecular features with prior knowledge about the phenotypic consequences of altering gene functions. While phenotype-based methods have been applied successfully to single nucleotide variants as well as short insertions and deletions, the complexity of structural variants makes it more challenging to link them to phenotypes. Furthermore, structural variants can affect a large number of coding regions, and phenotype information may not be available for all of them. RESULTS: We developed DeepSVP, a computational method to prioritize structural variants involved in genetic diseases by combining genomic and gene function information. We incorporate phenotypes linked to genes, functions of gene products, gene expression in individual cell types and anatomical sites of expression, and systematically relate them to their phenotypic consequences through ontologies and machine learning. DeepSVP significantly improves the success rate of finding causative variants in several benchmarks and can identify novel pathogenic structural variants in consanguineous families. AVAILABILITY AND IMPLEMENTATION: https://github.com/bio-ontology-research-group/DeepSVP. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Deep Learning , Humans , Genotype , Phenotype , Genomics , Nucleotides
10.
Nucleic Acids Res ; 49(W1): W140-W146, 2021 07 02.
Article in English | MEDLINE | ID: mdl-34019664

ABSTRACT

Understanding the functions of proteins is crucial to understand biological processes on a molecular level. Many more protein sequences are available than can be investigated experimentally. DeepGOPlus is a protein function prediction method based on deep learning and sequence similarity. DeepGOWeb makes the prediction model available through a website, an API, and through the SPARQL query language for interoperability with databases that rely on Semantic Web technologies. DeepGOWeb provides accurate and fast predictions and ensures that predicted functions are consistent with the Gene Ontology; it can provide predictions for any protein and any function in Gene Ontology. DeepGOWeb is freely available at https://deepgo.cbrc.kaust.edu.sa/.


Subject(s)
Proteins/physiology , Sequence Analysis, Protein/methods , Software , Deep Learning , Gene Ontology , Internet , Proteins/chemistry , Proteins/genetics
11.
Bioinformatics ; 37(24): 4835-4843, 2021 12 11.
Article in English | MEDLINE | ID: mdl-34320178

ABSTRACT

MOTIVATION: In silico drug-target interaction (DTI) prediction is important for drug discovery and drug repurposing. Approaches to predict DTIs can proceed indirectly, top-down, using phenotypic effects of drugs to identify potential drug targets, or they can be direct, bottom-up and use molecular information to directly predict binding affinities. Both approaches can be combined with information about interaction networks. RESULTS: We developed DTI-Voodoo as a computational method that combines molecular features and ontology-encoded phenotypic effects of drugs with protein-protein interaction networks, and uses a graph convolutional neural network to predict DTIs. We demonstrate that drug effect features can exploit information in the interaction network whereas molecular features do not. DTI-Voodoo is designed to predict candidate drugs for a given protein; we use this formulation to show that common DTI datasets contain intrinsic biases with major effects on performance evaluation and comparison of DTI prediction methods. Using a modified evaluation scheme, we demonstrate that DTI-Voodoo improves significantly over state of the art DTI prediction methods. AVAILABILITY AND IMPLEMENTATION: DTI-Voodoo source code and data necessary to reproduce results are freely available at https://github.com/THinnerichs/DTI-VOODOO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Machine Learning , Software , Drug Development/methods , Neural Networks, Computer , Proteins/chemistry
12.
Bioinformatics ; 37(6): 853-860, 2021 05 05.
Article in English | MEDLINE | ID: mdl-33051643

ABSTRACT

MOTIVATION: Over the past years, many computational methods have been developed to incorporate information about phenotypes for the disease-gene prioritization task. These methods generally compute the similarity between a patient's phenotypes and a database of gene-phenotype associations to find the most phenotypically similar match. The main limitation of these methods is their reliance on knowledge about phenotypes associated with particular genes, which is not complete in humans as well as in many model organisms, such as the mouse and fish. Information about functions of gene products and anatomical site of gene expression is available for more genes and can also be related to phenotypes through ontologies and machine-learning models. RESULTS: We developed a novel graph-based machine-learning method for biomedical ontologies, which is able to exploit axioms in ontologies and other graph-structured data. Using our machine-learning method, we embed genes based on their associated phenotypes, functions of the gene products and anatomical location of gene expression. We then develop a machine-learning model to predict gene-disease associations based on the associations between genes and multiple biomedical ontologies, and this model significantly improves over state-of-the-art methods. Furthermore, we extend phenotype-based gene prioritization methods significantly to all genes that are associated with phenotypes, functions or site of expression. AVAILABILITY AND IMPLEMENTATION: Software and data are available at https://github.com/bio-ontology-research-group/DL2Vec. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Biological Ontologies , Animals , Data Management , Machine Learning , Mice , Phenotype , Software
13.
Bioinformatics ; 37(17): 2722-2729, 2021 Sep 09.
Article in English | MEDLINE | ID: mdl-33682875

ABSTRACT

MOTIVATION: Infectious diseases caused by novel viruses have become a major public health concern. Rapid identification of virus-host interactions can reveal mechanistic insights into infectious diseases and shed light on potential treatments. Current computational prediction methods for novel viruses are based mainly on protein sequences. However, it is not clear to what extent other important features, such as the symptoms caused by the viruses, could contribute to a predictor. Disease phenotypes (i.e. signs and symptoms) are readily accessible from clinical diagnosis and we hypothesize that they may act as a potential proxy and an additional source of information for the underlying molecular interactions between the pathogens and hosts. RESULTS: We developed DeepViral, a deep learning based method that predicts protein-protein interactions (PPI) between humans and viruses. Motivated by the potential utility of infectious disease phenotypes, we first embedded human proteins and viruses in a shared space using their associated phenotypes and functions, supported by formalized background knowledge from biomedical ontologies. By jointly learning from protein sequences and phenotype features, DeepViral significantly improves over existing sequence-based methods for intra- and inter-species PPI prediction. AVAILABILITY AND IMPLEMENTATION: Code and datasets for reproduction and customization are available at https://github.com/bio-ontology-research-group/DeepViral. Prediction results for 14 virus families are available at https://doi.org/10.5281/zenodo.4429824. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

14.
BMC Med Inform Decis Mak ; 22(1): 33, 2022 02 05.
Article in English | MEDLINE | ID: mdl-35123470

ABSTRACT

BACKGROUND: Semantic similarity is a valuable tool for analysis in biomedicine. When applied to phenotype profiles derived from clinical text, semantic similarity measures have the capacity to enable and enhance 'patient-like me' analyses, automated coding, differential diagnosis, and outcome prediction. While a large body of work exists exploring the use of semantic similarity for multiple tasks, including protein interaction prediction and rare disease differential diagnosis, there is less work exploring comparison of patient phenotype profiles for clinical tasks. Moreover, there are no experimental explorations of optimal parameters or better methods in the area. METHODS: We develop a platform for reproducible benchmarking and comparison of experimental conditions for patient phenotype similarity. Using the platform, we evaluate the task of ranking shared primary diagnosis from uncurated phenotype profiles derived from all text narrative associated with admissions in the Medical Information Mart for Intensive Care (MIMIC-III). RESULTS: 300 semantic similarity configurations were evaluated, as well as one embedding-based approach. On average, measures that did not make use of an external information content measure performed slightly better; however, the best-performing configurations when measured by area under receiver operating characteristic curve and Top Ten Accuracy used term-specificity and annotation-frequency measures. CONCLUSION: We identified and interpreted the performance of a large number of semantic similarity configurations for the task of classifying diagnosis from text-derived phenotype profiles in one setting. We also provided a basis for further research on other settings and related tasks in the area.


Subject(s)
Rare Diseases , Semantics , Humans , Phenotype , ROC Curve
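
The annotation-frequency information content measure referred to in the abstract above can be sketched with Resnik similarity, one common semantic similarity measure. The toy ontology, term names and function names below are hypothetical and chosen only for illustration; they are not taken from the benchmarking platform described in the paper.

```python
import math

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "abnormal_heart": ["root"],
    "abnormal_lung": ["root"],
    "arrhythmia": ["abnormal_heart"],
    "cardiomyopathy": ["abnormal_heart"],
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(profiles):
    """IC(c) = -log p(c), with p(c) the share of profiles annotated with c or a descendant."""
    counts = {term: 0 for term in PARENTS}
    for profile in profiles:
        covered = set()
        for term in profile:
            covered |= ancestors(term)
        for term in covered:
            counts[term] += 1
    total = len(profiles)
    return {t: -math.log(c / total) for t, c in counts.items() if c > 0}


def resnik(term_a, term_b, ic):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(ic.get(t, 0.0) for t in common)
```

With four profiles annotated with `arrhythmia`, `cardiomyopathy`, `abnormal_lung` and `abnormal_lung`, the two heart terms share `abnormal_heart` (covering half of the profiles), so their similarity is `log 2`, while a heart term and a lung term share only the root and score zero.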
15.
Bioinformatics ; 36(2): 422-429, 2020 01 15.
Article in English | MEDLINE | ID: mdl-31350877

ABSTRACT

MOTIVATION: Protein function prediction is one of the major tasks of bioinformatics that can help in a wide range of biological problems such as understanding disease mechanisms or finding drug targets. Many methods are available for predicting protein functions from sequence-based features, protein-protein interaction networks, protein structure or literature. However, other than sequence, most of the features are difficult to obtain or not available for many proteins, thereby limiting their scope. Furthermore, the performance of sequence-based function prediction methods is often lower than methods that incorporate multiple features, and predicting protein functions may require a lot of time. RESULTS: We developed a novel method for predicting protein functions from sequence alone which combines a deep convolutional neural network (CNN) model with sequence similarity-based predictions. Our CNN model scans the sequence for motifs which are predictive for protein functions and combines this with functions of similar proteins (if available). We evaluate the performance of DeepGOPlus using the CAFA3 evaluation measures and achieve an Fmax of 0.390, 0.557 and 0.614 for BPO, MFO and CCO evaluations, respectively. These results would have made DeepGOPlus one of the three best predictors in CCO and the second best performing method in the BPO and MFO evaluations. We also compare DeepGOPlus with state-of-the-art methods such as DeepText2GO and GOLabeler on another dataset. DeepGOPlus can annotate around 40 protein sequences per second on common hardware, thereby making fast and accurate function predictions available for a wide range of proteins. AVAILABILITY AND IMPLEMENTATION: http://deepgoplus.bio2vec.net/ . SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Neural Networks, Computer , Proteins , Amino Acid Sequence , Protein Interaction Maps
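
The Fmax values reported in the abstract above are protein-centric maximum F-measures. A minimal sketch of how such a score can be computed follows; the function name and the input layout are assumptions for illustration, and the sketch omits the propagation of annotations to ancestor classes in the Gene Ontology that a full CAFA-style evaluation would apply.

```python
def fmax(true_terms, predicted_scores, thresholds=None):
    """Protein-centric maximum F-measure over score thresholds.

    true_terms:       {protein: set of true terms}, each set assumed non-empty
    predicted_scores: {protein: {term: score in [0, 1]}}
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, truth in true_terms.items():
            scores = predicted_scores.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            overlap = len(predicted & truth)
            # Precision is averaged only over proteins with a prediction at t;
            # recall is averaged over all proteins.
            if predicted:
                precisions.append(overlap / len(predicted))
            recalls.append(overlap / len(truth))
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best
```

Because recall is averaged over every protein while precision only counts proteins with at least one prediction, a method that skips hard proteins is penalized on recall rather than rewarded on precision.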
16.
Bioinformatics ; 36(7): 2229-2236, 2020 04 01.
Article in English | MEDLINE | ID: mdl-31821406

ABSTRACT

MOTIVATION: Over the past years, significant resources have been invested into formalizing biomedical ontologies. Formal axioms in ontologies have been developed and used to detect and ensure ontology consistency, find unsatisfiable classes, improve interoperability, guide ontology extension through the application of axiom-based design patterns and encode domain background knowledge. The domain knowledge of biomedical ontologies may also have the potential to provide background knowledge for machine learning and predictive modelling. RESULTS: We use ontology-based machine learning methods to evaluate the contribution of formal axioms and ontology meta-data to the prediction of protein-protein interactions and gene-disease associations. We find that the background knowledge provided by the Gene Ontology and other ontologies significantly improves the performance of ontology-based prediction models through provision of domain-specific background knowledge. Furthermore, we find that the labels, synonyms and definitions in ontologies can also provide background knowledge that may be exploited for prediction. The axioms and meta-data of different ontologies contribute to improving data analysis in a context-specific manner. Our results have implications for the further development of formal knowledge bases and ontologies in the life sciences, in particular as machine learning methods are more frequently being applied. Our findings motivate the need for further development, and the systematic, application-driven evaluation and improvement, of formal axioms in ontologies. AVAILABILITY AND IMPLEMENTATION: https://github.com/bio-ontology-research-group/tsoe. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Biological Ontologies , Gene Ontology , Knowledge Bases , Machine Learning
17.
PLoS Comput Biol ; 16(11): e1008453, 2020 11.
Article in English | MEDLINE | ID: mdl-33206638

ABSTRACT

Predicting the phenotypes resulting from molecular perturbations is one of the key challenges in genetics. Both forward and reverse genetic screens are employed to identify the molecular mechanisms underlying phenotypes and disease, and these resulted in a large number of genotype-phenotype associations being available for humans and model organisms. Combined with recent advances in machine learning, it may now be possible to predict human phenotypes resulting from particular molecular aberrations. We developed DeepPheno, a neural network based hierarchical multi-class multi-label classification method for predicting the phenotypes resulting from loss-of-function in single genes. DeepPheno uses the functional annotations with gene products to predict the phenotypes resulting from a loss-of-function; additionally, we employ a two-step procedure in which we predict these functions first and then predict phenotypes. Prediction of phenotypes is ontology-based and we propose a novel ontology-based classifier suitable for very large hierarchical classification tasks. These methods allow us to predict phenotypes associated with any known protein-coding gene. We evaluate our approach using evaluation metrics established by the CAFA challenge and compare with top performing CAFA2 methods as well as several state of the art phenotype prediction approaches, demonstrating the improvement of DeepPheno over established methods. Furthermore, we show that predictions generated by DeepPheno are applicable to predicting gene-disease associations based on comparing phenotypes, and that a large number of new predictions made by DeepPheno have recently been added to phenotype databases.


Subject(s)
Genetic Association Studies/statistics & numerical data , Loss of Function Mutation , Neural Networks, Computer , Phenotype , Animals , Computational Biology , Databases, Genetic/statistics & numerical data , Deep Learning , Gene Ontology/statistics & numerical data , Genetic Predisposition to Disease , Humans
18.
Brief Bioinform ; 19(5): 1008-1021, 2018 09 28.
Article in English | MEDLINE | ID: mdl-28387809

ABSTRACT

The past decade has seen an explosion in the collection of genotype data in domains as diverse as medicine, ecology, livestock and plant breeding. Along with this comes the challenge of dealing with the related phenotype data, which is not only large but also highly multidimensional. Computational analysis of phenotypes has therefore become critical for our ability to understand the biological meaning of genomic data in the biological sciences. At the heart of computational phenotype analysis are the phenotype ontologies. A large number of these ontologies have been developed across many domains, and we are now at a point where the knowledge captured in the structure of these ontologies can be used for the integration and analysis of large interrelated data sets. The Phenotype And Trait Ontology framework provides a method for formal definitions of phenotypes and associated data sets and has proved to be key to our ability to develop methods for the integration and analysis of phenotype data. Here, we describe the development and products of the ontological approach to phenotype capture, the formal content of phenotype ontologies and how their content can be used computationally.


Subject(s)
Biological Ontologies/statistics & numerical data , Phenotype , Animals , Animals, Domestic , Biodiversity , Biological Evolution , Computational Biology/methods , Ecology , Gene Ontology/statistics & numerical data , Humans , Semantic Web
19.
Brief Bioinform ; 19(1): 77-88, 2018 01 01.
Article in English | MEDLINE | ID: mdl-27742665

ABSTRACT

Systems biology models are rapidly increasing in complexity, size and number. When building large models, researchers rely on software tools for the retrieval, comparison, combination and merging of models, as well as for version control. These tools need to be able to quantify the differences and similarities between computational models. However, depending on the specific application, the notion of 'similarity' may vary greatly. A general notion of model similarity, applicable to various types of models, is still missing. Here we survey existing methods for the comparison of models, introduce quantitative measures for model similarity, and discuss potential applications of combined similarity measures. To frame model comparison as a general problem, we describe a theoretical approach to defining and computing similarities based on a combination of different model aspects. The six aspects that we define as potentially relevant for similarity are: underlying encoding; references to biological entities; quantitative behaviour; qualitative behaviour; mathematical equations and parameters; and network structure. We argue that future similarity measures will benefit from combining these model aspects in flexible, problem-specific ways to mimic users' intuition about model similarity, and to support complex model searches in databases.
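
The abstract describes combining several model aspects into one problem-specific similarity but gives no formula. As a rough sketch under stated assumptions, one simple way to do this is a weighted average of per-aspect scores in [0, 1]. The weighted average, the Jaccard index for comparing sets of entity references, and the names `jaccard` and `combined_similarity` are illustrative choices made here, not the measures defined in the article.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two sets, e.g. the entity references of two models."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def combined_similarity(aspect_scores: Dict[str, float],
                        weights: Dict[str, float]) -> float:
    """Weighted average of per-aspect similarities (hypothetical scheme).

    Aspects missing from weights get weight 0, so a caller can switch
    aspects on or off for a specific application.
    """
    total = sum(weights.get(name, 0.0) for name in aspect_scores)
    if total <= 0:
        raise ValueError("at least one aspect needs a positive weight")
    weighted = sum(score * weights.get(name, 0.0)
                   for name, score in aspect_scores.items())
    return weighted / total
```

A model search that cares mostly about shared annotations could, for instance, pass a large weight for an `"annotations"` aspect and a small one for `"network"`, while a version-control tool might weight the encoding aspect most heavily.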


Subject(s)
Computational Biology/methods , Computer Simulation , Models, Biological , Software , Systems Biology/methods , Animals , Databases, Factual , Humans , Signal Transduction , User-Computer Interface
20.
Bioinformatics ; 35(12): 2133-2140, 2019 06 01.
Article in English | MEDLINE | ID: mdl-30407490

ABSTRACT

MOTIVATION: Ontologies are widely used in biology for data annotation, integration and analysis. In addition to formally structured axioms, ontologies contain meta-data in the form of annotation axioms, which provide valuable pieces of information that characterize ontology classes. Annotation axioms commonly used in ontologies include class labels, descriptions and synonyms. Despite being a rich source of semantic information, the ontology meta-data are generally unexploited by ontology-based analysis methods such as semantic similarity measures. RESULTS: We propose a novel method, OPA2Vec, to generate vector representations of biological entities in ontologies by combining formal ontology axioms and annotation axioms from the ontology meta-data. We apply a Word2Vec model that has been pre-trained on a corpus of either abstracts or full-text articles to produce feature vectors from our collected data. We validate our method in two different ways: first, we use the obtained vector representations of proteins in a similarity measure to predict protein-protein interactions on two different datasets. Second, we evaluate our method on predicting gene-disease associations based on phenotype similarity by generating vector representations of genes and diseases using a phenotype ontology, and applying the obtained vectors to predict gene-disease associations using mouse model phenotypes. We demonstrate that OPA2Vec significantly outperforms existing methods for predicting gene-disease associations. Using evidence from mouse models, we apply OPA2Vec to identify candidate genes for several thousand rare and orphan diseases. OPA2Vec can be used to produce vector representations of any biomedical entity given any type of biomedical ontology. AVAILABILITY AND IMPLEMENTATION: https://github.com/bio-ontology-research-group/opa2vec. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
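
The abstract says the vectors are used in a similarity measure to rank candidate associations, without stating which measure. The sketch below assumes cosine similarity, a common choice for embedding vectors, and shows how candidate genes could be ranked for one disease. It is not the OPA2Vec implementation; the function names, the gene labels and the toy two-dimensional vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_genes(disease_vec: Sequence[float],
               gene_vecs: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Return (gene, similarity) pairs, most similar to the disease first."""
    scored = [(gene, cosine(disease_vec, vec))
              for gene, vec in gene_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With toy vectors such as `rank_genes([1.0, 0.1], {"geneA": [1.0, 0.0], "geneB": [0.0, 1.0]})`, the gene whose vector points in nearly the same direction as the disease vector is listed first.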


Subject(s)
Biological Ontologies , Animals , Mice , Phenotype , Semantics