Results 1 - 18 of 18
1.
Bioinformatics ; 40(Supplement_1): i401-i409, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940168

ABSTRACT

Automated protein function prediction is a crucial and widely studied problem in bioinformatics. Computationally, protein function is a multilabel classification problem where only positive samples are defined and there is a large number of unlabeled annotations. Most existing methods rely on the assumption that the unlabeled set of protein function annotations are negatives, inducing the false negative issue, where potential positive samples are trained as negatives. We introduce a novel approach named PU-GO, wherein we address function prediction as a positive-unlabeled ranking problem. We apply empirical risk minimization, i.e. we minimize the classification risk of a classifier where class priors are obtained from the Gene Ontology hierarchical structure. We show that our approach is more robust than other state-of-the-art methods on similarity-based and time-based benchmark datasets. AVAILABILITY AND IMPLEMENTATION: Data and code are available at https://github.com/bio-ontology-research-group/PU-GO.


Subject(s)
Computational Biology , Gene Ontology , Proteins , Proteins/chemistry , Proteins/metabolism , Computational Biology/methods , Databases, Protein , Algorithms
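
The positive-unlabeled risk minimization mentioned in the abstract above can be illustrated with a minimal sketch. It uses the generic non-negative PU risk estimator with a sigmoid surrogate loss, which is an assumption made for illustration and not necessarily the exact ranking loss used by PU-GO. The function names and the `prior` parameter are hypothetical; in the abstract the class priors are derived from the Gene Ontology hierarchy, whereas here the prior is simply passed in.

```python
import math


def sigmoid_loss(score, label):
    """Sigmoid surrogate loss; small when the sign of score agrees with label (+1 or -1)."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values):
    return sum(values) / len(values)


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk estimate for a single class (illustrative only).

    pos_scores: classifier scores of known positive samples
    unl_scores: classifier scores of unlabeled samples
    prior:      assumed fraction of positives in the unlabeled set
    """
    pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])
    # Risk on the (unknown) negatives is estimated from the unlabeled set,
    # corrected for the positives hidden in it, and clamped at zero.
    negative_part = unl_as_neg - prior * pos_as_neg
    return prior * pos_as_pos + max(0.0, negative_part)
```

Treating every unlabeled annotation as a negative corresponds to dropping the correction term; the clamp at zero keeps the estimated negative risk from becoming negative when the classifier fits the unlabeled samples very closely.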
2.
Res Sq ; 2023 Aug 02.
Article in English | MEDLINE | ID: mdl-37577579

ABSTRACT

In the context of the Critical Assessment of Genome Interpretation, 6th edition (CAGI6), the Genetics of Neurodevelopmental Disorders Lab in Padua proposed a new ID-challenge to provide an opportunity to develop computational methods for predicting patients' phenotypes and causal variants. Eight research teams, with 30 models, had access to the phenotype details and real genetic data, based on the sequences of 74 genes (VCF format) in 415 pediatric patients affected by Neurodevelopmental Disorders (NDDs). NDDs are clinically and genetically heterogeneous conditions, with onset in infancy. In this study we evaluate the ability and accuracy of computational methods to predict comorbid phenotypes based on the clinical features described for each patient and on causal variants. Finally, we asked participants to develop a method to find new possible genetic causes for patients without a genetic diagnosis. As in CAGI5, seven clinical features (ID, ASD, ataxia, epilepsy, microcephaly, macrocephaly, hypotonia) and variants (causative, putative pathogenic and contributing factors) were provided. Considering the overall clinical manifestation of our cohort, we provided the variant data and phenotypic traits of the 150 patients from the CAGI5 ID-Challenge as training and validation data for developing the prediction methods.

3.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36534832

ABSTRACT

MOTIVATION: Ontologies contain formal and structured information about a domain and are widely used in bioinformatics for annotation and integration of data. Several methods use ontologies to provide background knowledge in machine learning tasks, which is of particular importance in bioinformatics. These methods rely on a set of common primitives that are not readily available in a software library; a library providing these primitives would facilitate the use of current machine learning methods with ontologies and the development of novel methods for other ontology-based biomedical applications. RESULTS: We developed mOWL, a Python library for machine learning with ontologies formalized in the Web Ontology Language (OWL). mOWL implements ontology embedding methods that map information contained in formal knowledge bases and ontologies into vector spaces while preserving some of the properties and relations in ontologies, as well as methods to use these embeddings for similarity computation, deductive inference and zero-shot learning. We demonstrate mOWL on the knowledge-based prediction of protein-protein interactions using the gene ontology and gene-disease associations using phenotype ontologies. AVAILABILITY AND IMPLEMENTATION: mOWL is freely available on https://github.com/bio-ontology-research-group/mowl and as a Python package in PyPi. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Biological Ontologies , Gene Ontology , Knowledge Bases , Software , Machine Learning
4.
Bioinformatics ; 38(Suppl 1): i238-i245, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758802

ABSTRACT

MOTIVATION: Protein functions are often described using the Gene Ontology (GO) which is an ontology consisting of over 50 000 classes and a large set of formal axioms. Predicting the functions of proteins is one of the key challenges in computational biology and a variety of machine learning methods have been developed for this purpose. However, these methods usually require a significant amount of training data and cannot make predictions for GO classes that have only few or no experimental annotations. RESULTS: We developed DeepGOZero, a machine learning model which improves predictions for functions with no or only a small number of annotations. To achieve this goal, we rely on a model-theoretic approach for learning ontology embeddings and combine it with neural networks for protein function prediction. DeepGOZero can exploit formal axioms in the GO to make zero-shot predictions, i.e., predict protein functions even if not a single protein in the training phase was associated with that function. Furthermore, the zero-shot prediction method employed by DeepGOZero is generic and can be applied whenever associations with ontology classes need to be predicted. AVAILABILITY AND IMPLEMENTATION: http://github.com/bio-ontology-research-group/deepgozero. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology , Proteins , Computational Biology/methods , Gene Ontology , Machine Learning , Proteins/metabolism
5.
Nucleic Acids Res ; 49(W1): W140-W146, 2021 07 02.
Article in English | MEDLINE | ID: mdl-34019664

ABSTRACT

Understanding the functions of proteins is crucial to understand biological processes on a molecular level. Many more protein sequences are available than can be investigated experimentally. DeepGOPlus is a protein function prediction method based on deep learning and sequence similarity. DeepGOWeb makes the prediction model available through a website, an API, and through the SPARQL query language for interoperability with databases that rely on Semantic Web technologies. DeepGOWeb provides accurate and fast predictions and ensures that predicted functions are consistent with the Gene Ontology; it can provide predictions for any protein and any function in Gene Ontology. DeepGOWeb is freely available at https://deepgo.cbrc.kaust.edu.sa/.


Subject(s)
Proteins/physiology , Sequence Analysis, Protein/methods , Software , Deep Learning , Gene Ontology , Internet , Proteins/chemistry , Proteins/genetics
6.
Bioinformatics ; 37(8): 1187, 2021 May 23.
Article in English | MEDLINE | ID: mdl-34009304
7.
Brief Bioinform ; 22(4)2021 07 20.
Article in English | MEDLINE | ID: mdl-33049044

ABSTRACT

Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge and they are employed in almost every major biological database. Recently, ontologies are increasingly being used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview over the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in ontologies and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies.


Subject(s)
Biological Ontologies , Machine Learning , Models, Biological , Semantics
8.
PLoS Comput Biol ; 16(11): e1008453, 2020 11.
Article in English | MEDLINE | ID: mdl-33206638

ABSTRACT

Predicting the phenotypes resulting from molecular perturbations is one of the key challenges in genetics. Both forward and reverse genetic screens are employed to identify the molecular mechanisms underlying phenotypes and disease, and these have resulted in a large number of genotype-phenotype associations being available for humans and model organisms. Combined with recent advances in machine learning, it may now be possible to predict human phenotypes resulting from particular molecular aberrations. We developed DeepPheno, a neural network-based hierarchical multi-class multi-label classification method for predicting the phenotypes resulting from loss-of-function in single genes. DeepPheno uses the functional annotations of gene products to predict the phenotypes resulting from a loss-of-function; additionally, we employ a two-step procedure in which we predict these functions first and then predict phenotypes. Prediction of phenotypes is ontology-based and we propose a novel ontology-based classifier suitable for very large hierarchical classification tasks. These methods allow us to predict phenotypes associated with any known protein-coding gene. We evaluate our approach using evaluation metrics established by the CAFA challenge and compare with top-performing CAFA2 methods as well as several state-of-the-art phenotype prediction approaches, demonstrating the improvement of DeepPheno over established methods. Furthermore, we show that predictions generated by DeepPheno are applicable to predicting gene-disease associations based on comparing phenotypes, and that a large number of new predictions made by DeepPheno have recently been added to phenotype databases.


Subject(s)
Genetic Association Studies/statistics & numerical data , Loss of Function Mutation , Neural Networks, Computer , Phenotype , Animals , Computational Biology , Databases, Genetic/statistics & numerical data , Deep Learning , Gene Ontology/statistics & numerical data , Genetic Predisposition to Disease , Humans
9.
Article in English | MEDLINE | ID: mdl-32850499

ABSTRACT

The spread of the novel coronavirus (SARS-CoV-2) has triggered a global emergency that demands urgent solutions for detection and therapy to prevent escalating health, social, and economic impacts. The spike protein (S) of this virus enables binding to the human receptor ACE2, and hence presents a prime target for vaccines preventing viral entry into host cells. The S proteins from SARS and SARS-CoV-2 are similar, but structural differences in the receptor binding domain (RBD) preclude the use of SARS-specific neutralizing antibodies to inhibit SARS-CoV-2. Here we used comparative pangenomic analysis of all sequenced reference Betacoronaviruses, complemented with functional and structural analyses. This analysis reveals that, among all core gene clusters present in these viruses, the envelope protein E shows a variant cluster shared by SARS and SARS-CoV-2 with two completely conserved key functional features, namely an ion channel and a PDZ-binding motif (PBM). These features play a key role in the activation of the inflammasome causing the acute respiratory distress syndrome, the leading cause of death in SARS and SARS-CoV-2 infections. Based on functional pangenomic analysis, mutation tracking, and previous evidence on E protein as a determinant of pathogenicity in SARS, we suggest E protein as an alternative therapeutic target to be considered for further studies to reduce complications of SARS-CoV-2 infections in COVID-19.


Subject(s)
Betacoronavirus/chemistry , Viral Envelope Proteins/chemistry , Viral Envelope Proteins/genetics , COVID-19 , Coronavirus Envelope Proteins , Coronavirus Infections/virology , Genes, Essential , Genes, Viral , Genome, Viral , Humans , Middle East Respiratory Syndrome Coronavirus/chemistry , Middle East Respiratory Syndrome Coronavirus/genetics , Mutation , Open Reading Frames , PDZ Domains , Pandemics , Pneumonia, Viral/virology , Protein Domains , Severe acute respiratory syndrome-related coronavirus/chemistry , SARS-CoV-2 , Viroporin Proteins
10.
Bioinformatics ; 36(2): 422-429, 2020 01 15.
Article in English | MEDLINE | ID: mdl-31350877

ABSTRACT

MOTIVATION: Protein function prediction is one of the major tasks of bioinformatics that can help in a wide range of biological problems such as understanding disease mechanisms or finding drug targets. Many methods are available for predicting protein functions from sequence-based features, protein-protein interaction networks, protein structure or literature. However, other than sequence, most of the features are difficult to obtain or not available for many proteins, thereby limiting their scope. Furthermore, the performance of sequence-based function prediction methods is often lower than that of methods that incorporate multiple features, and predicting protein functions may require a lot of time. RESULTS: We developed a novel method for predicting protein functions from sequence alone which combines a deep convolutional neural network (CNN) model with sequence similarity-based predictions. Our CNN model scans the sequence for motifs which are predictive for protein functions and combines this with functions of similar proteins (if available). We evaluate the performance of DeepGOPlus using the CAFA3 evaluation measures and achieve an Fmax of 0.390, 0.557 and 0.614 for BPO, MFO and CCO evaluations, respectively. These results would have made DeepGOPlus one of the three best predictors in CCO and the second best performing method in the BPO and MFO evaluations. We also compare DeepGOPlus with state-of-the-art methods such as DeepText2GO and GOLabeler on another dataset. DeepGOPlus can annotate around 40 protein sequences per second on common hardware, thereby making fast and accurate function predictions available for a wide range of proteins. AVAILABILITY AND IMPLEMENTATION: http://deepgoplus.bio2vec.net/ . SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Neural Networks, Computer , Proteins , Amino Acid Sequence , Protein Interaction Maps
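
The Fmax values reported in the abstract above refer to the protein-centric maximum F-measure used in the CAFA evaluations. A minimal sketch of how such a score can be computed follows; the function name, the input layout (dictionaries keyed by protein) and the threshold grid of 0.01 steps are assumptions for illustration, not the official CAFA evaluation code, and every protein in `truth` is assumed to have at least one true term.

```python
def fmax(predictions, truth, thresholds=None):
    """Protein-centric Fmax (illustrative sketch).

    predictions: {protein: {term: score in [0, 1]}}
    truth:       {protein: set of true terms}, each set non-empty
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            # Precision is averaged only over proteins with at least one
            # prediction at this threshold; recall over all proteins.
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = recall_sum / len(truth)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best
```

In practice both the predicted and the true term sets are usually propagated to all ancestor classes in the ontology before scoring; that step is omitted here to keep the sketch short.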
11.
Sci Data ; 6(1): 79, 2019 Jun 03.
Article in English | MEDLINE | ID: mdl-31160594

ABSTRACT

Understanding the relationship between the pathophysiology of infectious disease, the biology of the causative agent and the development of therapeutic and diagnostic approaches is dependent on the synthesis of a wide range of types of information. Provision of a comprehensive and integrated disease phenotype knowledgebase has the potential to provide novel and orthogonal sources of information for the understanding of infectious agent pathogenesis, and support for research on disease mechanisms. We have developed PathoPhenoDB, a database containing pathogen-to-phenotype associations. PathoPhenoDB relies on manual curation of pathogen-disease relations, on ontology-based text mining as well as manual curation to associate host disease phenotypes with infectious agents. Using Semantic Web technologies, PathoPhenoDB also links to knowledge about drug resistance mechanisms and drugs used in the treatment of infectious diseases. PathoPhenoDB is accessible at http://patho.phenomebrowser.net/ , and the data are freely available through a public SPARQL endpoint.


Subject(s)
Communicable Diseases , Host-Pathogen Interactions , Phenotype , Databases, Factual , Humans , Semantic Web , User-Computer Interface
12.
BMC Bioinformatics ; 20(1): 65, 2019 Feb 06.
Article in English | MEDLINE | ID: mdl-30727941

ABSTRACT

BACKGROUND: Prioritization of variants in personal genomic data is a major challenge. Recently, computational methods that rely on comparing phenotype similarity have been shown to be useful to identify causative variants. In these methods, pathogenicity prediction is combined with a semantic similarity measure to prioritize not only variants that are likely to be dysfunctional but those that are likely involved in the pathogenesis of a patient's phenotype. RESULTS: We have developed DeepPVP, a variant prioritization method that combines automated inference with deep neural networks to identify the likely causative variants in whole exome or whole genome sequence data. We demonstrate that DeepPVP performs significantly better than existing methods, including phenotype-based methods that use similar features. DeepPVP is freely available at https://github.com/bio-ontology-research-group/phenomenet-vp . CONCLUSIONS: DeepPVP further improves on existing variant prioritization methods both in terms of speed as well as accuracy.


Subject(s)
Deep Learning , Genetic Variation , Software , Exome/genetics , Gene Frequency/genetics , Humans , Neural Networks, Computer , Phenotype , Exome Sequencing
13.
Bioinformatics ; 34(17): i857-i865, 2018 09 01.
Article in English | MEDLINE | ID: mdl-30423068

ABSTRACT

Motivation: Function annotations of gene products, and phenotype annotations of genotypes, provide valuable information about molecular mechanisms that can be utilized by computational methods to identify functional and phenotypic relatedness, improve our understanding of disease and pathobiology, and lead to discovery of drug targets. Identifying functions and phenotypes commonly requires experiments which are time-consuming and expensive to carry out; creating the annotations additionally requires a curator to make an assertion based on reported evidence. Support to validate the mutual consistency of functional and phenotype annotations as well as a computational method to predict phenotypes from function annotations, would greatly improve the utility of function annotations. Results: We developed a novel ontology-based method to validate the mutual consistency of function and phenotype annotations. We apply our method to mouse and human annotations, and identify several inconsistencies that can be resolved to improve overall annotation quality. We also apply our method to the rule-based prediction of regulatory phenotypes from functions and demonstrate that we can predict these phenotypes with Fmax of up to 0.647. Availability and implementation: https://github.com/bio-ontology-research-group/phenogocon.


Subject(s)
Phenotype , Animals , Computational Biology , Gene Ontology , Humans , Mice , Molecular Sequence Annotation , Proteins/genetics , Software
14.
Sci Rep ; 8(1): 14681, 2018 10 02.
Article in English | MEDLINE | ID: mdl-30279426

ABSTRACT

An increasing number of disorders have been identified for which two or more distinct alleles in two or more genes are required to either cause the disease or to significantly modify its onset, severity or phenotype. It is difficult to discover such interactions using existing approaches. The purpose of our work is to develop and evaluate a system that can identify combinations of alleles underlying digenic and oligogenic diseases in individual whole exome or whole genome sequences. Information that links patient phenotypes to databases of gene-phenotype associations observed in clinical or non-human model organism research can provide useful information and improve variant prioritization for genetic diseases. Additional background knowledge about interactions between genes can be utilized to identify sets of variants in different genes in the same individual which may then contribute to the overall disease phenotype. We have developed OligoPVP, an algorithm that can be used to prioritize causative combinations of variants in digenic and oligogenic diseases, using whole exome or whole genome sequences together with patient phenotypes as input. We demonstrate that OligoPVP has significantly improved performance when compared to state of the art pathogenicity detection methods in the case of digenic diseases. Our results show that OligoPVP can efficiently prioritize sets of variants in digenic diseases using a phenotype-driven approach and identify etiologically important variants in whole genomes. OligoPVP naturally extends to oligogenic disease involving interactions between variants in two or more genes. It can be applied to the identification of multiple interacting candidate variants contributing to phenotype, where the action of modifier genes is suspected from pedigree analysis or failure of traditional causative variant identification.


Subject(s)
Computational Biology/methods , Genetic Predisposition to Disease , Genome-Wide Association Study/methods , Genomics/methods , Multifactorial Inheritance , Genetic Association Studies , Humans
15.
Bioinformatics ; 34(4): 660-668, 2018 02 15.
Article in English | MEDLINE | ID: mdl-29028931

ABSTRACT

Motivation: A large number of protein sequences are becoming available through the application of novel high-throughput sequencing technologies. Experimental functional characterization of these proteins is time-consuming and expensive, and is often only done rigorously for a few selected model organisms. Computational function prediction approaches have been suggested to fill this gap. The functions of proteins are classified using the Gene Ontology (GO), which contains over 40 000 classes. Additionally, proteins have multiple functions, making function prediction a large-scale, multi-class, multi-label problem. Results: We have developed a novel method to predict protein function from sequence. We use deep learning to learn features from protein sequences as well as a cross-species protein-protein interaction network. Our approach specifically outputs information in the structure of the GO and utilizes the dependencies between GO classes as background information to construct a deep learning model. We evaluate our method using the standards established by the Computational Assessment of Function Annotation (CAFA) and demonstrate a significant improvement over baseline methods such as BLAST, in particular for predicting cellular locations. Availability and implementation: Web server: http://deepgo.bio2vec.net, Source code: https://github.com/bio-ontology-research-group/deepgo. Contact: robert.hoehndorf@kaust.edu.sa. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Gene Ontology , Protein Interaction Maps , Proteins/metabolism , Sequence Analysis, Protein/methods , Software , Supervised Machine Learning , Animals , Bacteria/metabolism , Computational Biology/methods , Eukaryota/metabolism , Humans , Proteins/physiology
16.
Sci Rep ; 7(1): 5968, 2017 07 20.
Article in English | MEDLINE | ID: mdl-28729549

ABSTRACT

Tomato is the most economically important horticultural crop used as a model to study plant biology and particularly fruit development. Knowledge obtained from tomato research initiated improvements in tomato and, being transferable to other such economically important crops, has led to a surge of tomato-related research and published literature. We developed the DES-TOMATO knowledgebase (KB) for exploration of information related to tomato. Information exploration is enabled through terms from 26 dictionaries and combinations of these terms. To illustrate the utility of DES-TOMATO, we provide several examples of how one can efficiently use this KB to retrieve known or potentially novel information. DES-TOMATO is free for academic and nonprofit users and can be accessed at http://cbrc.kaust.edu.sa/des_tomato/, using any of the mainstream web browsers, including Firefox, Safari and Chrome.


Subject(s)
Knowledge Bases , Solanum lycopersicum/genetics , Genes, Plant , Genetic Association Studies , Information Storage and Retrieval , Semantics
17.
PLoS Comput Biol ; 13(4): e1005500, 2017 04.
Article in English | MEDLINE | ID: mdl-28414800

ABSTRACT

Discriminating the causative disease variant(s) for individuals with inherited or de novo mutations presents one of the main challenges faced by the clinical genetics community today. Computational approaches for variant prioritization include machine learning methods utilizing a large number of features, including molecular information, interaction networks, or phenotypes. Here, we demonstrate the PhenomeNET Variant Predictor (PVP) system that exploits semantic technologies and automated reasoning over genotype-phenotype relations to filter and prioritize variants in whole exome and whole genome sequencing datasets. We demonstrate the performance of PVP in identifying causative variants on a large number of synthetic whole exome and whole genome sequences, covering a wide range of diseases and syndromes. In a retrospective study, we further illustrate the application of PVP for the interpretation of whole exome sequencing data in patients suffering from congenital hypothyroidism. We find that PVP accurately identifies causative variants in whole exome and whole genome sequencing datasets and provides a powerful resource for the discovery of causal variants.


Subject(s)
Computational Biology/methods , Exome/genetics , Genetic Variation/genetics , Genome/genetics , Molecular Sequence Annotation/methods , Semantics , Algorithms , Humans , Phenotype , Retrospective Studies
18.
J Biomed Semantics ; 8(1): 7, 2017 Feb 13.
Article in English | MEDLINE | ID: mdl-28193260

ABSTRACT

BACKGROUND: Ontologies are widely used as metadata in biological and biomedical datasets. Measures of semantic similarity utilize ontologies to determine how similar two entities annotated with classes from ontologies are, and semantic similarity is increasingly applied in applications ranging from diagnosis of disease to investigation in gene networks and functions of gene products. RESULTS: Here, we analyze a large number of semantic similarity measures and the sensitivity of similarity values to the number of annotations of entities, difference in annotation size and to the depth or specificity of annotation classes. We find that most similarity measures are sensitive to the number of annotations of entities, difference in annotation size as well as to the depth of annotation classes; well-studied and richly annotated entities will usually show higher similarity than entities with only few annotations even in the absence of any biological relation. CONCLUSIONS: Our findings may have significant impact on the interpretation of results that rely on measures of semantic similarity, and we demonstrate how the sensitivity to annotation size can lead to a bias when using semantic similarity to predict protein-protein interactions.


Subject(s)
Gene Ontology , Molecular Sequence Annotation , Semantics , Disease/genetics , Protein Interaction Mapping
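
One simple mechanism behind the sensitivity to annotation size described in the abstract above can be shown with a toy example. The ontology, the term names, the depth-of-common-ancestor term similarity and the max-based entity similarity below are all hypothetical choices made for illustration; they are not claimed to be among the measures analyzed in the paper.

```python
# Toy ontology as a child -> parent map (hypothetical terms).
PARENT = {"root": None, "a": "root", "b": "root",
          "a1": "a", "a2": "a", "b1": "b"}


def ancestors(term):
    """Return the term itself plus all of its ancestors up to the root."""
    result = set()
    while term is not None:
        result.add(term)
        term = PARENT[term]
    return result


def depth(term):
    """Number of edges between the term and the root."""
    return len(ancestors(term)) - 1


def term_similarity(x, y):
    """Depth of the deepest common ancestor (a crude specificity proxy)."""
    return max(depth(t) for t in ancestors(x) & ancestors(y))


def entity_similarity(annotations_x, annotations_y):
    """Max-based similarity between two annotated entities."""
    return max(term_similarity(x, y)
               for x in annotations_x for y in annotations_y)


sparse = {"a1"}        # entity with a single annotation
rich = {"a1", "b"}     # same entity with one additional annotation
other = {"b1"}
print(entity_similarity(sparse, other), entity_similarity(rich, other))
```

Because the entity similarity takes a maximum over all annotation pairs, adding annotations to an entity can only keep its similarity to other entities the same or raise it, so richly annotated entities tend to score higher under a measure of this kind.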