Results 1 - 14 of 14
1.
Hum Mutat ; 40(9): 1519-1529, 2019 Sep.
Article in English | MEDLINE | ID: mdl-31342580

ABSTRACT

The NAGLU challenge of the fourth edition of the Critical Assessment of Genome Interpretation experiment (CAGI4), held in 2016, invited participants to predict the impact of variants of unknown significance (VUS) on the enzymatic activity of the lysosomal hydrolase α-N-acetylglucosaminidase (NAGLU). Deficiencies in NAGLU activity lead to a rare, monogenic, recessive lysosomal storage disorder, Sanfilippo syndrome type B (MPS type IIIB). This challenge attracted 17 submissions from 10 groups. We observed that top models were able to predict the impact of missense mutations on enzymatic activity with Pearson's correlation coefficients of up to 0.61. We also observed that top methods were significantly more correlated with each other than they were with observed enzymatic activity values, which we believe speaks to the importance of sequence conservation across the different methods. Improved functional predictions on the VUS will help population-scale analysis of disease epidemiology and rare variant association analysis.


Subject(s)
Acetylglucosaminidase/metabolism , Computational Biology/methods , Mutation, Missense , Acetylglucosaminidase/genetics , Humans , Models, Genetic , Regression Analysis
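The abstract above scores predictors by the Pearson correlation between predicted and measured enzymatic activity. A minimal sketch of that statistic follows; the function name `pearson_r` and the sample values are hypothetical and are not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical relative activities, for illustration only
predicted = [0.10, 0.40, 0.35, 0.80, 0.95]
observed = [0.05, 0.30, 0.50, 0.70, 1.00]
r = pearson_r(predicted, observed)
```

The same function applied to two sets of predictions, rather than to predictions and measurements, would give the method-to-method correlation the abstract also mentions.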
2.
Nat Methods ; 10(3): 221-7, 2013 Mar.
Article in English | MEDLINE | ID: mdl-23353650

ABSTRACT

Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.


Subject(s)
Computational Biology/methods , Molecular Biology/methods , Molecular Sequence Annotation , Proteins/physiology , Algorithms , Animals , Databases, Protein , Exoribonucleases/classification , Exoribonucleases/genetics , Exoribonucleases/physiology , Forecasting , Humans , Proteins/chemistry , Proteins/classification , Proteins/genetics , Species Specificity
3.
Bioinformatics ; 30(17): i609-16, 2014 Sep 01.
Article in English | MEDLINE | ID: mdl-25161254

ABSTRACT

MOTIVATION: The automated functional annotation of biological macromolecules is a problem of computational assignment of biological concepts or ontological terms to genes and gene products. A number of methods have been developed to computationally annotate genes using standardized nomenclature such as Gene Ontology (GO). However, questions remain about the possibility for development of accurate methods that can integrate disparate molecular data as well as about an unbiased evaluation of these methods. One important concern is that experimental annotations of proteins are incomplete. This raises questions as to whether and to what degree currently available data can be reliably used to train computational models and estimate their performance accuracy. RESULTS: We study the effect of incomplete experimental annotations on the reliability of performance evaluation in protein function prediction. Using the structured-output learning framework, we provide theoretical analyses and carry out simulations to characterize the effect of growing experimental annotations on the correctness and stability of performance estimates corresponding to different types of methods. We then analyze real biological data by simulating the prediction, evaluation and subsequent re-evaluation (after additional experimental annotations become available) of GO term predictions. Our results agree with previous observations that incomplete and accumulating experimental annotations have the potential to significantly impact accuracy assessments. We find that their influence reflects a complex interplay between the prediction algorithm, performance metric and underlying ontology. However, using the available experimental data and under realistic assumptions, our results also suggest that current large-scale evaluations are meaningful and almost surprisingly reliable. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Proteins/physiology , Algorithms , Computational Biology/methods , Gene Ontology , Molecular Sequence Annotation , Proteins/genetics , Sequence Alignment
4.
Bioinformatics ; 29(13): i53-61, 2013 Jul 01.
Article in English | MEDLINE | ID: mdl-23813009

ABSTRACT

MOTIVATION: The development of effective methods for the prediction of ontological annotations is an important goal in computational biology, with protein function prediction and disease gene prioritization gaining wide recognition. Although various algorithms have been proposed for these tasks, evaluating their performance is difficult owing to problems caused both by the structure of biomedical ontologies and biased or incomplete experimental annotations of genes and gene products. RESULTS: We propose an information-theoretic framework to evaluate the performance of computational protein function prediction. We use a Bayesian network, structured according to the underlying ontology, to model the prior probability of a protein's function. We then define two concepts, misinformation and remaining uncertainty, that can be seen as information-theoretic analogs of precision and recall. Finally, we propose a single statistic, referred to as semantic distance, that can be used to rank classification models. We evaluate our approach by analyzing the performance of three protein function predictors of Gene Ontology terms and provide evidence that it addresses several weaknesses of currently used metrics. We believe this framework provides useful insights into the performance of protein function prediction tools. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Gene Ontology , Molecular Sequence Annotation , Proteins/physiology , Algorithms , Bayes Theorem , Data Interpretation, Statistical , Genes , Humans , Proteins/chemistry , Proteins/genetics
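The abstract above names three quantities (remaining uncertainty, misinformation, and semantic distance) without giving formulas. The sketch below shows one common reading for a single protein: remaining uncertainty sums the information of true terms that were missed, misinformation sums the information of predicted terms that are wrong, and semantic distance is the smallest Euclidean norm of the two over decision thresholds. It assumes per-term information values are already available (the abstract derives them from a Bayesian network over the ontology, which is not reproduced here); all names and numbers are hypothetical.

```python
import math


def remaining_uncertainty(true_terms, predicted_terms, info):
    """Total information of true terms that the prediction missed."""
    return sum(info[t] for t in true_terms - predicted_terms)


def misinformation(true_terms, predicted_terms, info):
    """Total information of predicted terms that are not true."""
    return sum(info[t] for t in predicted_terms - true_terms)


def semantic_distance(true_terms, scored_terms, info, thresholds):
    """Minimum Euclidean norm of (ru, mi) over the given score thresholds."""
    best = math.inf
    for tau in thresholds:
        # Keep every term whose score reaches the threshold
        predicted = {t for t, s in scored_terms.items() if s >= tau}
        ru = remaining_uncertainty(true_terms, predicted, info)
        mi = misinformation(true_terms, predicted, info)
        best = min(best, math.hypot(ru, mi))
    return best


# Hypothetical information values (in bits) and prediction scores
info = {"A": 1.0, "B": 2.0, "C": 3.0}
true_terms = {"A", "B"}
scores = {"A": 0.9, "B": 0.4, "C": 0.6}
s = semantic_distance(true_terms, scores, info, [0.3, 0.5, 0.8])
```

Weighting by information means that missing a specific, rarely annotated term costs more than missing a generic one, which is the weakness of plain precision and recall that the abstract points to.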
5.
Genome Biol ; 24(1): 172, 2023 Jul 21.
Article in English | MEDLINE | ID: mdl-37480112

ABSTRACT

BACKGROUND: Metachromatic leukodystrophy (MLD) is a lysosomal storage disorder caused by mutations in the arylsulfatase A gene (ARSA) and categorized into three subtypes according to age of onset. The functional effect of most ARSA mutants remains unknown; better understanding of the genotype-phenotype relationship is required to support newborn screening (NBS) and guide treatment. RESULTS: We collected a patient data set from the literature that relates disease severity to ARSA genotype in 489 individuals with MLD. Patient-based data were used to develop a phenotype matrix that predicts MLD phenotype given ARSA alleles in a patient's genotype with 76% accuracy. We then employed a high-throughput enzyme activity assay using mass spectrometry to explore the function of ARSA variants from the curated patient data set and the Genome Aggregation Database (gnomAD). We observed evidence that 36% of variants of unknown significance (VUS) in ARSA may be pathogenic. By classifying functional effects for 251 VUS from gnomAD, we reduced the incidence of genotypes of unknown significance (GUS) by over 98.5% in the overall population. CONCLUSIONS: These results provide an additional tool for clinicians to anticipate the disease course in MLD patients, identifying individuals at high risk of severe disease to support treatment access. Our results suggest that more than 1 in 3 VUS in ARSA may be pathogenic. We show that combining genetic and biochemical information increases diagnostic yield. Our strategy may apply to other recessive diseases, providing a tool to address the challenge of interpreting VUS within genotype-phenotype relationships and NBS.


Subject(s)
Leukodystrophy, Metachromatic , Humans , Leukodystrophy, Metachromatic/diagnosis , Leukodystrophy, Metachromatic/genetics , Phenotype , Genotype , Alleles , Patient Acuity
6.
PLoS Comput Biol ; 7(6): e1002073, 2011 Jun.
Article in English | MEDLINE | ID: mdl-21695233

ABSTRACT

A common assumption in comparative genomics is that orthologous genes share greater functional similarity than do paralogous genes (the "ortholog conjecture"). Many methods used to computationally predict protein function are based on this assumption, even though it is largely untested. Here we present the first large-scale test of the ortholog conjecture using comparative functional genomic data from human and mouse. We use the experimentally derived functions of more than 8,900 genes, as well as an independent microarray dataset, to directly assess our ability to predict function using both orthologs and paralogs. Both datasets show that paralogs are often a much better predictor of function than are orthologs, even at lower sequence identities. Among paralogs, those found within the same species are consistently more functionally similar than those found in a different species. We also find that paralogous pairs residing on the same chromosome are more functionally similar than those on different chromosomes, perhaps due to higher levels of interlocus gene conversion between these pairs. In addition to offering implications for the computational prediction of protein function, our results shed light on the relationship between sequence divergence and functional divergence. We conclude that the most important factor in the evolution of function is not amino acid sequence, but rather the cellular context in which proteins act.


Subject(s)
Comparative Genomic Hybridization , Evolution, Molecular , Genes , Animals , Gene Dosage , Gene Expression Profiling , Humans , Mice , Oligonucleotide Array Sequence Analysis , Proteins/genetics
7.
Hum Mutat ; 32(10): 1183-90, 2011 Oct.
Article in English | MEDLINE | ID: mdl-21796725

ABSTRACT

Next-generation sequencing (NGS) technologies are yielding ever higher volumes of human genome sequence data. Given this large amount of data, it has become both a possibility and a priority to determine how disease-causing single nucleotide polymorphisms (SNPs) detected within gene regulatory regions (rSNPs) exert their effects on gene expression. Recently, several studies have explored whether disease-causing polymorphisms have attributes that can distinguish them from those that are neutral, attaining moderate success at discriminating between functional and putatively neutral regulatory SNPs. Here, we have extended this work by assessing the utility of both SNP-based features (those associated only with the polymorphism site and the surrounding DNA) and gene-based features (those derived from the associated gene in whose regulatory region the SNP lies) in the identification of functional regulatory polymorphisms involved in either monogenic or complex disease. Gene-based features were found to be capable of both augmenting and enhancing the utility of SNP-based features in the prediction of known regulatory mutations. Adopting this approach, we achieved an AUC of 0.903 for predicting regulatory SNPs. Finally, our tool predicted 225 new regulatory SNPs with a high degree of confidence, with 105 of the 225 falling into linkage disequilibrium blocks of reported disease-associated genome-wide association studies SNPs.


Subject(s)
Genetic Diseases, Inborn/genetics , Polymorphism, Single Nucleotide , Alleles , Chemokine CCL5/genetics , Databases, Genetic , Gene Expression Regulation , Genome-Wide Association Study , Humans , Models, Theoretical , Regulatory Sequences, Nucleic Acid , Sensitivity and Specificity
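The AUC quoted in the abstract above is the area under the ROC curve, which equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition follows; the function name `auc` and the scores are hypothetical and unrelated to the study's classifier.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical classifier scores for functional vs. neutral SNPs
functional = [0.9, 0.4]
neutral = [0.5, 0.1]
value = auc(functional, neutral)  # 3 of 4 pairs are ranked correctly
```

The quadratic loop is fine for small sets; a rank-based formulation would be preferable for large ones.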
8.
Proteins ; 79(7): 2086-96, 2011 Jul.
Article in English | MEDLINE | ID: mdl-21671271

ABSTRACT

Understanding protein function is one of the keys to understanding life at the molecular level. It is also important in the context of human disease because many conditions arise as a consequence of alterations of protein function. The recent availability of relatively inexpensive sequencing technology has resulted in thousands of complete or partially sequenced genomes with millions of functionally uncharacterized proteins. Such a large volume of data, combined with the lack of high-throughput experimental assays to functionally annotate proteins, contributes to the growing importance of automated function prediction. Here, we study proteins annotated by Gene Ontology (GO) terms and estimate the accuracy of functional transfer from protein sequence only. We find that the transfer of GO terms by pairwise sequence alignments is only moderately accurate, showing a surprisingly small influence of sequence identity (SID) in a broad range (30-100%). We developed and evaluated a new predictor of protein function, functional annotator (FANN), from amino acid sequence. The predictor exploits a multioutput neural network framework which is well suited to simultaneously modeling dependencies between functional terms. Experiments provide evidence that FANN-GO (predictor of GO terms; available from http://www.informatics.indiana.edu/predrag) outperforms standard methods such as transfer by global or local SID as well as GOtcha, a method that incorporates the structure of GO.


Subject(s)
Neural Networks, Computer , Proteins/chemistry , Proteins/physiology , Sequence Analysis, Protein/methods , Amino Acid Sequence , Animals , Databases, Protein , Humans , Models, Biological , Reproducibility of Results , Structure-Activity Relationship
9.
Nat Commun ; 12(1): 2224, 2021 Apr 13.
Article in English | MEDLINE | ID: mdl-33850126

ABSTRACT

Prioritizing genes for translation to therapeutics for common diseases has been challenging. Here, we propose an approach to identify drug targets with high probability of success by focusing on genes with both gain of function (GoF) and loss of function (LoF) mutations associated with opposing effects on phenotype (Bidirectional Effect Selected Targets, BEST). We find 98 BEST genes for a variety of indications. Drugs targeting those genes are 3.8-fold more likely to be approved than non-BEST genes. We focus on five genes (IGF1R, NPPC, NPR2, FGFR3, and SHOX) with evidence for bidirectional effects on stature. Rare protein-altering variants in those genes result in significantly increased risk for idiopathic short stature (ISS) (OR = 2.75, p = 3.99 × 10⁻⁸). Finally, using functional experiments, we demonstrate that adding an exogenous CNP analog (encoded by NPPC) rescues the phenotype, thus validating its potential as a therapeutic treatment for ISS. Our results show the value of looking for bidirectional effects to identify and validate drug targets.


Subject(s)
Genes , Pharmaceutical Preparations , Drug Discovery , Dwarfism/genetics , Genetic Association Studies , Humans , Natriuretic Peptide, C-Type/genetics , Phenotype , Receptor, Fibroblast Growth Factor, Type 3/genetics , Receptor, IGF Type 1/genetics , Receptors, Atrial Natriuretic Factor/genetics , Short Stature Homeobox Protein/genetics
10.
Proteins ; 72(3): 1030-7, 2008 Aug 15.
Article in English | MEDLINE | ID: mdl-18300252

ABSTRACT

One of the most important tasks of modern bioinformatics is the development of computational tools that can be used to understand and treat human disease. To date, a variety of methods have been explored, and algorithms for candidate gene prioritization are becoming increasingly useful. Here, we propose an algorithm for detecting gene-disease associations based on the human protein-protein interaction network, known gene-disease associations, protein sequence, and protein functional information at the molecular level. Our method, PhenoPred, is supervised: first, we mapped each gene/protein onto the spaces of disease and functional terms based on distance to all annotated proteins in the protein interaction network. We also encoded sequence, function, physicochemical, and predicted structural properties, such as secondary structure and flexibility. We then trained support vector machines to detect gene-disease associations for a number of terms in Disease Ontology and provided evidence that, despite the noise/incompleteness of experimental data and unfinished ontology of diseases, identification of candidate genes can be successful even when a large number of candidate disease terms are predicted simultaneously. AVAILABILITY: www.phenopred.org.


Subject(s)
Algorithms , Disease , Genes , Humans , Leukemia/genetics , Protein Interaction Mapping , ROC Curve
11.
Front Biosci ; 13: 3391-407, 2008 May 01.
Article in English | MEDLINE | ID: mdl-18508441

ABSTRACT

Advancements in high-throughput technology and computational power have brought about significant progress in our understanding of cellular processes, including an increased appreciation of the intricacies of disease. The computational biology community has made strides in characterizing human disease and implementing algorithms that will be used in translational medicine. Despite this progress, most of the identified biomarkers and proposed methodologies have still not achieved the sensitivity and specificity to be effectively used, for example, in population screening against various diseases. Here we review the current progress in computational methodology developed to exploit major high-throughput experimental platforms towards improved understanding of disease, and argue that an integrated model for biomarker discovery, predictive medicine and treatment is likely to be data-driven and personalized. In such an approach, major data collection is yet to be done and comprehensive computational models are yet to be developed.


Subject(s)
Computational Biology/trends , Disease/classification , Genetic Diseases, Inborn/classification , Proteins/genetics , Algorithms , Animals , Base Sequence , Cell Line , Disease Models, Animal , Humans , Polymorphism, Single Nucleotide , RNA/genetics , Terminology as Topic
12.
PLoS One ; 13(7): e0200008, 2018.
Article in English | MEDLINE | ID: mdl-29979746

ABSTRACT

Given the large and expanding quantity of publicly available sequencing data, it should be possible to extract incidence information for monogenic diseases from allele frequencies, provided one knows which mutations are causal. We tested this idea on a rare, monogenic, lysosomal storage disorder, Sanfilippo Type B (Mucopolysaccharidosis type IIIB). Sanfilippo Type B is caused by mutations in the gene encoding α-N-acetylglucosaminidase (NAGLU). There were 189 NAGLU missense variants found in the ExAC dataset that comprises roughly 60,000 individual exomes. Only 24 of the 189 missense variants were known to be pathogenic; the remaining 165 variants were of unknown significance (VUS), and their potential contribution to disease is unknown. To address this problem, we measured enzymatic activities of 164 NAGLU missense VUS in the ExAC dataset and developed a statistical framework for estimating disease incidence with associated confidence intervals. We found that 25% of VUS decreased the activity of NAGLU to levels consistent with Sanfilippo Type B pathogenic alleles. We found that a substantial fraction of Sanfilippo Type B incidence (67%) could be accounted for by novel mutations not previously identified in patients, illustrating the utility of combining functional activity data for VUS with population-wide allele frequency data in estimating disease incidence.


Subject(s)
Exome/genetics , Genetic Variation , Mucopolysaccharidosis III/genetics , Acetylglucosaminidase/chemistry , Acetylglucosaminidase/genetics , Acetylglucosaminidase/metabolism , Humans , Incidence , Models, Molecular , Mucopolysaccharidosis III/enzymology , Mutation, Missense , Protein Conformation
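The abstract above derives disease incidence from population allele frequencies but does not describe its statistical framework. The sketch below shows only the simplest textbook version for a recessive disorder, assuming Hardy-Weinberg equilibrium: if q is the combined frequency of all pathogenic alleles, the expected incidence is q². The function name and the frequencies are hypothetical, and the confidence intervals mentioned in the abstract are not modelled.

```python
def recessive_incidence(pathogenic_allele_freqs):
    """Expected incidence of a recessive disease under Hardy-Weinberg.

    Affected individuals carry two pathogenic alleles (homozygous or
    compound heterozygous), so incidence is the square of the combined
    pathogenic allele frequency.
    """
    q = sum(pathogenic_allele_freqs)
    if not 0.0 <= q <= 1.0:
        raise ValueError("combined allele frequency must lie in [0, 1]")
    return q * q


# Hypothetical allele frequencies for three pathogenic variants
freqs = [0.001, 0.0005, 0.0005]
incidence = recessive_incidence(freqs)  # about 4 per million births
```

Reclassifying a VUS as pathogenic adds its frequency to q, which is how functional data on previously uncharacterized variants can change the estimate.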
13.
Genome Biol ; 17(1): 184, 2016 Sep 07.
Article in English | MEDLINE | ID: mdl-27604469

ABSTRACT

BACKGROUND: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. RESULTS: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. CONCLUSIONS: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific and that different performance metrics can be used to probe the nature of accurate predictions, and it highlighted the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent.


Subject(s)
Computational Biology , Proteins/chemistry , Software , Structure-Activity Relationship , Algorithms , Databases, Protein , Gene Ontology , Humans , Molecular Sequence Annotation , Proteins/genetics
14.
Pac Symp Biocomput ; : 316-27, 2014.
Article in English | MEDLINE | ID: mdl-24297558

ABSTRACT

We propose a new kernel-based method for the classification of protein sequences and structures. We first represent each protein as a set of time series data using several structural, physicochemical, and predicted properties such as a sequence of consecutive dihedral angles, hydrophobicity indices, or predictions of disordered regions. A kernel function is then computed for pairs of proteins, exploiting the principles of vector quantization and subsequently used with support vector machines for protein classification. Although our method requires a significant pre-processing step, it is fast in the training and prediction stages owing to the linear complexity of kernel computation with the length of protein sequences. We evaluate our approach on two protein classification tasks involving the prediction of SCOP structural classes and catalytic activity according to the Gene Ontology. We provide evidence that the method is competitive when compared to string kernels, and useful for a range of protein classification tasks. Furthermore, the applicability of our approach extends beyond computational biology to any classification of time series data.


Subject(s)
Proteins/chemistry , Proteins/genetics , Algorithms , Amino Acid Sequence , Bacterial Proteins/chemistry , Bacterial Proteins/classification , Bacterial Proteins/genetics , Computational Biology , DNA Helicases/chemistry , DNA Helicases/classification , DNA Helicases/genetics , Data Mining/statistics & numerical data , Fourier Analysis , Gene Ontology/statistics & numerical data , Hydrophobic and Hydrophilic Interactions , Proteins/classification , Structural Homology, Protein , Support Vector Machine , Thermus thermophilus/enzymology , Thermus thermophilus/genetics