1.
Hum Mutat ; 40(9): 1519-1529, 2019 09.
Article in English | MEDLINE | ID: mdl-31342580

ABSTRACT

The NAGLU challenge of the fourth edition of the Critical Assessment of Genome Interpretation experiment (CAGI4) in 2016 invited participants to predict the impact of variants of unknown significance (VUS) on the enzymatic activity of the lysosomal hydrolase α-N-acetylglucosaminidase (NAGLU). Deficiencies in NAGLU activity lead to a rare, monogenic, recessive lysosomal storage disorder, Sanfilippo syndrome type B (MPS type IIIB). This challenge attracted 17 submissions from 10 groups. We observed that top models were able to predict the impact of missense mutations on enzymatic activity with Pearson's correlation coefficients of up to 0.61. We also observed that top methods were significantly more correlated with each other than they were with observed enzymatic activity values, which we believe speaks to the importance of sequence conservation across the different methods. Improved functional predictions on the VUS will help population-scale analysis of disease epidemiology and rare variant association analysis.
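As an illustration of the evaluation described in this abstract, the following minimal sketch computes Pearson's correlation coefficient between predicted and measured enzymatic activities. The function name `pearson_r` and the toy activity values are hypothetical; they are not taken from the CAGI4 assessment.

```python
import math


def pearson_r(predicted, observed):
    """Return Pearson's correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_o)


# Hypothetical example: predicted vs. measured relative activity (0 = none, 1 = wild type).
predicted_activity = [0.10, 0.45, 0.80, 0.95, 0.30]
measured_activity = [0.05, 0.60, 0.70, 1.00, 0.55]
print(round(pearson_r(predicted_activity, measured_activity), 3))
```

The same function can be applied to two sets of predictions to check how strongly methods agree with each other, which is the comparison the abstract reports alongside agreement with the measurements.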


Subject(s)
Acetylglucosaminidase/metabolism, Computational Biology/methods, Missense Mutation, Acetylglucosaminidase/genetics, Humans, Genetic Models, Regression Analysis
2.
Proc Natl Acad Sci U S A ; 111(37): 13361-6, 2014 Sep 16.
Article in English | MEDLINE | ID: mdl-25157146

ABSTRACT

Pseudogenes are degraded fossil copies of genes. Here, we report a comparison of pseudogenes spanning three phyla, leveraging the completed annotations of the human, worm, and fly genomes, which we make available as an online resource. We find that pseudogenes are lineage specific, much more so than protein-coding genes, reflecting the different remodeling processes marking each organism's genome evolution. The majority of human pseudogenes are processed, resulting from a retrotranspositional burst at the dawn of the primate lineage. This burst can be seen in the largely uniform distribution of pseudogenes across the genome, their preservation in areas with low recombination rates, and their preponderance in highly expressed gene families. In contrast, worm and fly pseudogenes tell a story of numerous duplication events. In worm, these duplications have been preserved through selective sweeps, so we see a large number of pseudogenes associated with highly duplicated families such as chemoreceptors. However, in fly, the large effective population size and high deletion rate resulted in a depletion of the pseudogene complement. Despite large variations between these species, we also find notable similarities. Overall, we identify a broad spectrum of biochemical activity for pseudogenes, with the majority in each organism exhibiting varying degrees of partial activity. In particular, we identify a consistent amount of transcription (∼15%) across all species, suggesting a uniform degradation process. Also, we see a uniform decay of pseudogene promoter activity relative to their coding counterparts and identify a number of pseudogenes with conserved upstream sequences and activity, hinting at potential regulatory roles.


Subject(s)
Caenorhabditis elegans/genetics, Drosophila melanogaster/genetics, Phylogeny, Pseudogenes/genetics, Animals, Molecular Evolution, Genetic Association Studies, Humans, Molecular Sequence Annotation, Genetic Promoter Regions/genetics, Nucleic Acid Sequence Homology
3.
Nat Methods ; 10(3): 221-7, 2013 Mar.
Article in English | MEDLINE | ID: mdl-23353650

ABSTRACT

Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.


Subject(s)
Computational Biology/methods, Molecular Biology/methods, Molecular Sequence Annotation, Proteins/physiology, Algorithms, Animals, Protein Databases, Exoribonucleases/classification, Exoribonucleases/genetics, Exoribonucleases/physiology, Forecasting, Humans, Proteins/chemistry, Proteins/classification, Proteins/genetics, Species Specificity
4.
Bioinformatics ; 30(17): i609-16, 2014 Sep 01.
Article in English | MEDLINE | ID: mdl-25161254

ABSTRACT

MOTIVATION: The automated functional annotation of biological macromolecules is a problem of computational assignment of biological concepts or ontological terms to genes and gene products. A number of methods have been developed to computationally annotate genes using standardized nomenclature such as Gene Ontology (GO). However, questions remain about the possibility for development of accurate methods that can integrate disparate molecular data as well as about an unbiased evaluation of these methods. One important concern is that experimental annotations of proteins are incomplete. This raises questions as to whether and to what degree currently available data can be reliably used to train computational models and estimate their performance accuracy. RESULTS: We study the effect of incomplete experimental annotations on the reliability of performance evaluation in protein function prediction. Using the structured-output learning framework, we provide theoretical analyses and carry out simulations to characterize the effect of growing experimental annotations on the correctness and stability of performance estimates corresponding to different types of methods. We then analyze real biological data by simulating the prediction, evaluation and subsequent re-evaluation (after additional experimental annotations become available) of GO term predictions. Our results agree with previous observations that incomplete and accumulating experimental annotations have the potential to significantly impact accuracy assessments. We find that their influence reflects a complex interplay between the prediction algorithm, performance metric and underlying ontology. However, using the available experimental data and under realistic assumptions, our results also suggest that current large-scale evaluations are meaningful and almost surprisingly reliable. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Proteins/physiology, Algorithms, Computational Biology/methods, Gene Ontology, Molecular Sequence Annotation, Proteins/genetics, Sequence Alignment
5.
Bioinformatics ; 29(13): i53-61, 2013 Jul 01.
Article in English | MEDLINE | ID: mdl-23813009

ABSTRACT

MOTIVATION: The development of effective methods for the prediction of ontological annotations is an important goal in computational biology, with protein function prediction and disease gene prioritization gaining wide recognition. Although various algorithms have been proposed for these tasks, evaluating their performance is difficult owing to problems caused both by the structure of biomedical ontologies and biased or incomplete experimental annotations of genes and gene products. RESULTS: We propose an information-theoretic framework to evaluate the performance of computational protein function prediction. We use a Bayesian network, structured according to the underlying ontology, to model the prior probability of a protein's function. We then define two concepts, misinformation and remaining uncertainty, that can be seen as information-theoretic analogs of precision and recall. Finally, we propose a single statistic, referred to as semantic distance, that can be used to rank classification models. We evaluate our approach by analyzing the performance of three protein function predictors of Gene Ontology terms and provide evidence that it addresses several weaknesses of currently used metrics. We believe this framework provides useful insights into the performance of protein function prediction tools. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
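The abstract names remaining uncertainty, misinformation, and semantic distance without giving formulas. The sketch below shows one way such quantities could be computed for a single set of predicted terms. It rests on stated assumptions: each ontology term carries a nonnegative information value, remaining uncertainty sums the values of true terms that were not predicted, misinformation sums the values of predicted terms that are not true, and the distance is the k-norm of the two. The term names, information values, and function names are hypothetical.

```python
def remaining_uncertainty(true_terms, predicted_terms, info):
    """Total information of true terms missing from the prediction (recall-like)."""
    return sum(info[t] for t in set(true_terms) - set(predicted_terms))


def misinformation(true_terms, predicted_terms, info):
    """Total information of predicted terms that are not true (precision-like)."""
    return sum(info[t] for t in set(predicted_terms) - set(true_terms))


def semantic_distance(ru, mi, k=2):
    """Combine the two quantities into one score; k=2 gives the Euclidean norm (assumed)."""
    return (ru ** k + mi ** k) ** (1.0 / k)


# Hypothetical per-term information values (in bits) and annotations.
info = {"term_a": 1.0, "term_b": 2.0, "term_c": 3.0, "term_d": 4.0}
true_terms = {"term_a", "term_b"}
predicted_terms = {"term_b", "term_c"}

ru = remaining_uncertainty(true_terms, predicted_terms, info)
mi = misinformation(true_terms, predicted_terms, info)
print(ru, mi, semantic_distance(ru, mi))
```

For a predictor that outputs scores, one would typically evaluate these quantities at several decision thresholds and report the smallest distance; that step is omitted here because it depends on details the abstract does not state.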


Subject(s)
Gene Ontology, Molecular Sequence Annotation, Proteins/physiology, Algorithms, Bayes Theorem, Statistical Data Interpretation, Genes, Humans, Proteins/chemistry, Proteins/genetics
6.
J Neurointerv Surg ; 2024 Jul 31.
Article in English | MEDLINE | ID: mdl-39084857

ABSTRACT

BACKGROUND: Ischemic stroke is a leading cause of death and significant long-term disability worldwide. Mechanical thrombectomy is emerging as a standard treatment for eligible patients. As clinical implementation of stent retrieval and aspiration thrombectomy increases, there is a need for physiologically relevant in vitro device efficacy testing. Critical to this testing is the development of standardized 'soft' and 'hard' synthetic blood clots that mimic the properties of human thrombi and are compatible with imaging technologies. Synthetic clots allow researchers to extract information regarding clot integration, model hemodynamics, and quantify the physics of thrombectomy. METHODS: This work develops polyacrylamide and alginate-based synthetic clots that are compatible with particle image velocimetry (PIV) and radiographic imaging techniques while maintaining mechanical properties of 'soft' and 'hard' human clots. Dynamic mechanical analysis testing using an HR2-Rheometer demonstrates comparable mechanical properties to human clots previously tested by this research group and provided in existing literature. RESULTS: The synthetic clots are formulated with either 0.5% w/v polyethylene microspheres for PIV visualization or 20% w/v barium sulfate for angiographic visualization, enabling real-time imaging of clot behavior during thrombectomy simulations. The soft formulation shows compressive and shear properties of ~12 kPa and 2-3 kPa, respectively. The hard clots are 3-4 times stiffer, with compressive and shear properties of 41-42 kPa and 8-9 kPa, respectively. CONCLUSION: Standardized synthetic clots offer a platform for reproducible device testing. This provides a greater understanding of mechanical thrombectomy device efficacy, which may lead to quantifiable advances in device development and eventual improved clinical outcomes.

7.
Mol Ther Methods Clin Dev ; 32(3): 101294, 2024 Sep 12.
Article in English | MEDLINE | ID: mdl-39104575

ABSTRACT

Adeno-associated virus (AAV)-based vectors are used clinically for gene transfer and persist as extrachromosomal episomes. A small fraction of vector genomes integrate into the host genome, but the theoretical risk of tumorigenesis depends on vector regulatory features. A mouse model was used to investigate integration profiles of an AAV serotype 5 (AAV5) vector produced using Sf and HEK293 cells that mimic key features of valoctocogene roxaparvovec (AAV5-hFVIII-SQ), a gene therapy for severe hemophilia A. The majority (95%) of vector genome reads were derived from episomes, and mean (± standard deviation) integration frequency was 2.70 ± 1.26 and 1.79 ± 0.86 integrations per 1,000 cells for Sf- and HEK293-produced vector. Longitudinal integration analysis suggested integrations occur primarily within 1 week, at low frequency, and their abundance was stable over time. Integration profiles were polyclonal and randomly distributed. No major differences in integration profiles were observed for either vector production platform, and no integrations were associated with clonal expansion. Integrations were enriched near transcription start sites of genes highly expressed in the liver (p = 1 × 10⁻⁴) and less enriched for genes of lower expression. We found no evidence of tumorigenesis or fibrosis caused by the vector integrations.

8.
Genome Biol ; 24(1): 172, 2023 07 21.
Article in English | MEDLINE | ID: mdl-37480112

ABSTRACT

BACKGROUND: Metachromatic leukodystrophy (MLD) is a lysosomal storage disorder caused by mutations in the arylsulfatase A gene (ARSA) and categorized into three subtypes according to age of onset. The functional effect of most ARSA mutants remains unknown; better understanding of the genotype-phenotype relationship is required to support newborn screening (NBS) and guide treatment. RESULTS: We collected a patient data set from the literature that relates disease severity to ARSA genotype in 489 individuals with MLD. Patient-based data were used to develop a phenotype matrix that predicts MLD phenotype given ARSA alleles in a patient's genotype with 76% accuracy. We then employed a high-throughput enzyme activity assay using mass spectrometry to explore the function of ARSA variants from the curated patient data set and the Genome Aggregation Database (gnomAD). We observed evidence that 36% of variants of unknown significance (VUS) in ARSA may be pathogenic. By classifying functional effects for 251 VUS from gnomAD, we reduced the incidence of genotypes of unknown significance (GUS) by over 98.5% in the overall population. CONCLUSIONS: These results provide an additional tool for clinicians to anticipate the disease course in MLD patients, identifying individuals at high risk of severe disease to support treatment access. Our results suggest that more than 1 in 3 VUS in ARSA may be pathogenic. We show that combining genetic and biochemical information increases diagnostic yield. Our strategy may apply to other recessive diseases, providing a tool to address the challenge of interpreting VUS within genotype-phenotype relationships and NBS.


Subject(s)
Metachromatic Leukodystrophy, Humans, Metachromatic Leukodystrophy/diagnosis, Metachromatic Leukodystrophy/genetics, Phenotype, Genotype, Alleles, Patient Acuity
9.
PLoS Comput Biol ; 7(6): e1002073, 2011 Jun.
Article in English | MEDLINE | ID: mdl-21695233

ABSTRACT

A common assumption in comparative genomics is that orthologous genes share greater functional similarity than do paralogous genes (the "ortholog conjecture"). Many methods used to computationally predict protein function are based on this assumption, even though it is largely untested. Here we present the first large-scale test of the ortholog conjecture using comparative functional genomic data from human and mouse. We use the experimentally derived functions of more than 8,900 genes, as well as an independent microarray dataset, to directly assess our ability to predict function using both orthologs and paralogs. Both datasets show that paralogs are often a much better predictor of function than are orthologs, even at lower sequence identities. Among paralogs, those found within the same species are consistently more functionally similar than those found in a different species. We also find that paralogous pairs residing on the same chromosome are more functionally similar than those on different chromosomes, perhaps due to higher levels of interlocus gene conversion between these pairs. In addition to offering implications for the computational prediction of protein function, our results shed light on the relationship between sequence divergence and functional divergence. We conclude that the most important factor in the evolution of function is not amino acid sequence, but rather the cellular context in which proteins act.


Subject(s)
Comparative Genomic Hybridization, Molecular Evolution, Genes, Animals, Gene Dosage, Gene Expression Profiling, Humans, Mice, Oligonucleotide Array Sequence Analysis, Proteins/genetics
10.
Hum Mutat ; 32(10): 1183-90, 2011 Oct.
Article in English | MEDLINE | ID: mdl-21796725

ABSTRACT

Next-generation sequencing (NGS) technologies are yielding ever higher volumes of human genome sequence data. Given this large amount of data, it has become both a possibility and a priority to determine how disease-causing single nucleotide polymorphisms (SNPs) detected within gene regulatory regions (rSNPs) exert their effects on gene expression. Recently, several studies have explored whether disease-causing polymorphisms have attributes that can distinguish them from those that are neutral, attaining moderate success at discriminating between functional and putatively neutral regulatory SNPs. Here, we have extended this work by assessing the utility of both SNP-based features (those associated only with the polymorphism site and the surrounding DNA) and gene-based features (those derived from the associated gene in whose regulatory region the SNP lies) in the identification of functional regulatory polymorphisms involved in either monogenic or complex disease. Gene-based features were found to be capable of both augmenting and enhancing the utility of SNP-based features in the prediction of known regulatory mutations. Adopting this approach, we achieved an AUC of 0.903 for predicting regulatory SNPs. Finally, our tool predicted 225 new regulatory SNPs with a high degree of confidence, with 105 of the 225 falling into linkage disequilibrium blocks of reported disease-associated genome-wide association studies SNPs.


Subject(s)
Inborn Genetic Diseases/genetics, Single Nucleotide Polymorphism, Alleles, Chemokine CCL5/genetics, Genetic Databases, Gene Expression Regulation, Genome-Wide Association Study, Humans, Theoretical Models, Nucleic Acid Regulatory Sequences, Sensitivity and Specificity
11.
Proteins ; 79(7): 2086-96, 2011 Jul.
Article in English | MEDLINE | ID: mdl-21671271

ABSTRACT

Understanding protein function is one of the keys to understanding life at the molecular level. It is also important in the context of human disease because many conditions arise as a consequence of alterations of protein function. The recent availability of relatively inexpensive sequencing technology has resulted in thousands of complete or partially sequenced genomes with millions of functionally uncharacterized proteins. Such a large volume of data, combined with the lack of high-throughput experimental assays to functionally annotate proteins, contributes to the growing importance of automated function prediction. Here, we study proteins annotated by Gene Ontology (GO) terms and estimate the accuracy of functional transfer from protein sequence only. We find that the transfer of GO terms by pairwise sequence alignments is only moderately accurate, showing a surprisingly small influence of sequence identity (SID) in a broad range (30-100%). We developed and evaluated a new predictor of protein function, functional annotator (FANN), from amino acid sequence. The predictor exploits a multioutput neural network framework which is well suited to simultaneously modeling dependencies between functional terms. Experiments provide evidence that FANN-GO (predictor of GO terms; available from http://www.informatics.indiana.edu/predrag) outperforms standard methods such as transfer by global or local SID as well as GOtcha, a method that incorporates the structure of GO.


Subject(s)
Neural Networks (Computer), Proteins/chemistry, Proteins/physiology, Protein Sequence Analysis/methods, Amino Acid Sequence, Animals, Protein Databases, Humans, Biological Models, Reproducibility of Results, Structure-Activity Relationship
12.
Nat Commun ; 12(1): 2224, 2021 04 13.
Article in English | MEDLINE | ID: mdl-33850126

ABSTRACT

Prioritizing genes for translation to therapeutics for common diseases has been challenging. Here, we propose an approach to identify drug targets with high probability of success by focusing on genes with both gain of function (GoF) and loss of function (LoF) mutations associated with opposing effects on phenotype (Bidirectional Effect Selected Targets, BEST). We find 98 BEST genes for a variety of indications. Drugs targeting those genes are 3.8-fold more likely to be approved than non-BEST genes. We focus on five genes (IGF1R, NPPC, NPR2, FGFR3, and SHOX) with evidence for bidirectional effects on stature. Rare protein-altering variants in those genes result in significantly increased risk for idiopathic short stature (ISS) (OR = 2.75, p = 3.99 × 10⁻⁸). Finally, using functional experiments, we demonstrate that adding an exogenous CNP analog (encoded by NPPC) rescues the phenotype, thus validating its potential as a therapeutic treatment for ISS. Our results show the value of looking for bidirectional effects to identify and validate drug targets.


Subject(s)
Genes, Pharmaceutical Preparations, Drug Discovery, Dwarfism/genetics, Genetic Association Studies, Humans, C-Type Natriuretic Peptide/genetics, Phenotype, Fibroblast Growth Factor Receptor Type 3/genetics, IGF Type 1 Receptor/genetics, Atrial Natriuretic Factor Receptors/genetics, Short Stature Homeobox Protein/genetics
13.
Mol Genet Metab Rep ; 21: 100524, 2019 Dec.
Article in English | MEDLINE | ID: mdl-31720227

ABSTRACT

INTRODUCTION: GM1 gangliosidosis is a rare autosomal recessive genetic disorder caused by the disruption of the GLB1 gene that encodes β-galactosidase, a lysosomal hydrolase that removes β-linked galactose from the non-reducing end of glycans. Deficiency of this catabolic enzyme leads to the lysosomal accumulation of GM1 and its asialo derivative GA1 in β-galactosidase deficient patients and animal models. In addition to GM1 and GA1, there are other glycoconjugates that contain β-linked galactose whose metabolites are substrates for β-galactosidase. For example, a number of N-linked glycan structures that have galactose at their non-reducing end have been shown to accumulate in GM1 gangliosidosis patient tissues and biological fluids. OBJECTIVE: In this study, we attempt to fully characterize the broad array of GLB1 substrates that require GLB1 for their lysosomal turnover. RESULTS: Using tandem mass spectrometry and glycan reductive isotope labeling with data-dependent mass spectrometry, we have confirmed the accumulation of glycolipids (GM1 and GA1) and N-linked glycans with terminal beta-linked galactose. We have also discovered a novel set of core 1 and 2 O-linked glycan metabolites, many of which are part of structurally-related isobaric series that accumulate in disease. In the brain of GLB1 null mice, the levels of these glycan metabolites increased along with those of both GM1 and GA1 as a function of age. In addition to brain tissue, we found elevated levels of both N-linked and O-linked glycan metabolites in a number of peripheral tissues and in urine. Both brain and urine samples from human GM1 gangliosidosis patients exhibited large increases in steady state levels for the same glycan metabolites, demonstrating their correlation with this disease in humans as well.
CONCLUSIONS: Our studies illustrate that GLB1 deficiency is not purely a ganglioside accumulation disorder, but instead a broad oligosaccharidosis that includes representatives of many β-linked galactose containing glycans and glycoconjugates including glycolipids, N-linked glycans, and various O-linked glycans. Accounting for all β-galactosidase substrates that accumulate when this enzyme is deficient increases our understanding of this severe disorder by identifying metabolites that may drive certain aspects of the disease and may also serve as informative disease biomarkers to fully evaluate the efficacy of future therapies.

14.
Proteins ; 72(3): 1030-7, 2008 Aug 15.
Article in English | MEDLINE | ID: mdl-18300252

ABSTRACT

One of the most important tasks of modern bioinformatics is the development of computational tools that can be used to understand and treat human disease. To date, a variety of methods have been explored and algorithms for candidate gene prioritization are gaining in usefulness. Here, we propose an algorithm for detecting gene-disease associations based on the human protein-protein interaction network, known gene-disease associations, protein sequence, and protein functional information at the molecular level. Our method, PhenoPred, is supervised: first, we mapped each gene/protein onto the spaces of disease and functional terms based on distance to all annotated proteins in the protein interaction network. We also encoded sequence, function, physicochemical, and predicted structural properties, such as secondary structure and flexibility. We then trained support vector machines to detect gene-disease associations for a number of terms in Disease Ontology and provided evidence that, despite the noise/incompleteness of experimental data and unfinished ontology of diseases, identification of candidate genes can be successful even when a large number of candidate disease terms are predicted simultaneously. AVAILABILITY: www.phenopred.org.


Subject(s)
Algorithms, Disease, Genes, Humans, Leukemia/genetics, Protein Interaction Mapping, ROC Curve
15.
Front Biosci ; 13: 3391-407, 2008 May 01.
Article in English | MEDLINE | ID: mdl-18508441

ABSTRACT

Advancements in high-throughput technology and computational power have brought about significant progress in our understanding of cellular processes, including an increased appreciation of the intricacies of disease. The computational biology community has made strides in characterizing human disease and implementing algorithms that will be used in translational medicine. Despite this progress, most of the identified biomarkers and proposed methodologies have still not achieved the sensitivity and specificity to be effectively used, for example, in population screening against various diseases. Here we review the current progress in computational methodology developed to exploit major high-throughput experimental platforms towards improved understanding of disease, and argue that an integrated model for biomarker discovery, predictive medicine and treatment is likely to be data-driven and personalized. In such an approach, major data collection is yet to be done and comprehensive computational models are yet to be developed.


Subject(s)
Computational Biology/trends, Disease/classification, Inborn Genetic Diseases/classification, Proteins/genetics, Algorithms, Animals, Base Sequence, Cell Line, Animal Disease Models, Humans, Single Nucleotide Polymorphism, RNA/genetics, Terminology as Topic
16.
PLoS One ; 13(7): e0200008, 2018.
Article in English | MEDLINE | ID: mdl-29979746

ABSTRACT

Given the large and expanding quantity of publicly available sequencing data, it should be possible to extract incidence information for monogenic diseases from allele frequencies, provided one knows which mutations are causal. We tested this idea on a rare, monogenic, lysosomal storage disorder, Sanfilippo Type B (Mucopolysaccharidosis type IIIB). Sanfilippo Type B is caused by mutations in the gene encoding α-N-acetylglucosaminidase (NAGLU). There were 189 NAGLU missense variants found in the ExAC dataset that comprises roughly 60,000 individual exomes. Only 24 of the 189 missense variants were known to be pathogenic; the remaining 165 variants were of unknown significance (VUS), and their potential contribution to disease is unknown. To address this problem, we measured enzymatic activities of 164 NAGLU missense VUS in the ExAC dataset and developed a statistical framework for estimating disease incidence with associated confidence intervals. We found that 25% of VUS decreased the activity of NAGLU to levels consistent with Sanfilippo Type B pathogenic alleles. We found that a substantial fraction of Sanfilippo Type B incidence (67%) could be accounted for by novel mutations not previously identified in patients, illustrating the utility of combining functional activity data for VUS with population-wide allele frequency data in estimating disease incidence.
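The idea of deriving incidence from allele frequencies can be sketched as follows. This is a simplified illustration and not the statistical framework of the study: it assumes Hardy-Weinberg equilibrium, a fully penetrant recessive disorder, and independent pathogenic alleles, none of which the abstract states. It also omits confidence intervals. The function name and the allele frequencies are hypothetical.

```python
def recessive_incidence(pathogenic_allele_freqs):
    """Estimate incidence at birth of a recessive disorder from pathogenic allele frequencies.

    Under Hardy-Weinberg equilibrium (an assumption of this sketch), an affected
    individual carries two pathogenic alleles, so incidence is q**2, where q is the
    combined frequency of all pathogenic alleles.
    """
    q = sum(pathogenic_allele_freqs)
    if not 0.0 <= q <= 1.0:
        raise ValueError("combined allele frequency must lie between 0 and 1")
    return q * q


# Hypothetical frequencies: alleles already known to be pathogenic, plus VUS
# reclassified as pathogenic on the basis of measured enzymatic activity.
known_pathogenic = [0.0004, 0.0002]
reclassified_vus = [0.0003, 0.0001]

incidence_known = recessive_incidence(known_pathogenic)
incidence_all = recessive_incidence(known_pathogenic + reclassified_vus)
print(incidence_known, incidence_all)
```

Because incidence grows with the square of the combined allele frequency, adding reclassified VUS can raise the estimate substantially, which is consistent with the abstract's point that functional data for VUS matter when estimating incidence.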


Subject(s)
Exome/genetics, Genetic Variation, Mucopolysaccharidosis III/genetics, Acetylglucosaminidase/chemistry, Acetylglucosaminidase/genetics, Acetylglucosaminidase/metabolism, Humans, Incidence, Molecular Models, Mucopolysaccharidosis III/enzymology, Missense Mutation, Protein Conformation
17.
Genome Biol ; 17(1): 184, 2016 09 07.
Article in English | MEDLINE | ID: mdl-27604469

ABSTRACT

BACKGROUND: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. RESULTS: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. CONCLUSIONS: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent.


Subject(s)
Computational Biology, Proteins/chemistry, Software, Structure-Activity Relationship, Algorithms, Protein Databases, Gene Ontology, Humans, Molecular Sequence Annotation, Proteins/genetics
18.
Nat Commun ; 6: 7256, 2015 Jun 01.
Article in English | MEDLINE | ID: mdl-26028266

ABSTRACT

Investigating genomic structural variants at basepair resolution is crucial for understanding their formation mechanisms. We identify and analyse 8,943 deletion breakpoints in 1,092 samples from the 1000 Genomes Project. We find breakpoints have more nearby SNPs and indels than the genomic average, likely a consequence of relaxed selection. By investigating the correlation of breakpoints with DNA methylation, Hi-C interactions, and histone marks and the substitution patterns of nucleotides near them, we find that breakpoints with the signature of non-allelic homologous recombination (NAHR) are associated with open chromatin. We hypothesize that some NAHR deletions occur without DNA replication and cell division, in embryonic and germline cells. In contrast, breakpoints associated with non-homologous (NH) mechanisms often have sequence microinsertions, templated from later replicating genomic sites, spaced at two characteristic distances from the breakpoint. These microinsertions are consistent with template-switching events and suggest a particular spatiotemporal configuration for DNA during the events.


Subject(s)
Chromosome Breakpoints, DNA/metabolism, Gene Deletion, Human Genome/genetics, Chromatin, DNA Replication, Homologous Recombination, Humans, Mutation, Nucleotides, Sequence Deletion
19.
Pac Symp Biocomput ; : 316-27, 2014.
Article in English | MEDLINE | ID: mdl-24297558

ABSTRACT

We propose a new kernel-based method for the classification of protein sequences and structures. We first represent each protein as a set of time series data using several structural, physicochemical, and predicted properties such as a sequence of consecutive dihedral angles, hydrophobicity indices, or predictions of disordered regions. A kernel function is then computed for pairs of proteins, exploiting the principles of vector quantization and subsequently used with support vector machines for protein classification. Although our method requires a significant pre-processing step, it is fast in the training and prediction stages owing to the linear complexity of kernel computation with the length of protein sequences. We evaluate our approach on two protein classification tasks involving the prediction of SCOP structural classes and catalytic activity according to the Gene Ontology. We provide evidence that the method is competitive when compared to string kernels, and useful for a range of protein classification tasks. Furthermore, the applicability of our approach extends beyond computational biology to any classification of time series data.


Subject(s)
Proteins/chemistry, Proteins/genetics, Algorithms, Amino Acid Sequence, Bacterial Proteins/chemistry, Bacterial Proteins/classification, Bacterial Proteins/genetics, Computational Biology, DNA Helicases/chemistry, DNA Helicases/classification, DNA Helicases/genetics, Data Mining/statistics & numerical data, Fourier Analysis, Gene Ontology/statistics & numerical data, Hydrophobic and Hydrophilic Interactions, Proteins/classification, Structural Homology of Proteins, Support Vector Machine, Thermus thermophilus/enzymology, Thermus thermophilus/genetics