1.
Brief Bioinform ; 21(5): 1596-1608, 2020 Sep 25.
Article in English | MEDLINE | ID: mdl-32978619

ABSTRACT

Bacterial proteins dubbed virulence factors (VFs) are a highly diverse group of sequences, whose only obvious commonality is the very property of being, more or less directly, involved in virulence. It is therefore tempting to speculate whether their prediction, based on direct sequence similarity (seqsim) to known VFs, could be enhanced or even replaced by using machine-learning methods. Specifically, when trained on a large and diverse set of VFs, such methods may be able to detect putative, non-trivial characteristics shared by otherwise unrelated VF families and therefore better predict novel VFs with insignificant similarity to each individual family. We therefore first reassess the performance of dimer-based Support Vector Machines, as employed in the widely used MP3 method, in light of seqsim-only and seqsim/dimer-hybrid classifiers. We then repeat the analysis with a novel, considerably more diverse data set, also addressing the important problem of negative data selection. Finally, we move on to the real-world use case of proteome-wide VF prediction, outlining different approaches to estimating specificity in this scenario. We find that direct seqsim is of unparalleled importance and therefore should always be exploited. Further, we observe strikingly low correlations between different feature and classifier types when ranking proteins by VF likeness. We therefore propose a 'best of each world' approach to prioritize proteins for experimental testing, focussing on the top predictions of each classifier. Further, classifiers for individual VF families should be developed.


Subjects
Bacteria/pathogenicity; Bacterial Proteins/metabolism; Support Vector Machine; Virulence Factors/metabolism; Algorithms; Amino Acid Sequence; Bacterial Proteins/chemistry; Datasets as Topic; Dimerization; Proteome; Virulence Factors/chemistry
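
The 'best of each world' prioritization described in the abstract above can be illustrated with a minimal Python sketch. It assumes that each classifier yields one score per protein, with higher scores meaning more VF-like. The function name, classifier names, protein IDs, scores and the cut-off `k` are all hypothetical and are not taken from the paper.

```python
def best_of_each_world(scores_by_classifier, k):
    """Return the union of each classifier's top-k proteins.

    scores_by_classifier: dict mapping classifier name -> {protein_id: score}.
    Proteins are kept in the order in which they are first selected.
    """
    selected = []
    seen = set()
    for name, scores in scores_by_classifier.items():
        # Rank by descending score; break ties by protein ID for determinism.
        ranked = sorted(scores, key=lambda p: (-scores[p], p))
        for protein in ranked[:k]:
            if protein not in seen:
                seen.add(protein)
                selected.append(protein)
    return selected


# Hypothetical scores from two classifiers whose rankings disagree.
example_scores = {
    "seqsim": {"P1": 0.9, "P2": 0.2, "P3": 0.1, "P4": 0.8},
    "dimer_svm": {"P1": 0.3, "P2": 0.7, "P3": 0.95, "P4": 0.1},
}
candidates = best_of_each_world(example_scores, k=2)
```

Taking a union of per-classifier top lists, rather than averaging the scores, reflects the reported low correlation between rankings: a protein ranked highly by only one classifier would be diluted by averaging but is retained here.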
2.
Bioinformatics ; 36(1): 81-89, 2020 Jan 01.
Article in English | MEDLINE | ID: mdl-31298694

ABSTRACT

MOTIVATION: We expect novel pathogens to arise due to their fast-paced evolution, and new species to be discovered thanks to advances in DNA sequencing and metagenomics. Moreover, recent developments in synthetic biology raise concerns that some strains of bacteria could be modified for malicious purposes. Traditional approaches to open-view pathogen detection depend on databases of known organisms, which limits their performance on unknown, unrecognized and unmapped sequences. In contrast, machine learning methods can infer pathogenic phenotypes from single NGS reads, even though the biological context is unavailable. RESULTS: We present DeePaC, a Deep Learning Approach to Pathogenicity Classification. It includes a flexible framework allowing easy evaluation of neural architectures with reverse-complement parameter sharing. We show that convolutional neural networks and LSTMs outperform the state-of-the-art based on both sequence homology and machine learning. Combining a deep learning approach with integrating the predictions for both mates in a read pair results in cutting the error rate almost in half in comparison to the previous state-of-the-art. AVAILABILITY AND IMPLEMENTATION: The code and the models are available at: https://gitlab.com/rki_bioinformatics/DeePaC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Neural Networks, Computer; DNA; Deep Learning; Sequence Analysis, DNA
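
Two ideas from the abstract above, strand symmetry and integrating the predictions for both mates of a read pair, can be sketched in a few lines of Python. This is not the DeePaC implementation: the abstract describes reverse-complement parameter sharing inside the network, whereas the sketch simply averages a score over both strands at prediction time, and it assumes that mate predictions are combined by a plain mean. `toy_score` is a hypothetical stand-in for a trained model.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(read):
    """Return the reverse complement of a DNA read (upper-case ACGTN)."""
    return read.translate(COMPLEMENT)[::-1]


def strand_symmetric_score(read, score_fn):
    """Average a model's score over a read and its reverse complement."""
    return 0.5 * (score_fn(read) + score_fn(reverse_complement(read)))


def read_pair_score(mate1, mate2, score_fn):
    """Combine the strand-symmetric scores of both mates (assumed: mean)."""
    return 0.5 * (
        strand_symmetric_score(mate1, score_fn)
        + strand_symmetric_score(mate2, score_fn)
    )


def toy_score(read):
    """Hypothetical stand-in for a trained classifier: fraction of 'A'."""
    return read.count("A") / len(read)


pair_score = read_pair_score("AAGT", "AAAA", toy_score)
```

Because the score is averaged over both orientations, a read and its reverse complement receive the same value, which is the property that parameter sharing builds into the network itself.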
3.
Brief Bioinform ; 16(6): 1045-56, 2015 Nov.
Article in English | MEDLINE | ID: mdl-25900849

ABSTRACT

There is a growing interest in the mechanisms and the prediction of how flexible peptides bind proteins, often in a highly selective and conserved manner. While both existing small-molecule docking methods and custom protocols can be used, even short peptides make difficult targets owing to their high torsional flexibility. Any benchmarking should therefore start with those. We compiled a meta-data set of 47 complexes with peptides up to five residues, based on 11 related studies from the past decade. Although their highly varying strategies and constraints preclude direct, quantitative comparisons, we still provide a comprehensive overview of the reported results, using a simple yet stringent measure: the quality of the top-scoring peptide pose. Using the entire data set, this is augmented by our own benchmark of AutoDock Vina, a freely available, fast and widely used docking tool. It particularly addresses non-expert users and was therefore implemented in a highly integrated manner. Guidelines addressing important issues such as the amount of sampling required for result reproducibility are so far lacking. Using peptide docking as an example, this is the first study to address these issues in detail. Finally, to encourage further, standardized benchmarking efforts, the compiled data set is made available in an accessible, transparent and extendable manner.


Subjects
Peptides/chemistry; Molecular Docking Simulation; Reproducibility of Results
4.
Nat Methods ; 10(3): 221-7, 2013 Mar.
Article in English | MEDLINE | ID: mdl-23353650

ABSTRACT

Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.


Subjects
Computational Biology/methods; Molecular Biology/methods; Molecular Sequence Annotation; Proteins/physiology; Algorithms; Animals; Databases, Protein; Exoribonucleases/classification; Exoribonucleases/genetics; Exoribonucleases/physiology; Forecasting; Humans; Proteins/chemistry; Proteins/classification; Proteins/genetics; Species Specificity
5.
Nucleic Acids Res ; 42(Database issue): D240-5, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24270792

ABSTRACT

Gene3D (http://gene3d.biochem.ucl.ac.uk) is a database of protein domain structure annotations for protein sequences. Domains are predicted using a library of profile HMMs from 2738 CATH superfamilies. Gene3D assigns domain annotations to Ensembl and UniProt sequence sets including >6000 cellular genomes and >20 million unique protein sequences. This represents an increase of 45% in the number of protein sequences since our last publication. Thanks to improvements in the underlying data and pipeline, we see large increases in the domain coverage of sequences. We have expanded this coverage by integrating Pfam and SUPERFAMILY domain annotations, and we now resolve domain overlaps to provide highly comprehensive composite multi-domain architectures. To make these data more accessible for comparative genome analyses, we have developed novel search algorithms for searching genomes to identify related multi-domain architectures. In addition to providing domain family annotations, we have now developed a pipeline for 3D homology modelling of domains in Gene3D. This has been applied to the human genome and will be rolled out to other major organisms over the next year.


Subjects
Databases, Protein; Molecular Sequence Annotation; Protein Structure, Tertiary; Genome; Genomics; Internet; Models, Molecular; Protein Structure, Tertiary/genetics; Sequence Analysis, Protein
6.
Nucleic Acids Res ; 41(Database issue): D490-8, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23203873

ABSTRACT

CATH version 3.5 (Class, Architecture, Topology, Homology, available at http://www.cathdb.info/) contains 173 536 domains, 2626 homologous superfamilies and 1313 fold groups. When focusing on structural genomics (SG) structures, we observe that the number of new folds for CATH v3.5 is slightly less than for previous releases, and this observation suggests that we may now know the majority of folds that are easily accessible to structure determination. We have improved the accuracy of our functional family (FunFams) sub-classification method and the CATH sequence domain search facility has been extended to provide FunFam annotations for each domain. The CATH website has been redesigned. We have improved the display of functional data and of conserved sequence features associated with FunFams within each CATH superfamily.


Subjects
Databases, Protein; Protein Structure, Tertiary; Genomics; Internet; Molecular Sequence Annotation; Protein Folding; Proteins/chemistry; Proteins/classification; Proteins/genetics; Sequence Alignment; Sequence Analysis, Protein; Structural Homology, Protein
7.
Nucleic Acids Res ; 40(Database issue): D465-71, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22139938

ABSTRACT

Gene3D (http://gene3d.biochem.ucl.ac.uk) is a comprehensive database of protein domain assignments for sequences from the major sequence databases. Domains are directly mapped from structures in the CATH database or predicted using a library of representative profile HMMs derived from CATH superfamilies. As previously described, Gene3D integrates many other protein family and function databases. These facilitate complex associations of molecular function, structure and evolution. Gene3D now includes a domain functional family (FunFam) level below the homologous superfamily level assignments. Additions have also been made to the interaction data. More significantly, to help with the visualization and interpretation of multi-genome scale data sets, we have developed a new, revamped website. Searching has been simplified with more sophisticated filtering of results, along with new tools based on Cytoscape Web, for visualizing protein-protein interaction networks, differences in domain composition between genomes and the taxonomic distribution of individual superfamilies.


Subjects
Databases, Protein; Molecular Sequence Annotation; Protein Interaction Maps; Protein Structure, Tertiary; Genomics; Proteins/chemistry; Proteins/classification; Proteins/genetics
8.
BMC Bioinformatics ; 14 Suppl 3: S5, 2013.
Article in English | MEDLINE | ID: mdl-23514456

ABSTRACT

Here we assessed the use of domain families for predicting the functions of whole proteins. These 'functional families' (FunFams) were derived using a protocol that combines sequence clustering with supervised cluster evaluation, relying on available high-quality Gene Ontology (GO) annotation data in the latter step. In essence, the protocol groups domain sequences belonging to the same superfamily into families based on the GO annotations of their parent proteins. An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. All target proteins were first submitted to domain superfamily assignment, followed by FunFam assignment and, eventually, function assignment. The latter included an integration step for multi-domain target proteins. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.


Subjects
Protein Structure, Tertiary; Proteins/physiology; Cluster Analysis; Databases, Protein; Molecular Sequence Annotation; Proteins/classification; Proteins/genetics; Sequence Analysis, Protein; Vocabulary, Controlled
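
The abstract above mentions an integration step for multi-domain target proteins but does not say how it works. As one possible illustration only, the Python sketch below combines per-domain GO-term probabilities with a noisy-OR rule, so that a term supported by several domains scores higher than a term supported by one. The noisy-OR choice, the function name and the probabilities are assumptions, not the published method.

```python
def integrate_domain_predictions(domain_predictions):
    """Combine per-domain GO-term probabilities with a noisy-OR rule.

    domain_predictions: list of dicts, one per domain,
    each mapping GO term -> probability in [0, 1].
    """
    combined = {}
    for predictions in domain_predictions:
        for term, prob in predictions.items():
            # Probability that no domain seen so far supports the term.
            not_supported = 1.0 - combined.get(term, 0.0)
            combined[term] = 1.0 - not_supported * (1.0 - prob)
    return combined


# Hypothetical two-domain protein; the probabilities are illustrative only.
two_domain_protein = [
    {"GO:0003824": 0.5, "GO:0005524": 0.8},
    {"GO:0003824": 0.5},
]
protein_scores = integrate_domain_predictions(two_domain_protein)
```

Here GO:0003824 is supported by both domains at 0.5 each and rises to 0.75, while GO:0005524, supported by a single domain, stays at 0.8.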
9.
Nucleic Acids Res ; 39(Database issue): D420-6, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21097779

ABSTRACT

CATH version 3.3 (class, architecture, topology, homology) contains 128,688 domains, 2386 homologous superfamilies and 1233 fold groups, and reflects a major focus on classifying structural genomics (SG) structures and transmembrane proteins, both of which are likely to add structural novelty to the database and therefore increase the coverage of protein fold space within CATH. For CATH version 3.4 we have significantly improved the presentation of sequence information and associated functional information for CATH superfamilies. The CATH superfamily pages now reflect both the functional and structural diversity within the superfamily and include structural alignments of close and distant relatives within the superfamily, annotated with functional information and details of conserved residues. A significantly more efficient search function for CATH has been established by implementing the search server Solr (http://lucene.apache.org/solr/). The CATH v3.4 webpages have been built using the Catalyst web framework.


Subjects
Databases, Protein; Protein Structure, Tertiary; Phylogeny; Protein Folding; Proteins/chemistry; Proteins/classification
10.
Nucleic Acids Res ; 38(3): 720-37, 2010 Jan.
Article in English | MEDLINE | ID: mdl-19923231

ABSTRACT

GeMMA (Genome Modelling and Model Annotation) is a new approach to automatic functional subfamily classification within families and superfamilies of protein sequences. A major advantage of GeMMA is its ability to subclassify very large and diverse superfamilies with tens of thousands of members, without the need for an initial multiple sequence alignment. Its performance is shown to be comparable to the established high-performance method SCI-PHY. GeMMA follows an agglomerative clustering protocol that uses existing software for sensitive and accurate multiple sequence alignment and profile-profile comparison. The produced subfamilies are shown to be equivalent in quality whether whole protein sequences are used or just the sequences of component predicted structural domains. A faster, heuristic version of GeMMA that also uses distributed computing is shown to maintain the performance levels of the original implementation. The use of GeMMA to increase the functional annotation coverage of functionally diverse Pfam families is demonstrated. It is further shown how GeMMA clusters can help to predict the impact of experimentally determining a protein domain structure on comparative protein modelling coverage, in the context of structural genomics.


Subjects
Algorithms; Protein Structure, Tertiary; Benchmarking; Classification/methods; Models, Chemical; Proteins/classification; Sequence Analysis, Protein
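
The agglomerative clustering protocol mentioned in the abstract above can be illustrated generically. The Python sketch below is not GeMMA: it replaces the profile-profile comparison with a toy per-position identity score, uses average linkage and stops at a fixed similarity threshold, all of which are assumptions made for illustration.

```python
from itertools import combinations


def agglomerative_clusters(items, similarity, threshold):
    """Greedy agglomerative clustering with average linkage.

    Repeatedly merges the two most similar clusters until no pair
    of clusters reaches `threshold`.
    """
    clusters = [[item] for item in items]

    def cluster_similarity(a, b):
        # Average similarity over all cross-cluster pairs.
        pairs = [(x, y) for x in a for y in b]
        return sum(similarity(x, y) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        best_score, i, j = max(
            (cluster_similarity(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best_score < threshold:
            break
        # i < j is guaranteed by combinations, so deleting j keeps i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def toy_identity(a, b):
    """Hypothetical similarity: fraction of identical positions."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


sequences = ["AAAA", "AAAT", "GGGG", "GGGC"]
subfamilies = agglomerative_clusters(sequences, toy_identity, threshold=0.5)
```

The pairwise search is quadratic in the number of clusters at every step, which is acceptable for a toy example but would not scale to the tens of thousands of sequences the abstract refers to.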
11.
Trends Biotechnol ; 27(4): 210-9, 2009 Apr.
Article in English | MEDLINE | ID: mdl-19251332

ABSTRACT

Advances in experimental and computational methods have quietly ushered in a new era in protein function annotation. This 'age of multiplicity' is marked by the notion that only the use of multiple tools, multiple evidence and considering the multiple aspects of function can give us the broad picture that 21st century biology will need to link and alter micro- and macroscopic phenotypes. It might also help us to undo past mistakes by removing errors from our databases and prevent us from producing more. On the downside, multiplicity is often confusing. We therefore systematically review methods and resources for automated protein function prediction, looking at individual (biochemical) and contextual (network) functions, respectively.


Subjects
Computational Biology/methods; Proteins/physiology; Artificial Intelligence; Databases, Protein; Pattern Recognition, Automated; Phylogeny; Proteins/chemistry
13.
Sci Rep ; 7: 39194, 2017 Jan 04.
Article in English | MEDLINE | ID: mdl-28051068

ABSTRACT

The reliable detection of novel bacterial pathogens from next-generation sequencing data is a key challenge for microbial diagnostics. Current computational tools usually rely on sequence similarity and often fail to detect novel species when closely related genomes are unavailable or missing from the reference database. Here we present the machine learning based approach PaPrBaG (Pathogenicity Prediction for Bacterial Genomes). PaPrBaG overcomes genetic divergence by training on a wide range of species with known pathogenicity phenotype. To that end we compiled a comprehensive list of pathogenic and non-pathogenic bacteria with human host, using various genome metadata in conjunction with a rule-based protocol. A detailed comparative study reveals that PaPrBaG has several advantages over sequence similarity approaches. Most importantly, it always provides a prediction whereas other approaches discard a large number of sequencing reads with low similarity to currently known reference genomes. Furthermore, PaPrBaG remains reliable even at very low genomic coverages. Combining PaPrBaG with existing approaches further improves prediction results.


Subjects
Bacteria/isolation & purification; Bacterial Infections/etiology; Computational Biology/methods; High-Throughput Nucleotide Sequencing; Machine Learning; Sequence Analysis, DNA/methods; Bacteria/genetics; Humans
14.
J Mol Biol ; 387(2): 416-30, 2009 Mar 27.
Article in English | MEDLINE | ID: mdl-19135455

ABSTRACT

Divergence in function of homologous proteins is based on both sequence and structural changes. Overall enzyme function has been reported to diverge earlier (50% sequence identity) than overall structure (35%). We herein study the functional conservation of enzymes and non-enzyme sequences using the protein domain families in CATH-Gene3D. Despite the rapid increase in sequence data since the last comprehensive study by Tian and Skolnick, our findings suggest that generic thresholds of 40% and 60% aligned sequence identity are still sufficient to safely inherit third-level and full Enzyme Commission numbers, respectively. This increases to 50% and 70% on the domain level, unless the multi-domain architecture matches. Assignments from the Kyoto Encyclopedia of Genes and Genomes and the Munich Information Center for Protein Sequences Functional Catalogue seem to be less conserved with sequence, probably due to a more pathway-centric view: 80% domain sequence identity is required for safe function transfer. Comparing domains (more pairwise relationships) and the use of family-specific thresholds (varying evolutionary speeds) yields the highest coverage rates when transferring functions to model proteomes. An average twofold increase in enzyme annotations is seen for 523 proteomes in Gene3D. As simple 'rules of thumb', sequence identity thresholds do not require a bioinformatics background. We will provide and update this information with future releases of CATH-Gene3D.


Subjects
Proteins/chemistry; Proteins/metabolism; Sequence Analysis, Protein; Amino Acid Sequence; Enzymes/chemistry; Enzymes/metabolism; Genome/genetics; Models, Biological; Multigene Family; Protein Structure, Tertiary; Proteome/chemistry; Proteome/metabolism; Sequence Homology, Amino Acid
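
The generic 'rules of thumb' in the abstract above lend themselves to a small lookup function. The Python sketch below encodes the quoted thresholds: 40% and 60% aligned sequence identity for third-level and full Enzyme Commission (EC) numbers, rising to 50% and 70% for domain-level comparisons unless the multi-domain architecture matches. Reading "unless the multi-domain architecture matches" as a fallback to the 40%/60% thresholds, and treating the thresholds as inclusive, are assumptions; the function names are hypothetical.

```python
def transferable_ec_levels(identity_percent, domain_level=False,
                           architecture_matches=False):
    """Return how many EC levels (0, 3 or 4) may be inherited.

    Thresholds follow the generic values quoted in the abstract;
    treating them as inclusive (>=) is an assumption.
    """
    if domain_level and not architecture_matches:
        third_level, full = 50.0, 70.0
    else:
        third_level, full = 40.0, 60.0
    if identity_percent >= full:
        return 4
    if identity_percent >= third_level:
        return 3
    return 0


def truncate_ec(ec_number, levels):
    """Keep the first `levels` fields of an EC number, padding with '-'."""
    if levels == 0:
        return None
    parts = ec_number.split(".")
    return ".".join(parts[:levels] + ["-"] * (4 - levels))


# A 65% identical domain hit with a non-matching architecture
# only supports a third-level transfer under these thresholds.
levels = transferable_ec_levels(65.0, domain_level=True)
inherited = truncate_ec("3.4.21.4", levels)
```

The abstract also reports that family-specific thresholds give higher coverage than these generic values, so a fuller version would look the thresholds up per family rather than hard-coding them.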