Results 1 - 8 of 8
1.
Bioinformatics ; 39(10), 2023 10 03.
Article in English | MEDLINE | ID: mdl-37874958

ABSTRACT

MOTIVATION: The number of available protein sequences keeps growing, but only a small fraction has been manually annotated. For example, only 0.25% of all UniProtKB entries are reviewed by human annotators. Further developing automatic tools that infer protein function from sequence alone can close part of this gap. In this article, we investigate the potential of Transformer deep neural networks on a specific case of functional sequence annotation: the prediction of enzymatic classes. RESULTS: We show that our EnzBert transformer models, trained to predict Enzyme Commission (EC) numbers by specialization of a protein language model, outperform state-of-the-art tools for monofunctional enzyme class prediction based on sequences only. Accuracy is improved from 84% to 95% on the prediction of EC numbers at level two on the EC40 benchmark. To evaluate the prediction quality at level four, the most detailed level of EC numbers, we built two new time-based benchmarks for comparison with the state-of-the-art methods ECPred and DeepEC: the macro-F1 score is improved from 41% to 54% and from 20% to 26%, respectively. Finally, we also show that a simple combination of attention maps is on par with, or better than, other classical interpretability methods on the EC prediction task. More specifically, important residues identified by attention maps tend to correspond to known catalytic sites. Quantitatively, we report a max F-Gain score of 96.05%, while classical interpretability methods reach 91.44% at best. AVAILABILITY AND IMPLEMENTATION: Source code and datasets are available at https://gitlab.inria.fr/nbuton/tfpc and https://doi.org/10.5281/zenodo.7253910, respectively.
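The two evaluation measures mentioned in this abstract (accuracy at a given EC level and macro-F1 at level four) can be illustrated with a minimal sketch. This is not the EnzBert evaluation code: the function names and the toy labels are assumptions made for illustration, and the macro-F1 here averages over every class seen in either the true or the predicted labels, which may differ from the averaging used by the authors.

```python
def truncate_ec(ec_number, level):
    """Keep the first `level` fields of an EC number, e.g. '3.4.21.4' -> '3.4' at level 2."""
    return ".".join(ec_number.split(".")[:level])


def accuracy_at_level(true_ecs, predicted_ecs, level):
    """Fraction of sequences whose predicted EC number matches the true one up to `level`."""
    matches = sum(
        truncate_ec(t, level) == truncate_ec(p, level)
        for t, p in zip(true_ecs, predicted_ecs)
    )
    return matches / len(true_ecs)


def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of per-class F1 scores (assumption: classes = union of both label sets)."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    pairs = list(zip(true_labels, predicted_labels))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); the denominator is nonzero for any class in the union.
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(f1_scores) / len(f1_scores)


# Toy labels, invented for illustration only.
TRUE_ECS = ["1.1.1.1", "1.1.1.2", "2.7.11.1", "3.4.21.4"]
PREDICTED_ECS = ["1.1.1.1", "1.1.1.1", "2.7.11.1", "3.4.24.1"]

print(accuracy_at_level(TRUE_ECS, PREDICTED_ECS, level=2))
print(accuracy_at_level(TRUE_ECS, PREDICTED_ECS, level=4))
print(macro_f1(TRUE_ECS, PREDICTED_ECS))
```

On the toy labels, every prediction is right at level two while only half are right at level four, which shows why the two levels are reported separately. Macro-F1 gives rare classes the same weight as frequent ones, so it is usually much lower than accuracy on imbalanced label sets.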


Subject(s)
Proteins , Software , Humans , Proteins/chemistry , Neural Networks, Computer , Amino Acid Sequence , Knowledge Bases
2.
PLoS Comput Biol ; 19(8): e1011404, 2023 08.
Article in English | MEDLINE | ID: mdl-37651409

ABSTRACT

Numerous computational methods based on sequences or structures have been developed for the characterization of protein function, but they remain unsatisfactory for dealing with the multiple functions of multi-domain protein families. Here we propose an original approach based on 1) the detection of conserved sequence modules using partial local multiple alignment, 2) the phylogenetic inference of the evolutionary histories of species, genes, modules and functions, and 3) the identification of co-appearances of modules and functions. Applying our framework to the multi-domain ADAMTS-TSL family, including ADAMTS (A Disintegrin-like and Metalloproteinase with ThromboSpondin motif) and ADAMTS-like proteins, over nine species including human, we identify 45 sequence module signatures that are associated with the occurrence of 278 Protein-Protein Interactions in ancestral genes. Some of these signatures are supported by published experimental data and the others provide new insights (e.g. ADAMTS-5). The module signatures of ADAMTS ancestors notably highlight the dual variability of the propeptide and ancillary regions, suggesting the importance of these two regions in the specialization of ADAMTS during evolution. Our analyses further indicate convergent interactions of ADAMTS with COMP and CCN2 proteins. Overall, our study provides 186 sequence module signatures that discriminate distinct subgroups of ADAMTS and ADAMTSL and that may result from selective pressures on novel functions and phenotypes.


Subject(s)
Gene Regulatory Networks , Humans , Phylogeny , Conserved Sequence , Phenotype
3.
BMC Bioinformatics ; 22(1): 317, 2021 Jun 10.
Article in English | MEDLINE | ID: mdl-34112081

ABSTRACT

BACKGROUND: To assign structural and functional annotations to the ever-increasing number of sequenced proteins, the main approach relies on sequence-based homology search methods, e.g. BLAST or the current state-of-the-art methods based on profile Hidden Markov Models, which rely on significant alignments of query sequences to annotated proteins or protein families. While powerful, these approaches do not take coevolution between residues into account. Taking advantage of recent advances in the field of contact prediction, we propose here to represent proteins by Potts models, which model direct couplings between positions in addition to positional composition, and to compare proteins by aligning these models. Due to non-local dependencies, the problem of aligning Potts models is hard and remains the main computational bottleneck for their use. METHODS: We introduce here an Integer Linear Programming formulation of the problem and PPalign, a program based on this formulation, to compute the optimal pairwise alignment of Potts models representing proteins in tractable time. The approach is assessed with respect to a non-redundant set of reference pairwise sequence alignments from the SISYPHUS benchmark which have the lowest sequence identity (between [Formula: see text] and [Formula: see text]) and allow reliable Potts models to be built for each sequence to be aligned. This experimentation confirms that Potts models can be aligned in reasonable time ([Formula: see text] on average on these alignments). The contribution of couplings is evaluated in comparison with HHalign and independent-site PPalign. Although Potts models were not fully optimized for alignment purposes and simple gap scores were used, PPalign yields a better mean [Formula: see text] score and finds significantly better alignments than HHalign and PPalign without couplings in some cases. CONCLUSIONS: These results show that pairwise couplings from protein Potts models can be used to improve the alignment of remotely related protein sequences in tractable time. Our experimentation nevertheless suggests that new research on the inference of Potts models is now needed to make them more comparable and suitable for homology search. We think that PPalign's guaranteed optimality will be a powerful asset for performing unbiased investigations in this direction.
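To make the difference between an independent-site model and a Potts model concrete, the following minimal sketch scores a sequence with positional fields alone and then with fields plus pairwise couplings. It illustrates the generic Potts form only; it is not PPalign and does not implement the Integer Linear Programming alignment. All names and parameter values are invented for illustration.

```python
def potts_score(sequence, fields, couplings, use_couplings=True):
    """Score of a sequence: sum of positional fields plus, optionally, pairwise couplings.

    fields[i][a]            : weight of residue a at position i
    couplings[(i, j)][(a, b)]: weight of residues a, b co-occurring at positions i < j
    Missing entries count as 0.
    """
    score = sum(fields[i].get(residue, 0.0) for i, residue in enumerate(sequence))
    if use_couplings:
        for (i, j), table in couplings.items():
            score += table.get((sequence[i], sequence[j]), 0.0)
    return score


# Toy three-position model (assumed values): positions 0 and 2 prefer opposite charges.
FIELDS = [
    {"D": 0.5, "K": 0.5},
    {"G": 1.0},
    {"D": 0.5, "K": 0.5},
]
COUPLINGS = {
    (0, 2): {
        ("D", "K"): 1.0,
        ("K", "D"): 1.0,
        ("D", "D"): -1.0,
        ("K", "K"): -1.0,
    }
}

for seq in ("DGK", "DGD"):
    print(
        seq,
        potts_score(seq, FIELDS, COUPLINGS, use_couplings=False),
        potts_score(seq, FIELDS, COUPLINGS),
    )
```

With fields alone, both toy sequences receive the same score, because each position is judged in isolation. Only the coupling term separates the sequence with a compatible residue pair from the one without, which is the kind of coevolution signal that profile-based methods do not capture.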


Subject(s)
Algorithms , Proteins , Amino Acid Sequence , Humans , Proteins/genetics , Sequence Alignment , Sequence Homology
4.
Bioinformatics ; 32(9): 1405-7, 2016 05 01.
Article in English | MEDLINE | ID: mdl-26733451

ABSTRACT

MOTIVATION: Not only do sequence data continue to outpace annotation information, but the problem is further exacerbated when organisms are underrepresented in the annotation databases. This is the case with non-human-pathogenic viruses, which occur frequently in metagenomic projects. Thus, there is a need for tools capable of detecting and classifying viral sequences. RESULTS: We describe VIRALpro, a new effective tool for identifying capsid and tail protein sequences, which are the cornerstones of viral sequence annotation and viral genome classification. AVAILABILITY AND IMPLEMENTATION: The data, software and corresponding web server are available from http://scratch.proteomics.ics.uci.edu as part of the SCRATCH suite. CONTACT: clovis.galiez@inria.fr or pfbaldi@uci.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Capsid , Genome, Viral , Software , Amino Acid Sequence , Humans
5.
BMC Bioinformatics ; 16: 256, 2015 Aug 14.
Article in English | MEDLINE | ID: mdl-26268224

ABSTRACT

BACKGROUND: In structural bioinformatics, there is an increasing interest in identifying and understanding the evolution of local protein structures regarded as key structural or functional protein building blocks. A central need is then to compare these, possibly short, fragments by measuring their (dis)similarity efficiently and accurately. Progress towards this goal has given rise to scores that assess the strong similarity of fragments. Yet, there is still a lack of more progressive scores, with meaningful intermediate values, for the comparison, retrieval or clustering of distantly related fragments. RESULTS: We introduce here the Amplitude Spectrum Distance (ASD), a novel way of comparing protein fragments based on the discrete Fourier transform of their Cα distance matrix. Defined as the distance between their amplitude spectra, ASD can be computed efficiently and provides a parameter-free measure of the global shape dissimilarity of two fragments. ASD inherits convenient theoretical properties, making it tolerant to shifts, insertions, deletions, circular permutations or sequence reversals while satisfying the triangle inequality. The practical interest of ASD with respect to RMSD, RMSDd, BC and TM scores is illustrated through zinc finger retrieval experiments and concrete structure examples. The benefits of ASD are also illustrated by two additional clustering experiments: domain linker fragments and complementarity-determining regions of antibodies. CONCLUSIONS: Taking advantage of the Fourier transform to compare fragments at a global shape level, ASD is an objective and progressive measure that takes whole fragments into account. Its practical computation time and its properties make ASD particularly relevant for applications requiring meaningful measures on distantly related protein fragments, such as the retrieval of similar fragments with high recall, as shown in the experiments, or for any application that also takes advantage of the triangle inequality, such as fragment clustering. The ASD program and source code are freely available at: http://www.irisa.fr/dyliss/public/ASD/.
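The idea of comparing fragments through the amplitude spectrum of their distance matrix can be sketched as follows. This is a simplified illustration and not the published ASD program: it assumes two fragments of equal length, uses a naive two-dimensional discrete Fourier transform, and takes the Euclidean distance between amplitude spectra without any normalization. The function names and toy coordinates are hypothetical.

```python
import cmath
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D points (e.g. C-alpha atoms)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def amplitude_spectrum(matrix):
    """Modulus of the naive 2D discrete Fourier transform of a square matrix."""
    n = len(matrix)
    spectrum = []
    for k in range(n):
        row = []
        for l in range(n):
            total = 0j
            for i in range(n):
                for j in range(n):
                    angle = -2 * math.pi * (k * i + l * j) / n
                    total += matrix[i][j] * cmath.exp(1j * angle)
            row.append(abs(total))
        spectrum.append(row)
    return spectrum


def amplitude_spectrum_distance(coords_a, coords_b):
    """Euclidean distance between amplitude spectra (assumption: equal-length fragments)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("this sketch only handles fragments of equal length")
    spec_a = amplitude_spectrum(distance_matrix(coords_a))
    spec_b = amplitude_spectrum(distance_matrix(coords_b))
    return math.sqrt(
        sum((x - y) ** 2 for ra, rb in zip(spec_a, spec_b) for x, y in zip(ra, rb))
    )


# Toy fragments: a helix-like curve, a circular permutation of it, its reversal, and a straight line.
HELIX = [(math.cos(t), math.sin(t), 0.5 * t) for t in range(6)]
PERMUTED = HELIX[2:] + HELIX[:2]
REVERSED = HELIX[::-1]
LINE = [(float(t), 0.0, 0.0) for t in range(6)]

print(amplitude_spectrum_distance(HELIX, PERMUTED))
print(amplitude_spectrum_distance(HELIX, REVERSED))
print(amplitude_spectrum_distance(HELIX, LINE))
```

A circular permutation or a reversal of the residue order only changes the phase of the Fourier coefficients of the distance matrix, so the first two distances are zero up to rounding, while the differently shaped fragment gives a clearly positive value.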


Subject(s)
Algorithms , Computational Biology/methods , Peptide Fragments/chemistry , Proteins/chemistry , Sequence Analysis, Protein/methods , Cluster Analysis , Humans
6.
Nucleic Acids Res ; 41(Database issue): D396-401, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23175607

ABSTRACT

CyanoLyase (http://cyanolyase.genouest.org/) is a manually curated sequence and motif database of phycobilin lyases and related proteins. These enzymes catalyze the covalent ligation of chromophores (phycobilins) to specific binding sites of phycobiliproteins (PBPs). The latter constitute the building blocks of phycobilisomes, the major light-harvesting systems of cyanobacteria and red algae. Phycobilin lyase sequences are poorly annotated in public databases. Sequences included in CyanoLyase were retrieved from all available genomes of these organisms and a few others by similarity searches using biochemically characterized enzyme sequences and then classified into 3 clans and 32 families. Amino acid motifs were computed for each family using Protomata learner. CyanoLyase also includes BLAST and a novel pattern matching tool (Protomatch) that allow users to rapidly retrieve and annotate lyases from any new genome. In addition, it provides phylogenetic analyses of all phycobilin lyase families, describes their function and their presence/absence in all genomes of the database (phyletic profiles), and predicts the chromophorylation of PBPs in each strain. The site also includes a thorough bibliography about phycobilin lyases and genomes included in the database. This resource should be useful to scientists and companies interested in natural or artificial PBPs, which have a number of biotechnological applications, notably as fluorescent markers.


Subject(s)
Databases, Protein , Lyases/chemistry , Phycobilins/metabolism , Phycobiliproteins/metabolism , Amino Acid Motifs , Cyanobacteria/enzymology , Internet , Lyases/classification , Lyases/genetics , Lyases/physiology , Molecular Sequence Annotation , Rhodophyta/enzymology , Sequence Analysis, Protein , Software
7.
Sci Rep ; 10(1): 8467, 2020 05 21.
Article in English | MEDLINE | ID: mdl-32439871

ABSTRACT

Staphylococcus aureus is an important opportunistic pathogen of humans and animals. It produces extracellular vesicles (EVs) that are involved in cellular communication and enable inter-kingdom crosstalk, the delivery of virulence factors and modulation of the host immune response. The protein content of EVs determines their biological functions. Clarifying which proteins are selected, and how, is of crucial value to understanding the role of EVs in pathogenesis and the development of molecular delivery systems. Here, we postulated that S. aureus EVs share a common proteome containing components involved in cargo sorting. The EV proteomes of five S. aureus strains originating from human, bovine, and ovine hosts were characterised. The clustering of EV proteomes reflected the diversity of the producing strains. A total of 253 proteins were identified, 119 of which composed a core EV proteome with functions in bacterial survival, pathogenesis, and putatively in EV biology. We also identified features in the sequences of EV proteins and the corresponding genes that could account for their packaging into EVs. Our findings corroborate the hypothesis of a selective sorting of proteins into EVs and offer new perspectives concerning the roles of EVs in S. aureus pathogenesis in specific host niches.
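The notion of a core EV proteome can be illustrated with a small set computation. This sketch assumes that "core" means present in the EVs of every strain studied, which the abstract does not state explicitly. The strain names and protein identifiers are invented placeholders, not data from the study.

```python
def core_and_accessory(proteomes):
    """Split the proteins seen across strains into core (in all strains) and accessory (the rest).

    proteomes: mapping from strain name to an iterable of protein identifiers.
    """
    if not proteomes:
        return set(), set()
    protein_sets = [set(proteins) for proteins in proteomes.values()]
    all_proteins = set().union(*protein_sets)
    core = set.intersection(*protein_sets)
    return core, all_proteins - core


# Hypothetical EV proteomes for three made-up strains.
EV_PROTEOMES = {
    "strain_human": {"protA", "protB", "protC", "protD"},
    "strain_bovine": {"protA", "protB", "protE"},
    "strain_ovine": {"protA", "protB", "protC"},
}

core, accessory = core_and_accessory(EV_PROTEOMES)
print(sorted(core))
print(sorted(accessory))
```

Under this definition, the core shrinks or stays the same as strains are added, so the size of a core proteome always depends on which strains were sampled.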


Subject(s)
Bacterial Proteins/metabolism , Biomarkers/metabolism , Extracellular Vesicles/metabolism , Proteome/analysis , Proteome/metabolism , Staphylococcal Infections/microbiology , Staphylococcus aureus/metabolism , Animals , Cattle , Humans , Sheep , Staphylococcus aureus/growth & development , Staphylococcus aureus/isolation & purification
8.
PeerJ ; 7: e6559, 2019.
Article in English | MEDLINE | ID: mdl-30918754

ABSTRACT

Interactions between amino acids that are close in the spatial structure, but not necessarily in the sequence, play important structural and functional roles in proteins. These non-local interactions ought to be taken into account when modeling collections of proteins. Yet the most popular representations of sets of related protein sequences remain the profile Hidden Markov Models. By modeling independently the distributions of the conserved columns from an underlying multiple sequence alignment of the proteins, these models are unable to capture dependencies between the protein residues. Non-local interactions can be represented by using more expressive grammatical models. However, learning such grammars is difficult. In this work, we propose to use information on protein contacts to facilitate the training of probabilistic context-free grammars representing families of protein sequences. We develop the theory behind the introduction of contact constraints in maximum-likelihood and contrastive estimation schemes and implement it in a machine learning framework for protein grammars. The proposed framework is tested on samples of protein motifs in comparison with learning without contact constraints. The evaluation shows high fidelity of grammatical descriptors to protein structures and improved precision in recognizing sequences. Finally, we present an example of using our method in a practical setting and demonstrate its potential beyond the current state of the art by creating a grammatical model of a meta-family of protein motifs. We conclude that the current piece of research is a significant step towards more flexible and accurate modeling of collections of protein sequences. The software package is made available to the community.
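As an illustration of how a probabilistic context-free grammar can express dependencies between distant positions, here is a minimal sketch of the inside (CYK-style) algorithm for a grammar in Chomsky normal form. It does not reproduce the contact-constrained training described in the abstract; the toy grammar, its probabilities and the two-letter alphabet are assumptions chosen so that each `a` must pair with a matching `b` further along the sequence.

```python
from collections import defaultdict


def inside_probability(sequence, lexical_rules, binary_rules, start="S"):
    """Probability that `start` derives `sequence` under a PCFG in Chomsky normal form.

    lexical_rules: {(lhs, terminal): probability}
    binary_rules : {(lhs, left, right): probability}
    """
    n = len(sequence)
    if n == 0:
        return 0.0
    # inside[(i, j)][X] = probability that nonterminal X derives sequence[i:j]
    inside = defaultdict(lambda: defaultdict(float))
    for i, symbol in enumerate(sequence):
        for (lhs, terminal), prob in lexical_rules.items():
            if terminal == symbol:
                inside[(i, i + 1)][lhs] += prob
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for (lhs, left, right), prob in binary_rules.items():
                    inside[(i, j)][lhs] += (
                        prob * inside[(i, k)][left] * inside[(k, j)][right]
                    )
    return inside[(0, n)][start]


# Toy grammar generating nested a...b pairs: S -> A B (0.6) | A T (0.4), T -> S B (1.0).
LEXICAL_RULES = {("A", "a"): 1.0, ("B", "b"): 1.0}
BINARY_RULES = {
    ("S", "A", "B"): 0.6,
    ("S", "A", "T"): 0.4,
    ("T", "S", "B"): 1.0,
}

for seq in ("ab", "aabb", "abab"):
    print(seq, inside_probability(seq, LEXICAL_RULES, BINARY_RULES))
```

The nested sequence `aabb` gets a nonzero probability while `abab` gets zero, even though both have the same composition. A model that treats alignment columns independently cannot make this distinction, which is the motivation for grammatical models given in the abstract.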
