Search | BVS - MINISTÉRIO DA SAÚDE

Protein family annotation for the Unified Human Gastrointestinal Proteome by DPCfam clustering.

Barone, Federico; Russo, Elena Tea; Villegas Garcia, Edith Natalia; Punta, Marco; Cozzini, Stefano; Ansuini, Alessio; Cazzaniga, Alberto.

Sci Data ; 11(1): 568, 2024 Jun 01.

Article in English | MEDLINE | ID: mdl-38824125

ABSTRACT

Technological advances in massively parallel sequencing have led to an exponential growth in the number of known protein sequences. Much of this growth originates from metagenomic projects producing new sequences from environmental and clinical samples. The Unified Human Gastrointestinal Proteome (UHGP) catalogue is one of the most relevant metagenomic datasets with applications ranging from medicine to biology. However, the low levels of sequence annotation may impair its usability. This work aims to produce a family classification of UHGP sequences to facilitate downstream structural and functional annotation. This is achieved through the release of the DPCfam-UHGP50 dataset containing 10,778 putative protein families generated using DPCfam clustering, an unsupervised pipeline grouping sequences into single or multi-domain architectures. DPCfam-UHGP50 considerably improves family coverage at protein and residue levels compared to the manually curated repository Pfam. In the hope that DPCfam-UHGP50 will foster future discoveries in the field of metagenomics of the human gut, we release a FAIR-compliant database of our results that is easily accessible via a searchable web server and Zenodo repository.

Subjects

Proteome; Humans; Gastrointestinal Tract/metabolism; Cluster Analysis; Molecular Sequence Annotation; Metagenomics; Databases, Protein

DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets.

Russo, Elena Tea; Barone, Federico; Bateman, Alex; Cozzini, Stefano; Punta, Marco; Laio, Alessandro.

PLoS Comput Biol ; 18(10): e1010610, 2022 Oct.

Article in English | MEDLINE | ID: mdl-36260616

ABSTRACT

Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 TeraBytes. The modified pipeline, which we call DPCfam, finds ~45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.

Subjects

Proteins; Databases, Protein; Proteins/genetics; Cluster Analysis; Amino Acid Sequence; Protein Domains

Density Peak clustering of protein sequences associated to a Pfam clan reveals clear similarities and interesting differences with respect to manual family annotation.

Russo, Elena Tea; Laio, Alessandro; Punta, Marco.

BMC Bioinformatics ; 22(1): 121, 2021 Mar 12.

Article in English | MEDLINE | ID: mdl-33711918

ABSTRACT

BACKGROUND: The identification of protein families is of outstanding practical importance for in silico protein annotation and is at the basis of several bioinformatic resources. Pfam is possibly the most well known protein family database, built in many years of work by domain experts with extensive use of manual curation. This approach is generally very accurate, but it is quite time consuming and it may suffer from a bias generated from the hand-curation itself, which is often guided by the available experimental evidence. RESULTS: We introduce a procedure that aims to identify automatically putative protein families. The procedure is based on Density Peak Clustering and uses as input only local pairwise alignments between protein sequences. In the experiment we present here, we ran the algorithm on about 4000 full-length proteins with at least one domain classified by Pfam as belonging to the Pseudouridine synthase and Archaeosine transglycosylase (PUA) clan. We obtained 71 automatically-generated sequence clusters with at least 100 members. While our clusters were largely consistent with the Pfam classification, showing good overlap with either single or multi-domain Pfam family architectures, we also observed some inconsistencies. The latter were inspected using structural and sequence based evidence, which suggested that the automatic classification captured evolutionary signals reflecting non-trivial features of protein family architectures. Based on this analysis we identified a putative novel pre-PUA domain as well as alternative boundaries for a few PUA or PUA-associated families. As a first indication that our approach was unlikely to be clan-specific, we performed the same analysis on the P53 clan, obtaining comparable results. 
CONCLUSIONS: The clustering procedure described in this work takes advantage of the information contained in a large set of pairwise alignments and successfully identifies a set of putative families and family architectures in an unsupervised manner. Comparison with the Pfam classification highlights significant overlap and points to interesting differences, suggesting that our new algorithm could have potential in applications related to automatic protein classification. Testing this hypothesis, however, will require further experiments on large and diverse sequence datasets.
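
The abstract above names Density Peak Clustering as the core of the procedure. As a rough illustration only, the following is a minimal sketch of the generic Density Peak Clustering idea applied to a precomputed distance matrix. It is not the authors' pipeline: the function name, the cutoff `d_c`, the fixed `n_clusters`, and the assumption that alignments have already been turned into a symmetric distance matrix are all hypothetical choices made for this example.

```python
from typing import List, Sequence


def density_peak_clustering(
    dist: Sequence[Sequence[float]], d_c: float, n_clusters: int
) -> List[int]:
    """Generic Density Peak Clustering sketch on a symmetric distance matrix.

    dist, d_c and n_clusters are illustrative inputs, not parameters
    taken from the published pipeline.
    """
    n = len(dist)
    # Local density: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    # Visit points from highest to lowest density (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n   # distance to the nearest point visited earlier (denser)
    parent = [-1] * n   # index of that nearest denser point
    for rank, i in enumerate(order):
        if rank == 0:
            # Convention for the densest point: use its largest distance.
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        parent[i] = j
    # Cluster centers: points with the largest rho * delta.
    centers = sorted(range(n), key=lambda i: (-rho[i] * delta[i], i))[:n_clusters]
    labels = [-1] * n
    for label, i in enumerate(centers):
        labels[i] = label
    # Every other point inherits the label of its nearest denser point.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels
```

Choosing centers by a fixed count keeps the sketch short; in practice the centers are usually picked by inspecting which points combine high density with a large distance from any denser point.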

Subjects

Proteins; Sequence Alignment; Amino Acid Sequence; Cluster Analysis; Databases, Protein; Humans; Proteins/genetics

The intrinsic dimension of protein sequence evolution.

Facco, Elena; Pagnani, Andrea; Russo, Elena Tea; Laio, Alessandro.

PLoS Comput Biol ; 15(4): e1006767, 2019 Apr.

Article in English | MEDLINE | ID: mdl-30958823

ABSTRACT

It is well known that, in order to preserve its structure and function, a protein cannot change its sequence at random, but only by mutations occurring preferentially at specific locations. We here investigate quantitatively the amount of variability that is allowed in protein sequence evolution, by computing the intrinsic dimension (ID) of the sequences belonging to a selection of protein families. The ID is a measure of the number of independent directions that evolution can take starting from a given sequence. We find that the ID is practically constant for sequences belonging to the same family, and moreover it is very similar in different families, with values ranging between 6 and 12. These values are significantly smaller than the raw number of amino acids, confirming the importance of correlations between mutations in different sites. However, we demonstrate that correlations are not sufficient to explain the small value of the ID we observe in protein families. Indeed, we show that the ID of a set of protein sequences generated by maximum entropy models, an approach in which correlations are accounted for, is typically significantly larger than the value observed in natural protein families. We further prove that a critical factor to reproduce the natural ID is to take into consideration the phylogeny of sequences.
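
To make the notion of intrinsic dimension more concrete, here is a minimal sketch of one nearest-neighbour estimator, the maximum-likelihood form of the two-nearest-neighbours (TwoNN) approach. The abstract does not state which estimator or which distance between sequences was used, so both the estimator and the Euclidean distance below are assumptions made for illustration, and the function name is hypothetical.

```python
import math
from typing import List, Sequence


def two_nn_intrinsic_dimension(points: Sequence[Sequence[float]]) -> float:
    """Estimate intrinsic dimension from first and second neighbour distances.

    For each point, mu = r2 / r1 is the ratio of the distances to its second
    and first nearest neighbours. Under the TwoNN model mu follows a Pareto
    law whose exponent is the intrinsic dimension d, which gives the
    maximum-likelihood estimate d = N / sum(log(mu)).
    """
    log_ratios: List[float] = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0.0:  # skip exact duplicates, whose ratio is undefined
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)
```

Applied to points sampled along a straight line embedded in three dimensions, the estimate should come out close to 1 rather than 3, which is the sense in which the measure counts independent directions rather than raw coordinates.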

Subjects

Evolution, Molecular; Proteins/chemistry; Proteins/genetics; Amino Acid Sequence; Computational Biology; Databases, Protein/statistics & numerical data; Models, Molecular; Mutation; Phylogeny; Protein Conformation; Protein Folding; Proteins/classification; Sequence Homology, Amino Acid; Structural Homology, Protein
