Results 1 - 18 of 18
1.
Int J Mol Sci ; 25(2), 2024 Jan 18.
Article in English | MEDLINE | ID: mdl-38256255

ABSTRACT

SpliceProt 2.0 is a public proteogenomics database that aims to list the sequence of known proteins and potential new proteoforms in human, mouse, and rat proteomes. This updated repository provides an even broader range of computationally translated proteins and serves, for example, to aid with proteomic validation of splice variants absent from the reference UniProtKB/SwissProt database. We demonstrate the value of SpliceProt 2.0 to predict orthologous proteins between humans and murines based on transcript reconstruction, sequence annotation and detection at the transcriptome and proteome levels. In this release, the annotation data used in the reconstruction of transcripts based on the methodology of ternary matrices were acquired from new databases such as Ensembl, UniProt, and APPRIS. Another innovation implemented in the pipeline is the exclusion of transcripts predicted to be susceptible to degradation through the NMD pathway. Taken together, our repository and its applications represent a valuable resource for the proteogenomics community.


Subjects
Proteogenomics , Proteomics , Rats , Mice , Humans , Animals , Databases, Protein , Knowledge Bases , Proteome/genetics
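The abstract above mentions excluding transcripts predicted to be susceptible to NMD but does not say which rule the pipeline applies. As a rough illustration only, the sketch below assumes the widely cited "50-nucleotide rule": a transcript is flagged when its stop codon ends more than about 50 nt upstream of the last exon-exon junction. The function name, its parameters and the threshold are hypothetical and are not taken from SpliceProt.

```python
def is_nmd_candidate(exon_lengths, stop_codon_end, min_distance=50):
    """Flag a transcript as a likely NMD target under the 50-nt rule.

    exon_lengths   -- exon lengths in transcript (5'->3') order
    stop_codon_end -- 1-based transcript coordinate of the last base
                      of the stop codon
    min_distance   -- assumed threshold in nucleotides (hypothetical default)
    """
    if len(exon_lengths) < 2:
        # No exon-exon junction, so the rule cannot apply.
        return False
    # Transcript coordinate of the last base before the final junction.
    last_junction = sum(exon_lengths[:-1])
    return last_junction - stop_codon_end > min_distance
```

For example, with exons of 200, 150 and 300 nt the last junction sits at position 350, so a stop codon ending at position 250 would be flagged while one ending at position 320 would not.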
2.
Brief Bioinform ; 20(2): 463-470, 2019 03 22.
Article in English | MEDLINE | ID: mdl-29040399

ABSTRACT

Protein databases are growing steadily, driven by the spread of new, more efficient sequencing techniques. This growth is dominated by an increase in redundancy (homologous proteins with various degrees of sequence similarity) and by the inability to process and curate sequence entries as fast as they are created. To understand these trends and aid bioinformatic resources that might be compromised by the increasing size of the protein sequence databases, we have created a less-redundant protein data set. In parallel, we analyzed the evolution of protein sequence databases in terms of size and redundancy. While the SwissProt database has decelerated its growth, mostly because of a focus on increasing the level of annotation of its sequences, its counterpart TrEMBL, much less limited by curation steps, is still in a phase of accelerated growth. However, we predict that before 2020, almost all entries deposited in UniProtKB will be homologous to known proteins. We propose that new sequencing projects can be made more useful if they are directed toward sequencing voids, parts of the tree of life far from already sequenced species or model organisms. We show that these voids are present in the Archaea and Eukarya domains of life. The near certainty that new protein sequence entries will be redundant suggests that most of the protein diversity on Earth has already been described; we estimate this diversity at around 3.75 million proteins, revising down the prediction we made a decade ago.


Subjects
Databases, Protein , Proteins/analysis , Proteome/analysis , Sequence Analysis, Protein/methods , Animals , Computational Biology , Humans , Knowledge Bases , Proteins/classification , Software
3.
Prog Med Chem ; 60: 273-343, 2021.
Article in English | MEDLINE | ID: mdl-34147204

ABSTRACT

Molecular docking has become an important component of the drug discovery process. Since first being developed in the 1980s, advancements in the power of computer hardware and the increasing number of and ease of access to small molecule and protein structures have contributed to the development of improved methods, making docking more popular in both industrial and academic settings. Over the years, the modalities by which docking is used to assist the different tasks of drug discovery have changed. Although initially developed and used as a standalone method, docking is now mostly employed in combination with other computational approaches within integrated workflows. Despite its invaluable contribution to the drug discovery process, molecular docking is still far from perfect. In this chapter we will provide an introduction to molecular docking and to the different docking procedures with a focus on several considerations and protocols, including protonation states, active site waters and consensus, that can greatly improve the docking results.


Subjects
Drug Discovery/methods , Molecular Docking Simulation , Proteins/chemistry , Proteins/metabolism , Protein Binding , Protein Conformation , Structure-Activity Relationship
4.
Proteomics ; 19(24): e1800429, 2019 12.
Article in English | MEDLINE | ID: mdl-31578773

ABSTRACT

Lake trout are used as bioindicators for toxics exposure in the Great Lakes ecosystem. Here, the first lake trout (Salvelinus namaycush) liver proteomics study is performed, and the data are searched against specific NCBI and UniProtKB databases: Salvelinus, Salmonidae, Actinopterygii, Oncorhynchus mykiss, and the more distant relative Danio rerio. In biological replicate 1 (BR1), technical replicate 1 (TR1) (BR1TR1), a large number of lake trout liver proteins are not in the Salvelinus protein database, suggesting that lake trout liver proteins have homology to some proteins from the Salmonidae family and Actinopterygii class, and to Oncorhynchus mykiss and Danio rerio, two more highly studied fish. In the NCBI search, 4194 proteins are identified: 3069 proteins in Actinopterygii, 1617 in Salmonidae, 68 in Salvelinus, 568 in Oncorhynchus mykiss, and 946 in Danio rerio protein databases. Similar results are observed in the UniProtKB searches of BR1TR1, as well as in a technical replicate (BR1TR2), and then in a second biological replicate experiment with two technical replicates (BR2TR1 and BR2TR2). This study opens the possibility of identifying evolutionary relationships (i.e., adaptive mutations) between various groups (i.e., zebrafish, rainbow trout, Salmonidae, Salvelinus and lake trout) through evolutionary proteomics. Data are available via PRIDE (PXD011924).


Subjects
Evolution, Molecular , Fish Proteins/metabolism , Liver/metabolism , Proteome/analysis , Proteomics/methods , Salmonidae/metabolism , Animals , Salmonidae/classification , Salmonidae/growth & development
5.
Subcell Biochem ; 89: 47-66, 2018.
Article in English | MEDLINE | ID: mdl-30378018

ABSTRACT

The current view of peroxisomes has changed dramatically, from human cell oddities to vital organelles that host several key metabolic pathways. To fulfil over 50 different enzymatic functions, human peroxisomes host either unique peroxisomal proteins or dual-localized proteins. The identification and characterization of the complete peroxisomal proteome in humans is important for diagnosis and treatment of patients with peroxisomal disorders as well as for uncovering novel peroxisomal functions and regulatory modules. Hence, here we compiled a comprehensive list of mammalian peroxisomal and peroxisome-associated proteins by curating results of several quantitative and non-quantitative proteomic studies together with entries in the UniProtKB and Compartments knowledge channel databases. Our analysis gives a holistic view of the mammalian peroxisomal proteome and brings to light potential new peroxisomal and peroxisome-associated proteins. We believe that this dataset represents a valuable surrogate map of the human peroxisomal proteome.


Subjects
Peroxisomes/metabolism , Proteome/analysis , Proteome/metabolism , Proteomics , Animals , Humans , Metabolic Networks and Pathways , Peroxisomal Disorders/metabolism
6.
BMC Bioinformatics ; 19(1): 422, 2018 Nov 12.
Article in English | MEDLINE | ID: mdl-30419809

ABSTRACT

BACKGROUND: The discovery of functionally conserved proteins is a difficult and important task in systems biology. Global network alignment provides a systematic framework to search for these proteins from multiple protein-protein interaction (PPI) networks. Although many web servers for network alignment exist, none allows users to perform global multiple network alignment on their own test datasets. RESULTS: Here, we developed a web server, WebNetCoffee, based on the NetCoffee algorithm, to search for a global network alignment from multiple networks. To build a series of online test datasets, we manually collected 218,339 proteins, 4,009,541 interactions and many other associated protein annotations from several public databases. All these datasets and alignment results are available for download, which supports algorithm comparison and downstream analyses. CONCLUSION: WebNetCoffee provides a versatile, interactive and user-friendly interface for easily running alignment tasks on both online datasets and users' test datasets, managing submitted jobs and visualizing the alignment results through a web browser. Additionally, our web server facilitates graphical visualization of induced subnetworks for a given protein and its neighborhood. To the best of our knowledge, it is the first web server that supports global alignment of multiple PPI networks. AVAILABILITY: http://www.nwpu-bioinformatics.com/WebNetCoffee.


Subjects
Computational Biology/methods , Protein Interaction Mapping/methods , Humans
7.
Molecules ; 22(12), 2017 Dec 15.
Article in English | MEDLINE | ID: mdl-29244774

ABSTRACT

Many studies have used position-specific scoring matrices (PSSM) profiles to characterize residues in protein structures and to predict a broad range of protein features. Moreover, PSSM profiles of Protein Data Bank (PDB) entries have been recalculated in many works for different purposes. Although the computational cost of calculating a single PSSM profile is affordable, many statistical studies or machine learning-based methods used thousands of profiles to achieve their goals, thereby leading to a substantial increase of the computational cost. In this work we present a new database compiling PSSM profiles for the proteins of the PDB. Currently, the database contains 333,532 protein chain profiles involving 123,135 different PDB entries.


Subjects
Databases, Protein , Position-Specific Scoring Matrices , Proteins/chemistry , Protein Conformation , Software
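To make the idea of a position-specific scoring matrix in this entry concrete, the following is a minimal sketch that derives log-odds scores from a small gapless alignment. It assumes a uniform background frequency and a simple additive pseudocount; profiles such as those stored in the database described above are normally produced by tools like PSI-BLAST with more elaborate weighting, so the function name and parameters here are purely illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    if not alignment or len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment must be non-empty with equal-length rows")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        # Count only standard residues; gaps and unknowns are ignored.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm
```

Calling `build_pssm(["ACD", "ACE", "AGD"])` gives a positive score for `A` in the first column and a negative score for residues never observed there, which is the behaviour a profile is meant to capture.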
8.
Adv Exp Med Biol ; 926: 77-91, 2016.
Article in English | MEDLINE | ID: mdl-27686807

ABSTRACT

Identification of mutant proteins in biological samples is one of the emerging areas of proteogenomics. Despite the fact that only a limited number of studies have been published up to now, it has the potential to recognize novel disease biomarkers that have unique structure and desirably high specificity. Such properties would identify mutant proteoforms related to diseases as optimal drug targets useful for future therapeutic strategies. While mass spectrometry has demonstrated its outstanding analytical power in proteomics, the most frequently applied bottom-up strategy is not suitable for the detection of mutant proteins if only databases with consensus sequences are searched. It is likely that many unassigned tandem mass spectra of tryptic peptides originate from single amino acid variants (SAAVs). To address this problem, a couple of protein databases have been constructed that include canonical and SAAV sequences, allowing for the observation of mutant proteoforms in mass spectral data for the first time. Since the resulting large search space may compromise the probability of identifications, a novel concept was proposed that included identification as well as verification strategies. Together with transcriptome based approaches, targeted proteomics appears to be a suitable method for the verification of initial identifications in databases and can also provide quantitative insights to expression profiles, which often reflect disease progression. Important applications in the field of mutant proteoform identification have already highlighted novel biomarkers in large-scale investigations.


Subjects
Databases, Protein/statistics & numerical data , Mutant Proteins/analysis , Mutation , Peptide Fragments/isolation & purification , Proteogenomics/methods , Amino Acid Sequence , Amino Acid Substitution , Humans , Mutant Proteins/genetics , Peptide Mapping , Polymorphism, Single Nucleotide , Proteogenomics/instrumentation , Proteolysis , Tandem Mass Spectrometry , Trypsin/chemistry
9.
Methods Mol Biol ; 2780: 129-138, 2024.
Article in English | MEDLINE | ID: mdl-38987467

ABSTRACT

Protein-protein interactions (PPIs) provide valuable insights for understanding the principles of biological systems and for elucidating causes of incurable diseases. One of the techniques used for computational prediction of PPIs is protein-protein docking calculations, and a variety of software has been developed. This chapter is a summary of software and databases used for protein-protein docking.


Subjects
Databases, Protein , Molecular Docking Simulation , Protein Interaction Mapping , Proteins , Software , Protein Interaction Mapping/methods , Proteins/chemistry , Proteins/metabolism , Computational Biology/methods , Protein Binding , Humans
10.
mSystems ; 8(4): e0067822, 2023 08 31.
Article in English | MEDLINE | ID: mdl-37350639

ABSTRACT

Metaproteomics, a method for untargeted, high-throughput identification of proteins in complex samples, provides functional information about microbial communities and can tie functions to specific taxa. Metaproteomics often generates less data than other omics techniques, but analytical workflows can be improved to increase usable data in metaproteomic outputs. Identification of peptides in the metaproteomic analysis is performed by comparing mass spectra of sample peptides to a reference database of protein sequences. Although these protein databases are an integral part of the metaproteomic analysis, few studies have explored how database composition impacts peptide identification. Here, we used cervicovaginal lavage (CVL) samples from a study of bacterial vaginosis (BV) to compare the performance of databases built using six different strategies. We evaluated broad versus sample-matched databases, as well as databases populated with proteins translated from metagenomic sequencing of the same samples versus sequences from public repositories. Smaller sample-matched databases performed significantly better, driven by the statistical constraints on large databases. Additionally, large databases attributed up to 34% of significant bacterial hits to taxa absent from the sample, as determined orthogonally by 16S rRNA gene sequencing. We also tested a set of hybrid databases which included bacterial proteins from NCBI RefSeq and translated bacterial genes from the samples. These hybrid databases had the best overall performance, identifying 1,068 unique human and 1,418 unique bacterial proteins, ~30% more than a database populated with proteins from typical vaginal bacteria and fungi. Our findings can help guide the optimal identification of proteins while maintaining statistical power for reaching biological conclusions. 
IMPORTANCE Metaproteomic analysis can provide valuable insights into the functions of microbial and cellular communities by identifying a broad, untargeted set of proteins. The databases used in the analysis of metaproteomic data influence results by defining what proteins can be identified. Moreover, the size of the database impacts the number of identifications after accounting for false discovery rates (FDRs). Few studies have tested the performance of different strategies for building a protein database to identify proteins from metaproteomic data, and those that have done so have largely focused on highly diverse microbial communities. We tested a range of databases on CVL samples and found that a hybrid sample-matched approach, using publicly available proteins from organisms present in the samples, as well as proteins translated from metagenomic sequencing of the samples, had the best performance. However, our results also suggest that public sequence databases will continue to improve as more bacterial genomes are published.


Subjects
Microbiota , Proteomics , Female , Humans , RNA, Ribosomal, 16S/genetics , Proteomics/methods , Microbiota/genetics , Bacterial Proteins/genetics , Peptides/metabolism , Bacteria
11.
Adv Biol (Weinh) ; 7(6): e2200232, 2023 06.
Article in English | MEDLINE | ID: mdl-36775876

ABSTRACT

Peptides have shown increasing advantages and significant clinical value in drug discovery and development. With the development of high-throughput technologies and artificial intelligence (AI), machine learning (ML) methods for discovering new lead peptides have been expanded and incorporated into rational drug design. Predictions of peptide-protein interactions (PepPIs) and protein-protein interactions (PPIs) are both opportunities and challenges in computational biology, which will help to better understand the mechanisms of disease and provide the impetus for the discovery of lead peptides. This paper comprehensively reviews computational models for PepPI and PPI predictions. It begins with an introduction to various databases of peptide ligands and target proteins. Then it discusses data formats and feature representations for proteins and peptides. Furthermore, classical ML methods and emerging deep learning (DL) methods that can be used to train prediction models of PepPI and PPI are classified into four categories, and their advantages and disadvantages are analyzed. To assess the relative performance of different models, different validation protocols and evaluation indexes are discussed. The goal of this review is to help researchers quickly get started developing computational frameworks using these integrated resources and eventually promote the discovery of lead peptides.


Subjects
Artificial Intelligence , Peptides , Proteins/metabolism , Machine Learning , Drug Discovery
12.
Genome Biol ; 23(1): 132, 2022 06 20.
Article in English | MEDLINE | ID: mdl-35725496

ABSTRACT

BACKGROUND: Proteogenomics aims to identify variant or unknown proteins in bottom-up proteomics, by searching transcriptome- or genome-derived custom protein databases. However, empirical observations reveal that these large proteogenomic databases produce lower-sensitivity peptide identifications. Various strategies have been proposed to avoid this, including the generation of reduced transcriptome-informed protein databases, which only contain proteins whose transcripts are detected in the sample-matched transcriptome. These were found to increase peptide identification sensitivity. Here, we present a detailed evaluation of this approach. RESULTS: We establish that the increased sensitivity in peptide identification is in fact a statistical artifact, directly resulting from the limited capability of target-decoy competition to accurately model incorrect target matches when using excessively small databases. As anti-conservative false discovery rates (FDRs) are likely to hamper the robustness of the resulting biological conclusions, we advocate for alternative FDR control methods that are less sensitive to database size. Nevertheless, reduced transcriptome-informed databases are useful, as they reduce the ambiguity of protein identifications, yielding fewer shared peptides. Furthermore, searching the reference database and subsequently filtering proteins whose transcripts are not expressed reduces protein identification ambiguity to a similar extent, but is more transparent and reproducible. CONCLUSIONS: In summary, using transcriptome information is an interesting strategy that has not been promoted for the right reasons. While the increase in peptide identifications from searching reduced transcriptome-informed databases is an artifact caused by the use of an FDR control method unsuitable to excessively small databases, transcriptome information can reduce the ambiguity of protein identifications.


Subjects
Proteogenomics , Proteomics , Databases, Protein , Eukaryota , Peptides , Proteins , Proteogenomics/methods , Proteomics/methods , Transcriptome
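Because this entry turns on how target-decoy competition (TDC) estimates the false discovery rate, a small sketch of one common formulation may help. It assumes each peptide-spectrum match (PSM) is already the winner of a target-versus-decoy competition and estimates the FDR at a score cutoff as (decoys + 1) / targets. The abstract does not state which estimator the authors used, so the function name, the input format and the +1 correction are assumptions made for illustration.

```python
def tdc_accepted(psms, fdr_threshold=0.01):
    """Return (accepted target count, score cutoff) under a TDC estimate.

    psms -- iterable of (score, is_decoy) pairs, higher score is better
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    best = (0, None)  # nothing accepted until a cutoff satisfies the threshold
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among targets at this cutoff (with +1 correction).
        if targets and (decoys + 1) / targets <= fdr_threshold:
            best = (targets, score)
    return best
```

With only a handful of matches the estimate moves in coarse steps (each extra decoy changes it substantially), which gives some intuition for why the abstract cautions against relying on this procedure with excessively small databases.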
14.
Methods Mol Biol ; 2139: 57-68, 2020.
Article in English | MEDLINE | ID: mdl-32462577

ABSTRACT

Proteomics encompasses efforts to identify all the proteins of a proteome, with most plant proteomics studies based on a bottom-up mass spectrometry (MS) strategy, in which the proteins are digested with trypsin and the tryptic fragments are subjected to MS analysis. The identification of proteins from MS/MS spectra has been performed using different algorithms (Mascot, Sequest) against plant protein sequence databases such as UniProtKB or NCBI_Viridiplantae. But these databases are not the best choice for nonmodel species, which are underrepresented in them, resulting in poor identification rates. A high identification rate requires a sequenced and well-annotated genome of the species under investigation. For nonmodel organisms, the identification of proteins is challenging since, in the best of cases, only hits or orthologs instead of gene products are identified. However, in the absence of a sequenced genome, this situation can be improved by using transcriptome data to generate a species-specific database against which to compare proteins. In this chapter, we report the construction of a protein database from RNA-Seq data in a nonmodel species, in this particular case Holm oak (Q. ilex).


Subjects
Quercus/genetics , Transcriptome/genetics , Computational Biology/methods , Databases, Protein , Proteins/genetics , Proteome/genetics , Proteomics/methods , Sequence Analysis, RNA/methods , Tandem Mass Spectrometry/methods
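The bottom-up strategy described in this entry depends on tryptic digestion, which search engines reproduce in silico when matching spectra to a database. The sketch below assumes the commonly used convention that trypsin cleaves after K or R unless the next residue is P; the function name and the `missed_cleavages` and `min_length` parameters are illustrative and do not correspond to Mascot or Sequest settings.

```python
import re


def tryptic_digest(protein, missed_cleavages=0, min_length=1):
    """Return in-silico tryptic peptides of a protein sequence."""
    # Split after K or R, except when the following residue is P.
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for i in range(len(pieces)):
        # Join up to `missed_cleavages` additional adjacent fragments.
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptide = "".join(pieces[i:j + 1])
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides
```

For the toy sequence `MAKGGRPLLKTT`, a digest without missed cleavages yields `MAK`, `GGRPLLK` and `TT`; the `RP` site is left uncut under the assumed proline rule.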
15.
Mol Ecol Resour ; 17(6): 1148-1155, 2017 Nov.
Article in English | MEDLINE | ID: mdl-28130873

ABSTRACT

Recent technological advances have increased the throughput of proteomics, facilitating the characterization of molecular phenotypes on the population level, thus bearing the potential to complement transcriptomic analyses. Reference protein databases are crucial for analysis and quantification, because only peptides in the protein database can be identified. Any peptide carrying an amino acid variant cannot be identified. Because most proteomic studies, even of natural populations, do not account for polymorphisms, we analysed the influence of variant peptides on quantitative proteomic analyses. We used transcriptomic and proteomic data of two Drosophila melanogaster genotypes and identified genotype-specific variants from RNA-seq data. We introduce a simple pipeline to include these variants in a polymorphism-aware protein database and compare the results to an unmodified reference database. The polymorphism-aware database not only identifies more peptides, but the quantitative values also change when peptide variants are included. We conclude that proteomic quantification is likely to be biased, in particular for small genes, when polymorphisms are ignored. Polymorphism-aware databases may therefore be a key step towards improved proteomic data analyses, especially for the analysis of pooled individuals and the comparison of population samples.


Subjects
Databases, Protein , Drosophila Proteins/analysis , Drosophila Proteins/genetics , Genetic Variation , Proteomics/methods , Animals , Drosophila melanogaster/classification , Drosophila melanogaster/genetics , Gene Expression Profiling , Genotype , Sequence Analysis, RNA
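One way to picture a polymorphism-aware database as described in this entry is a FASTA file that holds each reference protein plus a copy carrying the observed amino acid substitutions. The sketch below is a hypothetical illustration of that idea and is not the authors' pipeline; the function names, the `(position, ref, alt)` variant format and the header convention are all assumptions.

```python
def apply_variants(reference, variants):
    """Apply (1-based position, ref_aa, alt_aa) substitutions to a sequence."""
    residues = list(reference)
    for pos, ref_aa, alt_aa in variants:
        if reference[pos - 1] != ref_aa:
            raise ValueError(f"expected {ref_aa} at position {pos}")
        residues[pos - 1] = alt_aa
    return "".join(residues)


def polymorphism_aware_fasta(proteins, variants_by_protein):
    """Return FASTA text with reference and variant-carrying entries."""
    lines = []
    for name, seq in proteins.items():
        lines += [f">{name}", seq]
        variants = variants_by_protein.get(name, [])
        if variants:
            # Hypothetical header tag, e.g. "A4V" for Ala->Val at position 4.
            tag = "_".join(f"{ref}{pos}{alt}" for pos, ref, alt in variants)
            lines += [f">{name}|{tag}", apply_variants(seq, variants)]
    return "\n".join(lines) + "\n"
```

Keeping the reference entry alongside the variant entry means peptides unaffected by the substitution still match as before, while variant peptides become identifiable.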
16.
Methods Mol Biol ; 1543: 169-185, 2017.
Article in English | MEDLINE | ID: mdl-28349426

ABSTRACT

Experimental methods for identifying protein(s) bound by a specific promoter-associated RNA (paRNA) of interest can be expensive, difficult, and time-consuming. This chapter describes a general computational framework for identifying potential binding partners in RNA-protein complexes or RNA-protein interaction networks. Protocols for using three web-based tools to predict RNA-protein interaction partners are outlined. Also, tables listing additional webservers and software tools for predicting RNA-protein interactions, as well as databases that contain valuable information about known RNA-protein complexes and recognition sites for RNA-binding proteins, are provided. Although only one of the tools described, lncPro, was designed expressly to identify proteins that bind long noncoding RNAs (including paRNAs), all three approaches can be applied to predict potential binding partners for both coding and noncoding RNAs (ncRNAs).


Subjects
Computational Biology/methods , RNA-Binding Proteins/chemistry , RNA-Binding Proteins/metabolism , RNA/chemistry , RNA/metabolism , Software , Binding Sites , Computer Simulation , Databases, Genetic , Protein Binding , RNA/genetics , Search Engine , Support Vector Machine , Web Browser
17.
Nonlinear Anal Theory Methods Appl ; 65(5): 1070-1093, 2006 Sep 01.
Article in English | MEDLINE | ID: mdl-32288048

ABSTRACT

A hybrid evolutionary model is used to propose a hierarchical homology of protein sequences to identify protein functions systematically. The proposed model offers considerable potential, given the inconsistency of existing methods for predicting novel proteins. Because some novel proteins might align without meaningful conserved domains, maximizing the score of sequence alignment is not the best criterion for predicting protein functions. This work presents a decision model that can minimize the cost of making a decision for predicting protein functions using the hierarchical homologies. In particular, the model has three characteristics: (i) it is a hybrid evolutionary model with multiple fitness functions that uses genetic programming to predict protein functions on a distantly related protein family, (ii) it incorporates modified robust point matching to accurately compare all feature points using the moment invariant and thin-plate spline theorems, and (iii) the hierarchical homologies holding up a novel protein sequence in the form of a causal tree can effectively demonstrate the relationship between proteins. This work describes the comparisons of nucleocapsid proteins from the putative polyprotein SARS virus and other coronaviruses in other hosts using the model.

18.
Annu Rev Anal Chem (Palo Alto Calif) ; 9(1): 521-45, 2016 06 12.
Article in English | MEDLINE | ID: mdl-27049631

ABSTRACT

Mass spectrometry-based proteomics has emerged as the leading method for detection, quantification, and characterization of proteins. Nearly all proteomic workflows rely on proteomic databases to identify peptides and proteins, but these databases typically contain a generic set of proteins that lack variations unique to a given sample, precluding their detection. Fortunately, proteogenomics enables the detection of such proteomic variations and can be defined, broadly, as the use of nucleotide sequences to generate candidate protein sequences for mass spectrometry database searching. Proteogenomics is experiencing heightened significance due to two developments: (a) advances in DNA sequencing technologies that have made complete sequencing of human genomes and transcriptomes routine, and (b) the unveiling of the tremendous complexity of the human proteome as expressed at the levels of genes, cells, tissues, individuals, and populations. We review here the field of human proteogenomics, with an emphasis on its history, current implementations, the types of proteomic variations it reveals, and several important applications.


Subjects
Genetic Variation/genetics , Mass Spectrometry , Proteins/chemistry , Proteins/genetics , Proteogenomics , Base Sequence/genetics , Humans
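This entry defines proteogenomics as the use of nucleotide sequences to generate candidate protein sequences for database searching. A minimal sketch of one such approach, six-frame translation, is given below. It assumes the standard genetic code and simply splits each frame at stop codons; the function names and the `min_length` parameter are illustrative, and real pipelines typically add steps such as start-codon handling and redundancy filtering.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_candidates(dna, min_length=2):
    """Return stop-free peptide segments from all six reading frames."""
    dna = dna.upper()
    candidates = set()
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                if len(segment) >= min_length:
                    candidates.add(segment)
    return candidates
```

Each returned segment could then be written to a FASTA file and searched like any other protein database entry.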