ABSTRACT
This year's Database Issue of Nucleic Acids Research contains 152 papers that include descriptions of 54 new databases and update papers on 98 databases, of which 16 have not been previously featured in NAR. As always, these databases cover a broad range of molecular biology subjects, including genome structure, gene expression and its regulation, proteins, protein domains, and protein-protein interactions. Following the recent trend, an increasing number of new and established databases deal with the issues of human health, from cancer-causing mutations to drugs and drug targets. In accordance with this trend, three recently compiled databases that have been selected by NAR reviewers and editors as 'breakthrough' contributions, denovo-db, the Monarch Initiative, and Open Targets, cover human de novo gene variants, disease-related phenotypes in model organisms, and a bioinformatics platform for therapeutic target identification and validation, respectively. We expect these databases to attract the attention of numerous researchers working in various areas of genetics and genomics. Looking back at the past 12 years, we present here the 'golden set' of databases that have consistently served as authoritative, comprehensive, and convenient data resources widely used by the entire community and offer some lessons on what makes a successful database. The Database Issue is freely available online at the https://academic.oup.com/nar web site. An updated version of the NAR Molecular Biology Database Collection is available at http://www.oxfordjournals.org/nar/database/a/.
Subjects
Databases, Nucleic Acid/trends; Databases, Protein/trends; Databases, Chemical/trends; Genomics; Humans
ABSTRACT
Identification of all phosphorylation forms of known proteins is a major goal of the Chromosome-Centric Human Proteome Project (C-HPP). Recent studies have found that certain phosphoproteins can be encapsulated in exosomes and function as key regulators in the tumor microenvironment, but no deep-coverage phosphoproteome of human exosomes has been reported to date, which makes the exosome a potential source for new phosphosite discovery. In this study, we performed highly optimized MS analyses on the exosomal and cellular proteins isolated from human colorectal cancer SW620 cells. With stringent data quality control, 313 phosphoproteins with 1091 phosphosites were confidently identified from the SW620 exosome, from which 202 new phosphosites were detected. Exosomal phosphoproteins were significantly enriched in the 11q12.1-13.5 region of chromosome 11 and had a remarkably high level of tyrosine-phosphorylated proteins (6.4%), which were functionally relevant to ephrin signaling pathway-directed cytoskeleton remodeling. In conclusion, we here report the first high-coverage phosphoproteome of human cell-secreted exosomes, which leads to the identification of new phosphosites for C-HPP. Our findings provide insights into the exosomal phosphoprotein systems that help to understand the signaling language being delivered by exosomes in cell-cell communications. The mass spectrometry proteomics data have been deposited to the ProteomeXchange consortium with the data set identifier PXD004079, and iProX database (accession number: IPX00076800).
Subjects
Colorectal Neoplasms/pathology; Databases, Protein/trends; Exosomes; Phosphoproteins/analysis; Proteome/genetics; Cell Communication; Cell Line, Tumor; Chromosomes, Human, Pair 11/genetics; Colorectal Neoplasms/genetics; Human Genome Project; Humans; Mass Spectrometry; Neoplasm Proteins; Phosphopeptides/analysis; Phosphoproteins/genetics; Proteomics/methods; Signal Transduction
ABSTRACT
The review covers about fifty years of progress in "proteome" analysis, starting from primitive two-dimensional (2D) map attempts in the early sixties of the last century. The polar star in 2D mapping arose in 1975 with the classic paper by O'Farrell in J. Biol. Chem. It became the compass for all proteome navigators. Perfection came, though, only with the introduction of immobilized pH gradients, which fixed the polypeptide spots in the 2D plane. Great impetus in proteome analysis came with the introduction of informatics tools and the creation of databases, among which Swiss-Prot remains the site of excellence. Towards the end of the nineties, 2D chromatography, epitomized by coupling strong cation exchangers with C18 resins, began to be a serious challenge to electrophoretic 2D mapping, although up to the present both techniques are still much in vogue and appear to give complementary results. Yet the migration of "proteomics" into the third millennium was made possible only by mass spectrometry (MS), which today represents the standard analytical tool in any lab dealing with proteomic analysis. Another major improvement has been the introduction of combinatorial peptide ligand libraries (CPLL), which, when properly used, enhance the visibility of low-abundance species by 3 to 4 orders of magnitude. Coupling MS to CPLLs permits the exploration of at least 8 orders of magnitude in dynamic range on any proteome. BIOLOGICAL SIGNIFICANCE: The present review is a personal recollection highlighting the developments that led to present-day proteomics on a long march that lasted about 50 years. It is meant to give young scientists an overview of how science grows, which are the quantum jumps in science, and which research is of particular significance in general and in the field of proteomics in particular. It also gives some real-life episodes of larger-than-life figures.
As such, it can be viewed as a tutorial to stimulate the young generation to be creative (and use their imagination too!). This article is part of a Special Issue entitled: 20 years of Proteomics in memory of Vitaliano Pallini. Guest Editors: Luca Bini, Juan J. Calvete, Natacha Turck, Denis Hochstrasser and Jean-Charles Sanchez.
Subjects
Proteomics/history; Proteomics/methods; Proteomics/trends; Databases, Protein/history; Databases, Protein/trends; History, 20th Century; History, 21st Century; Humans; Peptide Library; Proteomics/instrumentation
ABSTRACT
BACKGROUND: Protein kinases are involved in relevant physiological functions and a broad number of mutations in this superfamily have been reported in the literature to affect protein function and stability. Unfortunately, the exploration of the consequences on the phenotypes of each individual mutation remains a considerable challenge. RESULTS: The wKinMut web-server offers direct prediction of the potential pathogenicity of the mutations from a number of methods, including our recently developed prediction method based on the combination of information from a range of diverse sources, including physicochemical properties and functional annotations from FireDB and Swissprot and kinase-specific characteristics such as the membership to specific kinase groups, the annotation with disease-associated GO terms or the occurrence of the mutation in PFAM domains, and the relevance of the residues in determining kinase subfamily specificity from S3Det. This predictor yields interesting results that compare favourably with other methods in the field when applied to protein kinases. Together with the predictions, wKinMut offers a number of integrated services for the analysis of mutations. These include: the classification of the kinase, information about associations of the kinase with other proteins extracted from iHop, the mapping of the mutations onto PDB structures, pathogenicity records from a number of databases and the classification of mutations in large-scale cancer studies. Importantly, wKinMut is connected with the SNP2L system that extracts mentions of mutations directly from the literature, and therefore increases the possibilities of finding interesting functional information associated with the studied mutations.
CONCLUSIONS: wKinMut facilitates the exploration of the information available about individual mutations by integrating prediction approaches with the automatic extraction of information from the literature (text mining) and several state-of-the-art databases. wKinMut has been used during the last year for the analysis of the consequences of mutations in the context of a number of cancer genome projects, including the recent analysis of Chronic Lymphocytic Leukemia cases, and is publicly available at http://wkinmut.bioinfo.cnio.es.
Subjects
Computational Biology/methods; Leukemia, Lymphocytic, Chronic, B-Cell/enzymology; Leukemia, Lymphocytic, Chronic, B-Cell/genetics; Mutation/genetics; Protein Kinases/chemistry; Databases, Protein/trends; ErbB Receptors/genetics; Humans; Information Storage and Retrieval/methods; Leukemia, Lymphocytic, Chronic, B-Cell/etiology; Phenotype; Predictive Value of Tests; Protein Kinases/classification; Protein Kinases/genetics; Protein Stability
ABSTRACT
High-throughput genome sequencing continues to accelerate the rate at which complete genomes are available for biological research. Many of these new genome sequences have little or no genome annotation currently available and hence rely upon computational predictions of protein coding genes. Evidence of translation from proteomic techniques could facilitate experimental validation of protein coding genes, but the techniques for whole genome searching with MS/MS data have not been adequately developed to date. Here we describe GENQUEST, a novel method using peptide isoelectric focusing and accurate mass to greatly reduce the peptide search space, making fast, accurate, and sensitive whole human genome searching possible on common desktop computers. In an initial experiment, almost all exonic peptides identified in a protein database search were identified when searching genomic sequence. Many peptides identified exclusively in the genome searches were incorrectly identified or could not be experimentally validated, highlighting the importance of orthogonal validation. Experimentally validated peptides exclusive to the genomic searches can be used to reannotate protein coding genes. GENQUEST represents an experimental tool that can be used by the proteomics community at large for validating computational approaches to genome annotation.
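The search-space reduction described above can be illustrated with a minimal sketch. It is not the GENQUEST implementation: the `Candidate` record, the function names, the tolerance values, and the example peptides are all hypothetical. The sketch assumes that each candidate peptide already carries a theoretical mass and a predicted isoelectric point, and keeps only the candidates that agree with both the accurately measured precursor mass and the pI range of the focusing fraction.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """A candidate peptide from a translated genome (illustrative fields)."""
    sequence: str
    mass: float          # theoretical monoisotopic mass in Da
    predicted_pi: float  # predicted isoelectric point


def within_ppm(theoretical, observed, tolerance_ppm):
    """True if the relative mass error is within the ppm tolerance."""
    return abs(theoretical - observed) / observed * 1e6 <= tolerance_ppm


def reduce_search_space(candidates, observed_mass, tolerance_ppm, pi_low, pi_high):
    """Keep candidates matching both the measured mass and the pI window."""
    return [
        c for c in candidates
        if within_ppm(c.mass, observed_mass, tolerance_ppm)
        and pi_low <= c.predicted_pi <= pi_high
    ]


# Invented candidates and placeholder tolerances, for illustration only.
candidates = [
    Candidate("PEPTIDEA", 1000.000, 4.5),  # mass and pI both match
    Candidate("PEPTIDEB", 1000.004, 6.0),  # mass matches, pI outside window
    Candidate("PEPTIDEC", 1000.020, 4.5),  # pI matches, mass off by 20 ppm
]
survivors = reduce_search_space(
    candidates, observed_mass=1000.0, tolerance_ppm=5.0, pi_low=4.0, pi_high=5.0
)
```

Only the candidates that survive both filters would then need to be scored against the MS/MS spectrum, which is the sense in which the two measurements shrink the search.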
Subjects
Databases, Protein/trends; Documentation/methods; Proteomics/methods; Tandem Mass Spectrometry/methods; Cell Line, Tumor; Genome, Human; Genomics/methods; Humans; Isoelectric Focusing
ABSTRACT
BACKGROUND: The SUPFAM database is a compilation of superfamily relationships between protein domain families of either known or unknown 3-D structure. In SUPFAM, sequence families from Pfam and structural families from SCOP are associated, using profile matching, to result in sequence superfamilies of known structure. Subsequently all-against-all family profile matches are made to deduce a list of new potential superfamilies of yet unknown structure. DESCRIPTION: The current version of SUPFAM (release 1.4) corresponds to significant enhancements and major developments compared to the earlier and basic version. In the present version we have used RPS-BLAST, which is robust and sensitive, for profile matching. The reliability of connections between protein families is ensured better than before by use of benchmarked criteria involving strict e-value cut-off and a minimal alignment length condition. An e-value based indication of reliability of connections is now presented in the database. Web access to an RPS-BLAST-based tool to associate a query sequence to one of the family profiles in SUPFAM is available with the current release. In terms of the scientific content the present release of SUPFAM is entirely reorganized with the use of 6190 Pfam families and 2317 structural families derived from SCOP. Due to a steep increase in the number of sequence and structural families used in SUPFAM the details of scientific content in the present release are almost entirely complementary to the previous basic version. Of the 2286 families, we could relate 245 Pfam families with apparently no structural information to families of known 3-D structures, thus resulting in the identification of new families in the existing superfamilies. Using the profiles of 3904 Pfam families of yet unknown structure, an all-against-all comparison involving sequence-profile match resulted in clustering of 96 Pfam families into 39 new potential superfamilies.
CONCLUSION: SUPFAM presents many non-trivial superfamily relationships of sequence families involved in a variety of functions and hence the information content is of interest to a wide scientific community. The grouping of related proteins without a known structure in SUPFAM is useful in identifying priority targets for structural genomics initiatives and in the assignment of putative functions. Database URL: http://pauling.mbu.iisc.ernet.in/~supfam.
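The two reliability criteria mentioned in the description, an e-value cut-off and a minimal alignment length, can be sketched as a simple filter over profile matches. This is an illustration under stated assumptions, not SUPFAM's code: the `ProfileHit` record, the `reliable_hits` function, the example hits, and the threshold values are hypothetical, and the abstract does not state the benchmarked values actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileHit:
    """One family-to-family profile match (all fields are illustrative)."""
    query_family: str
    target_family: str
    e_value: float
    alignment_length: int


def reliable_hits(hits, max_e_value, min_alignment_length):
    """Keep hits that pass both the e-value cut-off and the length condition."""
    return [
        hit for hit in hits
        if hit.e_value <= max_e_value
        and hit.alignment_length >= min_alignment_length
    ]


# Invented hits and placeholder thresholds; not SUPFAM's benchmarked values.
hits = [
    ProfileHit("FamA", "FamB", 1e-12, 120),  # passes both criteria
    ProfileHit("FamA", "FamC", 1e-12, 25),   # alignment too short
    ProfileHit("FamD", "FamE", 0.5, 150),    # e-value too weak
]
kept = reliable_hits(hits, max_e_value=1e-3, min_alignment_length=50)
```

Requiring both conditions at once guards against short, high-scoring local matches as well as long but statistically weak ones.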
Subjects
Amino Acid Sequence; Databases, Protein/trends; Peptides/chemistry; Proteins/chemistry; Computational Biology/methods; Protein Structure, Tertiary
ABSTRACT
Human Protein Reference Database (HPRD) is an object database that integrates a wealth of information relevant to the function of human proteins in health and disease. Data pertaining to thousands of protein-protein interactions, posttranslational modifications, enzyme/substrate relationships, disease associations, tissue expression, and subcellular localization were extracted from the literature for a nonredundant set of 2750 human proteins. Almost all the information was obtained manually by biologists who read and interpreted >300,000 published articles during the annotation process. This database, which has an intuitive query interface allowing easy access to all the features of proteins, was built by using open source technologies and will be freely available at http://www.hprd.org to the academic community. This unified bioinformatics platform will be useful in cataloging and mining the large number of proteomic interactions and alterations that will be discovered in the postgenomic era.
Subjects
Databases, Protein/trends; BRCA1 Protein/physiology; Computational Biology/methods; Genetics, Medical/methods; Humans; Macromolecular Substances; Protein Interaction Mapping/trends; Protein Processing, Post-Translational/physiology; Protein Structure, Quaternary/physiology; Protein Structure, Tertiary/physiology; Substrate Specificity/physiology
ABSTRACT
FACTS (Functional Association/Annotation of cDNA Clones from Text/Sequence Sources) is a semiautomated knowledge discovery and annotation system that integrates molecular function information derived from sequence analysis results (sequence inferred) with functional information extracted from text. Text-inferred information was extracted from keyword-based retrievals of MEDLINE abstracts and by matching of gene or protein names to OMIM, BIND, and DIP database entries. Using FACTS, we found that 47.5% of the 60,770 RIKEN mouse cDNA FANTOM2 clone annotations were informative for text searches. MEDLINE queries yielded molecular interaction-containing sentences for 23.1% of the clones. When disease MeSH and GO terms were matched with retrieved abstracts, 22.7% of clones were associated with potential diseases, and 32.5% with GO identifiers. A significant number (23.5%) of disease MeSH-associated clones were also found to have a hereditary disease association (OMIM Morbidmap). Inferred neoplastic and nervous system disease represented 49.6% and 36.0% of disease MeSH-associated clones, respectively. A comparison of sequence-based GO assignments with informative text-based GO assignments revealed that for 78.2% of clones, identical GO assignments were provided for that clone by either method, whereas for 21.8% of clones, the assignments differed. In contrast, for OMIM assignments, only 28.5% of clones had identical sequence-based and text-based OMIM assignments. Sequence, sentence, and term-based functional associations are included in the FACTS database (http://facts.gsc.riken.go.jp/), which permits results to be annotated and explored through web-accessible keyword and sequence search interfaces. The FACTS database will be a critical tool for investigating the functional complexity of the mouse transcriptome, cDNA-inferred interactome (molecular interactions), and pathome (pathologies).
Subjects
DNA, Complementary/physiology; Databases, Genetic/trends; Information Management/methods; Information Management/trends; Software/trends; Animals; DNA, Complementary/genetics; Databases, Protein/trends; Information Storage and Retrieval/methods; Information Storage and Retrieval/trends; MEDLINE/trends; Mice; Protein Interaction Mapping/methods; Protein Interaction Mapping/trends
ABSTRACT
BACKGROUND: We wished to compare two databases based on sequence similarity: one that aims to be comprehensive in its coverage of known sequences, and one that specialises in a relatively small subset of known sequences. One of the motivations behind this study was quality control. Pfam is a comprehensive collection of alignments and hidden Markov models representing families of proteins and domains. MEROPS is a catalogue and classification of enzymes with proteolytic activity (peptidases or proteases). These secondary databases are used by researchers worldwide, yet their contents are not peer reviewed. Therefore, we hoped that a systematic comparison of the contents of Pfam and MEROPS would highlight missing members and false-positives leading to improvements in quality of both databases. An additional reason for carrying out this study was to explore the extent of consensus in the definition of a protein family. RESULTS: About half (89 out of 174) of the peptidase families in MEROPS overlapped single Pfam families. A further 32 MEROPS families overlapped multiple Pfam families. Where possible, new Pfam families were built to represent most of the MEROPS families that did not overlap Pfam. When comparing the numbers of sequences found in the overlap between a MEROPS family and its corresponding Pfam family, in most cases the overlap was substantial (52 pairs of MEROPS and Pfam families had an intersection size of greater than 75% of the union) but there were some differences in the sets of sequences included in the MEROPS families versus the overlapping Pfam families. CONCLUSIONS: A number of the discrepancies between MEROPS families and their corresponding Pfam families arose from differences in the aims and philosophies of the two databases. Examination of some of the discrepancies highlighted additional members of families, which have subsequently been added in both Pfam and MEROPS. This has led to improvements in the quality of both databases. 
Overall there was a great deal of consensus between the databases in definitions of a protein family.
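The overlap measure used in the results, intersection size as a fraction of the union, can be sketched as follows. The function name and the accession lists are hypothetical and serve only to show the arithmetic; the 75% threshold is the one quoted in the abstract.

```python
def overlap_fraction(merops_members, pfam_members):
    """Intersection size divided by union size for two sets of sequence IDs."""
    a, b = set(merops_members), set(pfam_members)
    union = a | b
    if not union:
        return 0.0  # two empty families share nothing by convention
    return len(a & b) / len(union)


# Hypothetical accession lists, invented purely for illustration.
merops_family = {"P00001", "P00002", "P00003", "P00004"}
pfam_family = {"P00002", "P00003", "P00004", "P00005"}

fraction = overlap_fraction(merops_family, pfam_family)  # 3 shared of 5 total
substantial = fraction > 0.75  # threshold quoted in the abstract
```

In this invented example three of five sequences are shared, so the pair would fall below the 75% mark even though most members of each family agree.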