Abstract
MOPED (Multi-Omics Profiling Expression Database; http://moped.proteinspire.org) has transitioned from solely a protein expression database to a multi-omics resource for human and model organisms. Through a web-based interface, MOPED presents consistently processed data for gene, protein and pathway expression. To improve data quality, consistency and use, MOPED includes metadata detailing experimental design and analysis methods. The multi-omics data are integrated through direct links between genes and proteins and further connected to pathways and experiments. MOPED now contains over 5 million records, information for approximately 75,000 genes and 50,000 proteins from four organisms (human, mouse, worm, yeast). These records correspond to 670 unique combinations of experiment, condition, localization and tissue. MOPED includes the following new features: pathway expression, Pathway Details pages, experimental metadata checklists, experiment summary statistics and more advanced searching tools. Advanced searching enables querying for genes, proteins, experiments, pathways and keywords of interest. The system is enhanced with visualizations for comparing across different data types. In the future MOPED will expand the number of organisms, increase integration with pathways and provide connections to disease.
Subjects
Genetic Databases , Gene Expression Profiling , Proteomics , Animals , Humans , Internet , Mice , Proteins/genetics , Proteins/metabolism
Abstract
Although biological science discovery often involves comparing conditions to a normal state, in proteomics little is actually known about normal. Two Human Proteome studies featured in Nature offer new insights into protein expression and an opportunity to assess how high-throughput proteomics measures normal protein ranges. We use data from these studies to estimate technical and biological variability in protein expression and compare them to other expression data sets from normal tissue. Results show that measured protein expression across same-tissue replicates varies by ±4- to 10-fold for most proteins. Coefficients of variation (CV) for protein expression measurements range from 62% to 117% across different tissue experiments; however, adjusting for technical variation reduced this variability by as much as 50%. In addition, the CV could be reduced by limiting comparisons to proteins with 3 or more unique peptide identifications, as their CV was on average 33% lower than that of proteins with 2 or fewer peptide identifications. We also selected 13 housekeeping proteins and genes that were expressed across all tissues with low variability to determine their utility as a reference set for normalization and comparative purposes. These results present the first step toward estimating normal protein ranges by determining the variability in expression measurements through combining publicly available data. They support an approach that combines standard protocols with replicates of normal tissues to estimate normal protein ranges for large numbers of proteins and tissues. This would be a tremendous resource for normal cellular physiology and comparisons of proteomics studies.
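As a rough illustration of the two variability measures quoted above, the following Python sketch computes a coefficient of variation and a max/min fold range for one protein across replicates. The abstract does not say exactly how these quantities were calculated, so the sample standard deviation, the function name and the replicate values are all assumptions made for the example.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the CV (sample standard deviation / mean) as a percentage."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * stdev(values) / m


# Hypothetical abundance values for one protein in three same-tissue replicates.
replicates = [10.0, 20.0, 30.0]

cv = coefficient_of_variation(replicates)
fold_range = max(replicates) / min(replicates)  # largest over smallest measurement

print(f"CV = {cv:.1f}%, fold range = {fold_range:.1f}x")
```

With real data the same calculation would be repeated per protein and summarized per tissue experiment.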
Subjects
High-Throughput Screening Assays , Proteins/metabolism , Proteomics , Humans , Reference Values , Reproducibility of Results
Abstract
Data fully utilized by the community resources promote progress rather than repetition. Effective data sharing can accelerate the transition from data to actionable knowledge, yet barriers to data sharing remain, both technological and procedural. The DELSA community has tackled the sharing barrier by creating a multi-omics metadata checklist for the life sciences. The checklist and associated data publication examples are now jointly published in Big Data and OMICS: A Journal of Integrative Biology. The checklist will enable diverse datasets to be easily harmonized and reused for richer analyses. It will facilitate data deposits, stand alone as a data publication, and grant appropriate credit to researchers. We invite the broader life sciences community to test the checklist for feedback and improvements.
Subjects
Checklist/statistics & numerical data , Computational Biology/organization & administration , Information Dissemination , Humans , Publishing/organization & administration
Abstract
The Model Organism Protein Expression Database (MOPED, http://moped.proteinspire.org) is an expanding proteomics resource to enable biological and biomedical discoveries. MOPED aggregates simple, standardized and consistently processed summaries of protein expression and metadata from proteomics (mass spectrometry) experiments from human and model organisms (mouse, worm, and yeast). The latest version of MOPED adds new estimates of protein abundance and concentration as well as relative (differential) expression data. MOPED provides an updated query interface that allows users to explore information by organism, tissue, localization, condition, experiment, or keyword. MOPED supports the Human Proteome Project's efforts to generate chromosome- and disease-specific proteomes by providing links from proteins to chromosome and disease information as well as many complementary resources. MOPED supports a new omics metadata checklist to harmonize data integration, analysis, and use. MOPED's development is driven by the user community, which spans 90 countries and guides future development that will transform MOPED into a multiomics resource. MOPED encourages users to submit data in a simple format. They can use the metadata checklist to generate a data publication for this submission. As a result, MOPED will provide even greater insights into complex biological processes and systems and enable deeper and more comprehensive biological and biomedical discoveries.
Subjects
Protein Databases , Proteomics , Animals , Humans , User-Computer Interface
Abstract
Life science technologies generate a deluge of data that hold the keys to unlocking the secrets of important biological functions and disease mechanisms. We present DEAP, Differential Expression Analysis for Pathways, which capitalizes on information about biological pathways to identify important regulatory patterns from differential expression data. DEAP makes significant improvements over existing approaches by including information about pathway structure and discovering the most differentially expressed portion of the pathway. On simulated data, DEAP significantly outperformed traditional methods: with high differential expression, DEAP increased power by two orders of magnitude; with very low differential expression, DEAP doubled the power. DEAP performance was illustrated on two different gene and protein expression studies. DEAP discovered fourteen important pathways related to chronic obstructive pulmonary disease and interferon treatment that existing approaches omitted. On the interferon study, DEAP guided focus towards a four-protein path within the 26-protein Notch signalling pathway.
Subjects
Computational Biology/methods , Gene Expression Profiling/methods , Biological Models , Signal Transduction , Algorithms , Computer Simulation , Genetic Databases , Disease/genetics , Humans , Reproducibility of Results
Abstract
Large numbers of mass spectrometry proteomics studies are being conducted to understand all types of biological processes. The size and complexity of proteomics data hinder efforts to easily share, integrate, query and compare the studies. The Model Organism Protein Expression Database (MOPED, http://moped.proteinspire.org) is a new and expanding proteomics resource that enables rapid browsing of protein expression information from publicly available studies on humans and model organisms. MOPED is designed to simplify the comparison and sharing of proteomics data for the greater research community. MOPED uniquely provides protein level expression data, meta-analysis capabilities and quantitative data from standardized analysis. Data can be queried for specific proteins, browsed based on organism, tissue, localization and condition and sorted by false discovery rate and expression. MOPED empowers users to visualize their own expression data and compare it with existing studies. Further, MOPED links to various protein and pathway databases, including GeneCards, Entrez, UniProt, KEGG and Reactome. The current version of MOPED contains over 43,000 proteins with at least one spectral match and more than 11 million high certainty spectra.
Subjects
Protein Databases , Proteins/metabolism , Animals , Humans , Mass Spectrometry , Mice , Animal Models , Proteomics , User-Computer Interface
Abstract
Since 1998, the bioinformatics, systems biology, genomics and medical communities have enjoyed a synergistic relationship with the GeneCards database of human genes (http://www.genecards.org). This human gene compendium was created to help introduce order into the increasing chaos of information flow. As a consequence of viewing details and deep links related to specific genes, users have often requested enhanced capabilities, such that, over time, GeneCards has blossomed into a suite of tools (including GeneDecks, GeneALaCart, GeneLoc, GeneNote and GeneAnnot) for a variety of analyses of both single human genes and sets thereof. In this paper, we focus on in-house and external research activities which have been enabled, enhanced, complemented and, in some cases, motivated by GeneCards. In turn, such interactions have often inspired and propelled improvements in GeneCards. We describe here the evolution and architecture of this project, including examples of synergistic applications in diverse areas such as synthetic lethality in cancer, the annotation of genetic variations in disease, omics integration in a systems biology approach to kidney disease, and bioinformatics tools.
Subjects
Genetic Databases , Genes/genetics , Human Genome , Genomics , Computational Biology , Humans
Abstract
MOTIVATION: Enrichment tests are used in high-throughput experimentation to measure the association between gene or protein expression and membership in groups or pathways. The Fisher's exact test is commonly used. We specifically examined the associations produced by the Fisher test between protein identification by mass spectrometry discovery proteomics, and their Gene Ontology (GO) term assignments in a large yeast dataset. We found that direct application of the Fisher test is misleading in proteomics due to the bias in mass spectrometry to preferentially identify proteins based on their biochemical properties. False inference about associations can be made if this bias is not corrected. Our method adjusts Fisher tests for these biases and produces associations more directly attributable to protein expression rather than experimental bias. RESULTS: Using logistic regression, we modeled the association between protein identification and GO term assignments while adjusting for identification bias in mass spectrometry. The model accounts for five biochemical properties of peptides: (i) hydrophobicity, (ii) molecular weight, (iii) transfer energy, (iv) beta turn frequency and (v) isoelectric point. The model was fit on 181 060 peptides from 2678 proteins identified in 24 yeast proteomics datasets with a 1% false discovery rate. In analyzing the association between protein identification and their GO term assignments, we found that 25% (134 out of 544) of Fisher tests that showed significant association (q-value ≤0.05) were non-significant after adjustment using our model. Simulations generating yeast protein sets enriched for identification propensity show that unadjusted enrichment tests were biased while our approach worked well.
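For readers unfamiliar with the unadjusted test that the abstract criticizes, here is a minimal Python sketch of a one-sided Fisher exact enrichment test computed from the hypergeometric tail. It shows only the plain test; the logistic-regression adjustment for identification bias described above is not reproduced. The function name, argument names and counts are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k, n_identified, n_term, n_total):
    """One-sided Fisher exact p-value, P(X >= k), under the hypergeometric null.

    k            -- identified proteins annotated with the GO term
    n_identified -- all identified proteins
    n_term       -- all proteins annotated with the GO term
    n_total      -- all proteins in the background set
    """
    upper = min(n_identified, n_term)
    tail = sum(
        comb(n_term, i) * comb(n_total - n_term, n_identified - i)
        for i in range(k, upper + 1)
    )
    return tail / comb(n_total, n_identified)


# Hypothetical counts, chosen only to keep the arithmetic small.
p_value = fisher_enrichment_p(k=4, n_identified=5, n_term=5, n_total=10)
print(f"unadjusted enrichment p-value: {p_value:.4f}")
```

The test treats every protein as equally likely to be identified, which is the assumption the abstract argues does not hold in mass spectrometry.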
Subjects
Mass Spectrometry/methods , Proteins/classification , Proteomics/methods , Fungal Proteins/chemistry , Fungal Proteins/metabolism , Hydrophobic and Hydrophilic Interactions , Logistic Models , Peptides/chemistry , Proteins/chemistry , Proteins/genetics
Abstract
MS-based proteomics characterizes protein contents of biological samples. The most common approach is to first match observed MS/MS peptide spectra against theoretical spectra from a protein sequence database and then to score these matches. The false discovery rate (FDR) can be estimated as a function of the score by searching together the protein sequence database and its randomized version and comparing the score distributions of the randomized versus nonrandomized matches. This work introduces a straightforward isotonic regression-based method to estimate the cumulative FDRs and local FDRs (LFDRs) of peptide identification. Our isotonic method not only performed as well as other methods used for comparison, but also has the advantages of being: (i) monotonic in the score, (ii) computationally simple, and (iii) not dependent on assumptions about score distributions. We demonstrate the flexibility of our approach by using it to estimate FDRs and LFDRs for protein identification using summaries of the peptide spectra scores. We reconfirmed that several of these methods were superior to a two-peptide rule. Finally, by estimating both the FDRs and LFDRs, we showed for both peptide and protein identification, moderate FDR values (5%) corresponded to large LFDR values (53 and 60%).
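The idea of estimating the FDR from a combined search against real and randomized sequences can be sketched in a few lines of Python. This is a simple counting estimator made monotone with a running minimum, not the isotonic regression method the abstract introduces. The function name, the (score, is_decoy) input layout and the example scores are assumptions, and tied scores are not given special treatment.

```python
def decoy_fdr(matches):
    """Estimate a cumulative FDR at each score threshold from decoy counts.

    matches -- list of (score, is_decoy) pairs from one combined search.
    Returns (score, estimated_fdr) pairs sorted by descending score.
    """
    ordered = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = decoys = 0
    raw = []
    for _, is_decoy in ordered:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Decoy matches above the threshold stand in for false target matches.
        raw.append(decoys / max(targets, 1))

    # Running minimum from the low-score end, so the estimate never
    # decreases as the score threshold is relaxed.
    monotone = raw[:]
    for i in range(len(monotone) - 2, -1, -1):
        monotone[i] = min(monotone[i], monotone[i + 1])
    return [(score, fdr) for (score, _), fdr in zip(ordered, monotone)]


# Hypothetical scored matches: True marks a hit to the randomized database.
example = [(9.0, False), (8.0, False), (7.0, True),
           (6.0, False), (5.0, True), (4.0, False)]
estimates = decoy_fdr(example)
for score, fdr in estimates:
    print(f"score >= {score}: estimated FDR {fdr:.3f}")
```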
Subjects
Computational Biology , Protein Databases , Peptides/analysis , Proteins/analysis
Abstract
MOTIVATION: The false discovery rate (FDR) has been widely adopted to address the multiple comparisons issue in high-throughput experiments such as microarray gene-expression studies. However, while the FDR is quite useful as an approach to limit false discoveries within a single experiment, like other multiple comparison corrections it may be an inappropriate way to compare results across experiments. This article uses several examples based on gene-expression data to demonstrate the potential misinterpretations that can arise from using FDR to compare across experiments. Researchers should be aware of these pitfalls and wary of using FDR to compare experimental results. FDR should be augmented with other measures such as p-values and expression ratios. It is worth including standard error and variance information for meta-analyses and, if possible, the raw data for re-analyses. This is especially important for high-throughput studies because data are often re-used for different objectives, including comparing common elements across many experiments. No single error rate or data summary may be appropriate for all of the different objectives.
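One way to see why FDR values do not transfer between experiments is to compute adjusted values for two invented experiments that share a gene with the same p-value. The Python sketch below uses the Benjamini-Hochberg adjustment. The abstract does not name a specific FDR procedure, so that choice, the function name and all p-values are assumptions.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping a running minimum.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical experiments; the first gene has p = 0.01 in both.
experiment_a = [0.01, 0.011, 0.012, 0.013]  # many other small p-values
experiment_b = [0.01, 0.50, 0.70, 0.90]     # few other small p-values

adj_a = bh_adjust(experiment_a)
adj_b = bh_adjust(experiment_b)
print(f"same p-value, adjusted: A = {adj_a[0]:.3f}, B = {adj_b[0]:.3f}")
```

The adjusted value for a gene depends on the other p-values in its experiment, which is why the abstract recommends reporting p-values and expression ratios alongside the FDR.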
Subjects
Algorithms , Artifacts , Statistical Data Interpretation , False Positive Reactions , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Reproducibility of Results , Sensitivity and Specificity
Abstract
Staphylococcus aureus is a major cause of hospital-acquired pneumonia and is emerging as an important etiological agent of community-acquired pneumonia. Little is known about the specific host-pathogen interactions that occur when S. aureus first enters the airway. A shotgun proteomics approach was utilized to identify the airway proteins associated with S. aureus during the first 6 h of infection. Host proteins eluted from bacteria recovered from the airways of mice 30 min or 6 h following intranasal inoculation under anesthesia were subjected to liquid chromatography and tandem mass spectrometry. A total of 513 host proteins were associated with S. aureus 30 min and/or 6 h postinoculation. A majority of the identified proteins were host cytosolic proteins, suggesting that S. aureus was rapidly internalized by phagocytes in the airway and that significant host cell lysis occurred during early infection. In addition, extracellular matrix and secreted proteins, including fibronectin, antimicrobial peptides, and complement components, were associated with S. aureus at both time points. The interaction of 12 host proteins shown to bind to S. aureus in vitro was demonstrated in vivo for the first time. The association of hemoglobin, which is thought to be the primary staphylococcal iron source during infection, with S. aureus in the airway was validated by immunoblotting. Thus, we used our recently developed S. aureus pneumonia model and shotgun proteomics to validate previous in vitro findings and to identify nearly 500 other proteins that interact with S. aureus in vivo. The data presented here provide novel insights into the host-pathogen interactions that occur when S. aureus enters the airway.
Subjects
Host-Pathogen Interactions , Pneumonia/microbiology , Proteins/isolation & purification , Staphylococcal Infections/microbiology , Staphylococcus aureus/chemistry , Animals , Bronchoalveolar Lavage Fluid/chemistry , Bronchoalveolar Lavage Fluid/microbiology , Liquid Chromatography , Female , Humans , Immunoblotting , Male , Mice , Inbred C57BL Mice , Protein Binding , Proteins/chemistry , Proteome/analysis , Proteome/isolation & purification , Tandem Mass Spectrometry
Abstract
Pneumonia caused by Staphylococcus aureus is a growing concern in the health care community. We hypothesized that characterization of the early innate immune response to bacteria in the lungs would provide insight into the mechanisms used by the host to protect itself from infection. An adult mouse model of Staphylococcus aureus pneumonia was utilized to define the early events in the innate immune response and to assess the changes in the airway proteome during the first 6 h of pneumonia. S. aureus actively replicated in the lungs of mice inoculated intranasally under anesthesia to cause significant morbidity and mortality. By 6 h postinoculation, the release of proinflammatory cytokines caused effective recruitment of neutrophils to the airway. Neutrophil influx, loss of alveolar architecture, and consolidated pneumonia were observed histologically 6 h postinoculation. Bronchoalveolar lavage fluids from mice inoculated with phosphate-buffered saline (PBS) or S. aureus were depleted of overabundant proteins and subjected to strong cation exchange fractionation followed by liquid chromatography and tandem mass spectrometry to identify the proteins present in the airway. No significant changes in response to PBS inoculation or 30 min following S. aureus inoculation were observed. However, a dramatic increase in extracellular proteins was observed 6 h postinoculation with S. aureus, with the increase dominated by inflammatory and coagulation proteins. The data presented here provide a comprehensive evaluation of the rapid and vigorous innate immune response mounted in the host airway during the earliest stages of S. aureus pneumonia.
Subjects
Staphylococcal Pneumonia/immunology , Proteome/immunology , Staphylococcal Infections/immunology , Animals , Western Blotting , Bronchoalveolar Lavage Fluid/chemistry , Bronchoalveolar Lavage Fluid/cytology , Liquid Chromatography , Cytokines/analysis , Cytokines/immunology , Female , Lung/microbiology , Lung/pathology , Male , Mice , Inbred C57BL Mice , Neutrophil Infiltration/immunology , Staphylococcal Pneumonia/microbiology , Staphylococcal Pneumonia/pathology , Staphylococcal Infections/pathology , Staphylococcus aureus
Abstract
Increasingly, we are aware as a community of the growing need to manage the avalanche of genomic and metagenomic data, in addition to related data types like ribosomal RNA and barcode sequences, in a way that tightly integrates contextual data with traditional literature in a machine-readable way. It is for this reason that the Genomic Standards Consortium (GSC) formed in 2005. Here we suggest that we move beyond the development of standards and tackle standards compliance and improved data capture at the level of the scientific publication. We are supported in this goal by the fact that the scientific community is in the midst of a publishing revolution. This revolution is marked by a growing shift away from a traditional dichotomy between "journal articles" and "database entries" and an increasing adoption of hybrid models of collecting and disseminating scientific information. With respect to genomes and metagenomes and related data types, we feel the scientific community would be best served by the immediate launch of a central repository of short, highly structured "Genome Notes" that must be standards compliant. This could be done in the context of an existing journal, but we also suggest the more radical solution of launching a new journal. Such a journal could be designed to cater to a wide range of standards-related content types that are not currently centralized in the published literature. It could also support the demand for centralizing aspects of the "gray literature" (documents developed by institutions or communities) such as the call by the GSC for a central repository of Standard Operating Procedures describing the genomic annotation pipelines of the major sequencing centers. We argue that such an "eJournal," published under the Open Access paradigm by the GSC, could be an attractive publishing forum for a broader range of standardization initiatives within, and beyond, the GSC and thereby fill an unoccupied yet increasingly important niche within the current research landscape.
Subjects
Genomics/standards , Guideline Adherence , Publications
Abstract
This meeting report summarizes the proceedings of the "eGenomics: Cataloguing our Complete Genome Collection IV" workshop held June 6-8, 2007, at the National Institute for Environmental eScience (NIEeS), Cambridge, United Kingdom. This fourth workshop of the Genomic Standards Consortium (GSC) was a mix of short presentations, strategy discussions, and technical sessions. Speakers provided progress reports on the development of the "Minimum Information about a Genome Sequence" (MIGS) specification and the closely integrated "Minimum Information about a Metagenome Sequence" (MIMS) specification. The key outcome of the workshop was consensus on the next version of the MIGS/MIMS specification (v1.2). This drove further definition and restructuring of the MIGS/MIMS XML schema (syntax). With respect to semantics, a term vetting group was established to ensure that terms are properly defined and submitted to the appropriate ontology projects. Perhaps the single most important outcome of the workshop was a proposal to move beyond the concept of "minimum" to create a far richer XML schema that would define a "Genomic Contextual Data Markup Language" (GCDML) suitable for wider semantic integration across databases. GCDML will contain not only curated information (e.g., compliant with MIGS/MIMS), but also be extended to include a variety of data processing and calculations. Further information about the Genomic Standards Consortium and its range of activities can be found at http://gensc.org.
Subjects
Genetic Databases , Genomics , Education , Programming Languages , Reference Standards
Abstract
MOTIVATION: Tandem mass-spectrometry of trypsin digests, followed by database searching, is one of the most popular approaches in high-throughput proteomics studies. Peptides are considered identified if they pass certain scoring thresholds. To avoid false positive protein identification, ≥2 unique peptides identified within a single protein are generally recommended. Still, in a typical high-throughput experiment, hundreds of proteins are identified only by a single peptide. We introduce here a method for distinguishing between true and false identifications among single-hit proteins. The approach is based on randomized database searching and usage of logistic regression models with cross-validation. This approach is implemented to analyze three bacterial samples, enabling recovery of 68-98% of the correct single-hit proteins with an error rate of <2%. This results in a 22-65% increase in the number of identified proteins. Identifying true single-hit proteins will lead to discovering many crucial regulators, biomarkers and other low abundance proteins. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Protein Databases , Information Storage and Retrieval/methods , Mass Spectrometry/methods , Peptide Mapping/methods , Proteins/analysis , Sequence Alignment/methods , Protein Sequence Analysis/methods , Algorithms , Amino Acid Sequence , Computer Simulation , Database Management Systems , Logistic Models , Chemical Models , Molecular Models , Molecular Sequence Data , Automated Pattern Recognition , Proteins/chemistry , Regression Analysis
Abstract
The identification and quantification of the proteins that a whole organism expresses under certain conditions is a main focus of high-throughput proteomics. Advanced proteomics approaches generate new biologically relevant data and potent hypotheses. A practical report of what proteome studies can and cannot accomplish in common laboratory settings is presented here. The review discusses the most popular tandem mass-spectrometry-based methods and focuses on how to produce reliable results. A step-by-step description of proteome experiments is given, including sample preparation, digestion, labeling, liquid chromatography, data processing, database searching and statistical analysis. The difficulties and bottlenecks of proteome analysis are addressed and the requirements for further improvements are discussed. Several diverse high-throughput proteomics-based studies of microorganisms are described.
Subjects
Proteins/analysis , Proteomics/methods , Electrospray Ionization Mass Spectrometry/methods , Amino Acid Sequence , Statistical Data Interpretation , Molecular Sequence Data
Abstract
Determining the error rate for peptide and protein identification accurately and reliably is necessary to enable evaluation and cross-comparisons of high-throughput proteomics experiments. Currently, peptide identification is based either on preset scoring thresholds or on probabilistic models trained on datasets that are often dissimilar to experimental results. The false discovery rates (FDR) and peptide identification probabilities for these preset thresholds or models often vary greatly across different experimental treatments, organisms, or instruments used in specific experiments. To overcome these difficulties, randomized databases have been used to estimate the FDR. However, the cumulative FDR may include low probability identifications when there are a large number of peptide identifications and exclude high probability identifications when there are few. To overcome this logical inconsistency, this study expands the use of randomized databases to generate experiment-specific estimates of peptide identification probabilities. These experiment-specific probabilities are generated by logistic and Loess regression models of the peptide scores obtained from original and reshuffled database matches. These experiment-specific probabilities are shown to closely approximate "true" probabilities based on known standard protein mixtures across different experiments. Probabilities generated by the earlier Peptide_Prophet and more recent LIPS models are shown to differ significantly from this study's experiment-specific probabilities, especially for unknown samples. The experiment-specific probabilities reliably estimate the accuracy of peptide identifications and overcome potential logical inconsistencies of the cumulative FDR. This estimation method is demonstrated using a Sequest database search, LIPS model, and a reshuffled database. However, this approach is generally applicable to any search algorithm, peptide scoring, and statistical model when using a randomized database.
Subjects
Protein Databases , Peptides/chemistry , Algorithms , Biological Models , Probability , Random Allocation , Regression Analysis , Software
Abstract
The availability of genome sequences from a variety of organisms presents an opportunity to apply this sequence information to solving the key problems of molecular biology. One of the principal roadblocks on this path is the lack of appropriate descriptors and metrics that could succinctly represent the new knowledge stemming from the genomic data. Several new metrics have recently been used in comparative genome analysis, yet challenges remain in finding an appropriate language for the emerging discipline of systems biology.