ABSTRACT
BACKGROUND: Germline pathogenic variants in the breast cancer type 1 susceptibility gene BRCA1 are associated with a 60% lifetime risk for breast and ovarian cancer. This overall risk estimate is for all BRCA1 variants; not all variants confer the same risk of developing disease. In cancer patients, loss of BRCA1 function in tumor tissue has been associated with an increased sensitivity to platinum agents and to poly-(ADP-ribose) polymerase (PARP) inhibitors. For clinical management of both at-risk individuals and cancer patients, it would be important that each identified genetic variant be associated with clinical significance. Unfortunately, for the vast majority of variants, the clinical impact is unknown. The availability of results from studies assessing the impact of variants on protein function may provide insight of crucial importance. RESULTS AND CONCLUSION: We have collected, curated, and structured the molecular and cellular phenotypic impact of 3654 distinct BRCA1 variants. The data was modeled in triple format, using the variant as the subject, the studied function as the object, and a predicate describing the relation between the two. Each annotation is supported by fully traceable evidence. The data was captured using standard ontologies to ensure consistency and to enhance searchability and interoperability. We have assessed the extent to which functional defects at the molecular and cellular levels correlate with the clinical interpretation of variants by ClinVar submitters. Approximately 30% of the ClinVar BRCA1 missense variants have some molecular or cellular assay available in the literature. Pathogenic variants (as assigned by ClinVar) have at least some significant functional defect in 94% of testable cases. Of the ClinVar benign variants for which the neXtProt Cancer variant portal has data, 77% show either no or mild experimental functional defects. While this does not provide evidence for clinical interpretation of variants, it may provide some guidance for variants of unknown significance, in the absence of more reliable data. The neXtProt Cancer variant portal ( https://www.nextprot.org/portals/breast-cancer ) contains over 6300 observations at the molecular and/or cellular level for BRCA1 variants.
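The triple format described in this abstract can be pictured with a minimal Python sketch. It only illustrates the subject/predicate/object idea with attached evidence; the class name, field names, and all example values below are hypothetical placeholders, not the portal's actual schema, vocabulary, or data.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Annotation:
    """One phenotypic observation modeled as a triple plus its evidence."""
    subject: str    # the variant (placeholder identifier)
    predicate: str  # relation between the variant and the function
    obj: str        # the studied function (an ontology term in practice)
    evidence: str   # traceable source of the observation (placeholder)


def to_triple(annotation: Annotation) -> Tuple[str, str, str]:
    """Return the bare (subject, predicate, object) triple."""
    return (annotation.subject, annotation.predicate, annotation.obj)


# Hypothetical example values, for illustration only.
example = Annotation(
    subject="BRCA1-variant-001",
    predicate="decreases",
    obj="example-molecular-function",
    evidence="reference-placeholder",
)
print(to_triple(example))
```

Keeping the evidence on the annotation rather than inside the triple is one possible design; it lets several observations of the same triple carry different sources.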
Subject(s)
BRCA1 Protein/genetics , Breast Neoplasms/genetics , Genetic Predisposition to Disease , Ovarian Neoplasms/genetics , Adult , Aged , BRCA1 Protein/chemistry , Breast Neoplasms/pathology , Computational Biology , Female , Genetic Variation , Germ-Line Mutation/genetics , Humans , Middle Aged , Ovarian Neoplasms/pathology , Protein Conformation
ABSTRACT
The neXtProt human protein knowledgebase (https://www.nextprot.org) continues to add new content and tools, with a focus on proteomics and genetic variation data. neXtProt now has proteomics data for over 85% of the human proteins, as well as new tools tailored to the proteomics community. Moreover, the neXtProt release 2016-08-25 includes over 8000 phenotypic observations for over 4000 variations in a number of genes involved in hereditary cancers and channelopathies. These changes are presented in the current neXtProt update. All of the neXtProt data are available via our user interface and FTP site. We also provide API access and a SPARQL endpoint for more technical applications.
Subject(s)
Databases, Protein , Proteomics , Genetic Association Studies , Genetic Variation , Humans , Internet , Phenotype , Proteomics/methods , Software , Web Browser
ABSTRACT
neXtProt (http://www.nextprot.org) is a human protein-centric knowledgebase developed at the SIB Swiss Institute of Bioinformatics. Focused solely on human proteins, neXtProt aims to provide a state of the art resource for the representation of human biology by capturing a wide range of data, precise annotations, fully traceable data provenance and a web interface which enables researchers to find and view information in a comprehensive manner. Since the introductory neXtProt publication, significant advances have been made on three main aspects: the representation of proteomics data, an extended representation of human variants and the development of an advanced search capability built around semantic technologies. These changes are presented in the current neXtProt update.
Subject(s)
Databases, Protein , Genetic Variation , Proteins/genetics , Proteomics , Cell Line , Disease/genetics , Humans , Internet , Proteome
ABSTRACT
MOTIVATION: Lipids are a large and diverse group of biological molecules with roles in membrane formation, energy storage and signaling. Cellular lipidomes may contain tens of thousands of structures, a staggering degree of complexity whose significance is not yet fully understood. High-throughput mass spectrometry-based platforms provide a means to study this complexity, but the interpretation of lipidomic data and its integration with prior knowledge of lipid biology suffer from a lack of appropriate tools to manage the data and extract knowledge from it. RESULTS: To facilitate the description and exploration of lipidomic data and its integration with prior biological knowledge, we have developed a knowledge resource for lipids and their biology: SwissLipids. SwissLipids provides curated knowledge of lipid structures and metabolism which is used to generate an in silico library of feasible lipid structures. These are arranged in a hierarchical classification that links mass spectrometry analytical outputs to all possible lipid structures, metabolic reactions and enzymes. SwissLipids provides a reference namespace for lipidomic data publication, data exploration and hypothesis generation. The current version of SwissLipids includes over 244 000 known and theoretically possible lipid structures, over 800 proteins, and curated links to published knowledge from over 620 peer-reviewed publications. We are continually updating the SwissLipids hierarchy with new lipid categories and new expert curated knowledge. AVAILABILITY: SwissLipids is freely available at http://www.swisslipids.org/. CONTACT: alan.bridge@isb-sib.ch SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Computational Biology/methods , Databases, Factual , Knowledge Bases , Lipid Metabolism , Lipids/chemistry , Lipids/physiology , Mass Spectrometry/methods , Humans , Lipids/analysis
ABSTRACT
neXtProt (http://www.nextprot.org/) is a new human protein-centric knowledge platform. Developed at the Swiss Institute of Bioinformatics (SIB), it aims to help researchers answer questions relevant to human proteins. To achieve this goal, neXtProt is built on a corpus containing both curated knowledge originating from the UniProtKB/Swiss-Prot knowledgebase and carefully selected and filtered high-throughput data pertinent to human proteins. This article presents an overview of the database and the data integration process. We also lay out the key future directions of neXtProt that we consider the necessary steps to make neXtProt the one-stop-shop for all research projects focusing on human proteins.
Subject(s)
Databases, Protein , Humans , Knowledge Bases , Proteins/genetics , Proteins/metabolism , User-Computer Interface
ABSTRACT
About 5000 (25%) of the ~20400 human protein-coding genes currently lack any experimental evidence at the protein level. For many others, there is only limited information on their abundance, distribution, subcellular localization, interactions, or cellular functions. The aim of the HUPO Human Proteome Project (HPP, www.thehpp.org ) is to collect this information for every human protein. HPP is based on three major pillars: mass spectrometry (MS), antibody/affinity capture reagents (Ab), and a bioinformatics-driven knowledge base (KB). To meet this objective, the Chromosome-Centric Human Proteome Project (C-HPP) proposes to build this catalog chromosome-by-chromosome ( www.c-hpp.org ) by focusing primarily on proteins that currently lack MS evidence or Ab detection. These are termed "missing proteins" by the HPP consortium. The lack of observation of a protein can be due to various factors including incorrect and incomplete gene annotation, low or restricted expression, or instability. neXtProt ( www.nextprot.org ) is a new web-based knowledge platform specific for human proteins that aims to complement UniProtKB/Swiss-Prot ( www.uniprot.org ) with detailed information obtained from carefully selected high-throughput experiments on genomic variation, post-translational modifications, as well as protein expression in tissues and cells. This article describes how neXtProt helps prioritize C-HPP efforts and integrates C-HPP results with other research efforts to create a complete human proteome catalog.
Subject(s)
Databases, Protein , Proteins , Proteome , Chromosomes, Human , Computational Biology , Genome, Human , Humans , Internet , Knowledge Bases , Mass Spectrometry , Protein Processing, Post-Translational , Proteins/genetics , Proteins/metabolism
ABSTRACT
Viral infections are the leading cause of childhood acute febrile illnesses motivating consultation in sub-Saharan Africa. The majority of causal viruses are never identified in low-resource clinical settings, as such testing is either not part of routine screening or available diagnostic tools have limited ability to detect new/unexpected viral variants. An in-depth exploration of the blood virome is therefore necessary to clarify the potential viral origin of fever in children. Metagenomic next-generation sequencing is a powerful tool for such broad investigations, allowing the detection of RNA and DNA viral genomes. Here, we describe the blood virome of 816 febrile children (<5 years) presenting at outpatient departments in Dar es Salaam over one year. We show that half of the patients (394/816) had at least one detected virus recognized as a cause of human infection/disease (13.8% enteroviruses (enterovirus A, B, C, and rhinovirus A and C), 12% rotaviruses, 11% human herpesvirus type 6). Additionally, we report the detection of a large number of viruses (related to arthropod, vertebrate or mammalian viral species) not yet known to cause human infection/disease, highlighting those that should be on the radar and deserve specific attention in the febrile paediatric population and, more broadly, in the surveillance of emerging pathogens. Trial registration: ClinicalTrials.gov identifier: NCT02225769.
Subject(s)
Fever/virology , Metagenomics/methods , Virus Diseases/blood , Viruses/classification , Child, Preschool , High-Throughput Nucleotide Sequencing , Humans , Infant , Infant, Newborn , Retrospective Studies , Sequence Analysis, DNA , Sequence Analysis, RNA , Tanzania , Virus Diseases/virology , Viruses/genetics , Viruses/isolation & purification
ABSTRACT
The huge genetic diversity of circulating viruses is a challenge for diagnostic assays for emerging or rare viral diseases. High-throughput technology offers a new opportunity to explore the global virome of patients without preconception about the culpable pathogens. It requires a solid reference dataset to be accurate. Virosaurus has been designed to offer an unbiased, automated and annotated database for clinical metagenomics studies and diagnosis. Raw viral sequences have been extracted from GenBank and cleaned up to remove potentially erroneous sequences. Complete sequences have been identified for all genera infecting vertebrates, plants and other eukaryotes (insect, fungus, etc.). To facilitate the analysis of clinically relevant viruses, we have annotated all sequences with official and common virus names, acronyms, genotypes, and genomic features (linear, circular, DNA, RNA, etc.). Sequences have been clustered to remove redundancy at 90% or 98% identity. The analysis of clustering results reveals the state of knowledge of the virus genetic landscape. Because herpesviruses and poxviruses were under-represented in complete genomes considering their potential diversity in nature, we used genes instead of complete genomes for those in Virosaurus.
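Redundancy removal by clustering at an identity threshold can be sketched as follows. This is a simplified greedy illustration in Python, not the procedure used to build Virosaurus: the identity measure here is a naive position-by-position comparison without alignment, and the function names and toy sequences are assumptions made for the example.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions over the longer length (no alignment)."""
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: List[str], threshold: float) -> List[List[str]]:
    """Assign each sequence to the first cluster whose representative
    (its first member) is at least `threshold` identical, longest first."""
    clusters: List[List[str]] = []
    for seq in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            # No sufficiently similar representative: start a new cluster.
            clusters.append([seq])
    return clusters


toy = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]  # toy sequences, not real data
print(len(greedy_cluster(toy, 0.90)))  # the first two differ at one of ten positions
print(len(greedy_cluster(toy, 0.98)))
```

With the toy input, the 90% threshold groups the first two sequences together, whereas the 98% threshold keeps all three apart, which shows how the stricter threshold retains more diversity.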
Subject(s)
Databases, Nucleic Acid , Genetic Variation , Sequence Analysis, DNA , Viruses/genetics , Computational Biology , Genome, Viral , Metagenomics , Phylogeny , Viruses/classification
ABSTRACT
The development of efficient text-mining tools promises to boost the curation workflow by significantly reducing the time needed to process the literature into biological databases. We have developed a curation support tool, neXtA5, that provides a search engine coupled with an annotation system directly integrated into a biocuration workflow. neXtA5 assists curation with modules optimized for the various curation tasks: document triage, entity recognition and information extraction. Here, we describe the evaluation of neXtA5 by expert curators. We first assessed the annotations of two independent curators to provide a baseline for comparison. To evaluate the performance of neXtA5, we submitted requests and compared the neXtA5 results with the manual curation. The analysis focuses on the usability of neXtA5 to support the curation of two types of data: biological processes (BPs) and diseases (Ds). We evaluated the relevance of the papers proposed as well as the recall and precision of the suggested annotations. The evaluation of the precision of document triage by neXtA5 showed that both curators agree with neXtA5 for 67% (BP) and 63% (D) of abstracts, while curators agree on accepting or rejecting an abstract ~80% of the time. Hence, the precision of the triage system is satisfactory. For concept extraction, curators approved 35% (BP) and 25% (D) of the neXtA5 annotations. Conversely, neXtA5 successfully annotated up to 36% (BP) and 68% (D) of the terms identified by curators. The user feedback obtained in these tests highlighted the need for improvement in the ranking function of neXtA5 annotations. Therefore, we transformed the information extraction component into an annotation ranking system. This improvement results in a top precision (precision at first rank) of 59% (D) and 63% (BP). These results suggest that when considering only the first extracted entity, the current system achieves a precision comparable with expert biocurators.
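For readers unfamiliar with the metrics reported in this abstract, the following Python sketch shows one common way to compute precision, recall, and precision at first rank from suggested and curator-approved annotations. The function names and toy inputs are assumptions for illustration; the abstract does not describe the exact evaluation code used for neXtA5.

```python
from typing import List, Sequence, Set, Tuple


def precision_recall(suggested: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Precision and recall of suggested annotations against a curator's set."""
    true_positives = len(suggested & reference)
    precision = true_positives / len(suggested) if suggested else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


def precision_at_first_rank(ranked: Sequence[List[str]],
                            references: Sequence[Set[str]]) -> float:
    """Share of documents whose top-ranked suggestion is in the curator's set.
    Documents with no suggestion at all are left out of the denominator."""
    scored = [(r[0], ref) for r, ref in zip(ranked, references) if r]
    if not scored:
        return 0.0
    hits = sum(1 for top, ref in scored if top in ref)
    return hits / len(scored)


# Toy inputs: four suggested terms, three curator terms, two in common.
p, r = precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"})
print(p, r)
```

Excluding documents without any suggestion from the first-rank denominator is a design choice of this sketch; counting them as misses would be an equally defensible convention.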
Subject(s)
Data Curation/methods , Data Mining/methods , Databases, Factual , Software , Humans
ABSTRACT
Database URL: http://candy.hesge.ch/nextA5.
Subject(s)
Computational Biology/methods , Data Curation/methods , Data Mining/methods , Databases, Protein , Protein Interaction Mapping/methods , User-Computer Interface
ABSTRACT
Combining epidemiological information, genetic characterization and geomapping in the analysis of influenza can contribute to a better understanding and description of influenza epidemiology and ecology, including possible virus reassortment events. Furthermore, integration of information such as agroecological farming system characteristics can provide new knowledge on risk factors of influenza emergence and spread. Integrating viral characteristics into an animal disease information system is therefore expected to provide a unique tool to trace-and-track particular virus strains; generate clade distributions and spatiotemporal clusters; screen for distribution of viruses with specific molecular markers; identify potential risk factors; and analyze or map viral characteristics related to vaccines used for control and/or prevention. For this purpose, a genetic module was developed within EMPRES-i (FAO's global animal disease information system) linking epidemiological information from influenza events with virus characteristics and enabling combined analysis. An algorithm was developed to act as the interface between EMPRES-i disease event data and publicly available influenza virus sequences in OpenfluDB. This algorithm automatically computes potential links between outbreak events and sequences, which are subsequently manually validated by experts. Other virus characteristics such as antiviral resistance can then be associated with outbreak data. To visualize such characteristics on a geographic map, shape files with virus characteristics to overlay on other EMPRES-i map layers (e.g. animal densities) can be generated. The genetic module allows export of associated epidemiological and sequence data for further analysis. FAO has made this tool available for scientists and policy makers. Contributions are expected from users to improve and validate the number of linked influenza events and isolate information as well as the quality of information. Possibilities to interconnect with other influenza sequence databases or to expand the genetic module to other viral diseases (e.g. foot and mouth disease) are being explored. Database OpenfluDB URL: http://openflu.vital-it.ch Database EMPRES-i URL: http://EMPRES-i.fao.org/.
Subject(s)
Algorithms , Computational Biology/methods , Databases, Genetic , Disease Outbreaks , Influenza, Human/epidemiology , Influenza, Human/virology , Orthomyxoviridae/genetics , Humans , Influenza, Human/genetics , Orthomyxoviridae/isolation & purification , Reproducibility of Results
ABSTRACT
Although research on influenza has been ongoing for more than 100 years, it is still one of the most prominent diseases, causing half a million human deaths every year. With the recent observation of new highly pathogenic H5N1 and H7N7 strains, and the appearance of the influenza pandemic caused by the H1N1 swine-like lineage, a collaborative effort to share observations on the evolution of this virus in both animals and humans has been established. The OpenFlu database (OpenFluDB) is a part of this collaborative effort. It contains genomic and protein sequences, as well as epidemiological data from more than 27,000 isolates. The isolate annotations include virus type, host, geographical location and experimentally tested antiviral resistance. Putative enhanced pathogenicity as well as human adaptation propensity are computed from protein sequences. Each virus isolate can be associated with the laboratories that collected, sequenced and submitted it. Several analysis tools including multiple sequence alignment, phylogenetic analysis and sequence similarity maps enable rapid and efficient mining. The contents of OpenFluDB are supplied by direct user submission, as well as by a daily automatic procedure importing data from public repositories. Additionally, a simple mechanism facilitates the export of OpenFluDB records to GenBank. This resource has been successfully used to rapidly and widely distribute the sequences collected during the recent human swine flu outbreak and also as an exchange platform during the vaccine selection procedure. Database URL: http://openflu.vital-it.ch.
Subject(s)
Databases, Genetic , Orthomyxoviridae/genetics , Animals , Birds , Evolution, Molecular , Genome, Viral , Humans , Influenza A Virus, H1N1 Subtype/genetics , Influenza A Virus, H1N1 Subtype/isolation & purification , Influenza in Birds/virology , Influenza, Human/virology , Orthomyxoviridae/isolation & purification , Orthomyxoviridae Infections/virology , Phylogeny , Sequence Alignment/statistics & numerical data , Swine , User-Computer Interface
ABSTRACT
In a previous paper we introduced a novel model-based approach (OLAV) to the problem of identifying peptides via tandem mass spectrometry, for which early implementations showed promising performance. We recently further improved this performance to a remarkable level (1-2% false positive rate at 95% true positive rate) and characterized key properties of OLAV such as robustness and training set size. We present these results in a synthetic and coherent way along with detailed performance comparisons, a new scoring component making use of peptide amino acid composition, and new developments such as automatic parameter learning. Finally, we discuss the impact of OLAV on the automation of proteomics projects.
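A figure such as "false positive rate at 95% true positive rate" corresponds to a single operating point on a ROC curve. The Python sketch below shows one generic way to read off that point from scored correct and incorrect identifications; it does not reproduce the OLAV scoring model, and the function name and toy scores are assumptions for illustration.

```python
from typing import Sequence


def fpr_at_tpr(pos_scores: Sequence[float],
               neg_scores: Sequence[float],
               target_tpr: float = 0.95) -> float:
    """False positive rate at the highest score threshold that still accepts
    at least `target_tpr` of the true (positive) identifications."""
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    if not 0.0 < target_tpr <= 1.0:
        raise ValueError("target_tpr must be in (0, 1]")
    ranked = sorted(pos_scores, reverse=True)
    n = len(ranked)
    # Smallest number of accepted positives that reaches the target rate.
    k = next(i for i in range(1, n + 1) if i / n >= target_tpr)
    threshold = ranked[k - 1]
    false_positives = sum(1 for s in neg_scores if s >= threshold)
    return false_positives / len(neg_scores)


# Toy scores, not real data: ten correct and ten incorrect matches.
correct = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
incorrect = [0, 0, 1, 1, 3, 0, 0, 0, 0, 0]
print(fpr_at_tpr(correct, incorrect, target_tpr=0.9))
```

In the toy case the threshold falls at a score of 2, so nine of ten correct matches are accepted and only one of ten incorrect matches passes.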
Subject(s)
Peptides/chemistry , Proteomics/methods , Algorithms , Automation , False Positive Reactions , Humans , Mass Spectrometry/methods , ROC Curve , Spectrometry, Mass, Electrospray Ionization/methods , Time Factors
ABSTRACT
Human blood plasma is a useful source of proteins associated with both health and disease. Analysis of human blood plasma is a challenge due to the large number of peptides and proteins present and the very wide range of concentrations. In order to identify as many proteins as possible for subsequent comparative studies, we developed an industrial-scale (2.5 liter) approach involving sample pooling for the analysis of smaller proteins (M(r) generally < ca. 40 000 and some fragments of very large proteins). Plasma from healthy males was depleted of abundant proteins (albumin and IgG), then smaller proteins and polypeptides were separated into 12 960 fractions by chromatographic techniques. Analysis of proteins and polypeptides was performed by mass spectrometry prior to and after enzymatic digestion. Thousands of peptide identifications were made, permitting the identification of 502 different proteins and polypeptides from a single pool, 405 of which are listed here. The numbers refer to chromatographically separable polypeptide entities present prior to digestion. Combining results from studies with other plasma pools we have identified over 700 different proteins and polypeptides in plasma. Relatively low abundance proteins such as leptin and ghrelin and peptides such as bradykinin, all invisible to two-dimensional gel technology, were clearly identified. Proteins of interest were synthesized by chemical methods for bioassays. We believe that this is the first time that the small proteins in human blood plasma have been separated and analyzed so extensively.
Subject(s)
Blood Chemical Analysis/methods , Blood Proteins/metabolism , Plasma/metabolism , Proteomics/methods , Amino Acid Sequence , Chromatography , Chromatography, Gel , Chromatography, High Pressure Liquid , Chromatography, Ion Exchange , Computational Biology , Databases as Topic , Electrophoresis, Gel, Two-Dimensional/methods , Humans , Immunoglobulin G/chemistry , Mass Spectrometry , Molecular Sequence Data , Peptides/chemistry , Proteome , Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization , Subcellular Fractions , Time Factors , Trypsin/pharmacology
ABSTRACT
We present an integrated proteomics platform designed for performing differential analyses. Since reproducible results are essential for comparative studies, we explain how we improved reproducibility at every step of our laboratory processes, e.g. by taking advantage of the powerful laboratory information management system we developed. The differential capacity of our platform is validated by detecting known markers in a real sample and by a spiking experiment. We introduce an innovative two-dimensional (2-D) plot for displaying identification results combined with chromatographic data. This 2-D plot is very convenient for detecting differential proteins. We also adapt standard multivariate statistical techniques to show that peptide identification scores can be used for reliable and sensitive differential studies. The value of the protein separation approach we generally apply is supported by numerous statistics, complemented by a comparison with a simple shotgun analysis performed on a small volume sample. By introducing an automatic integration step after mass spectrometry data identification, we are able to search numerous databases systematically, including the human genome and expressed sequence tags. Finally, we explain how rigorous data processing can be combined with the work of human experts to set high quality standards, and hence obtain reliable (false positive < 0.35%) and nonredundant protein identifications.