ABSTRACT
The Proteomics Standards Initiative has recently released the mzIdentML data standard for representing peptide and protein identification results, for example, created by a search engine. When a new standard format is produced, it is important that software tools are available that make it straightforward for laboratory scientists to use it routinely and for bioinformaticians to embed support in their own tools. Here we report the release of several open-source Java-based software packages based on mzIdentML: ProteoIDViewer, mzidLibrary, and mzidValidator. The ProteoIDViewer is a desktop application allowing users to visualize mzIdentML-formatted results originating from any appropriate identification software; it supports visualization of all the features of the mzIdentML format. The mzidLibrary is a software library containing routines for importing data from external search engines, post-processing identification data (such as false discovery rate calculations), combining results from multiple search engines, performing protein inference, setting identification thresholds, and exporting results from mzIdentML to plain text files. The mzidValidator is able to process files and report warnings or errors if files are not correctly formatted or contain some semantic error. We anticipate that these developments will simplify adoption of the new standard in proteomics laboratories and the integration of mzIdentML into other software tools. All three tools are freely available in the public domain.
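The false discovery rate calculation mentioned above is often done with a target-decoy approach. The following is a minimal sketch of that general idea, not the mzidLibrary implementation (which is Java-based and not described in the abstract). It assumes each identification is a `(score, is_decoy)` pair where a higher score is better; the function name `target_decoy_fdr` is hypothetical.

```python
from typing import List, Tuple


def target_decoy_fdr(psms: List[Tuple[float, bool]]) -> List[Tuple[float, bool, float]]:
    """Return (score, is_decoy, q_value) tuples sorted by descending score.

    Assumption: FDR at a given rank is estimated as decoys / targets
    among all identifications scoring at least as well.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Running FDR estimate at each rank.
    fdrs = []
    targets = decoys = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at this rank or any lower-scoring rank,
    # which makes the values monotonic in the score.
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(s, d, q) for (s, d), q in zip(ranked, q_values)]
```

An identification threshold can then be set by keeping only target entries whose q-value is at or below a chosen cutoff, for example 0.01.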
Subjects
Peptides/chemistry, Proteins/chemistry, Proteomics/statistics & numerical data, Software, Proteomics/standards, Search Engine
ABSTRACT
Determining the list of proteins present in a sample, based on the list of identified peptides, is a crucial step in the untargeted proteomics LC-MS/MS data-processing pipeline. This step, commonly referred to as protein inference, turns out to be a very challenging problem because many peptide sequences are found across multiple proteins. Current protein inference engines typically use peptide-to-spectrum match (PSM) quality measures and spectral count information to score protein identifications in LC-MS/MS data sets. This is, however, not enough to confidently validate or otherwise rule out many of the proteins. Here we introduce the basis for a new way of performing protein inference based on accurate quantification patterns of identified peptides, using the correlation of these patterns to validate peptide-to-protein matches. For the first implementation of this new approach, we focused on (1) distinguishing between unambiguously and ambiguously identified proteins and (2) generating hypotheses for the discrimination of subsets of the ambiguously identified proteins. Our preprocessing pipelines support both labeled LC-MS/MS and label-free LC-MS followed by LC-MS/MS, either of which provides the peptide quantification. We apply our procedure to two published data sets and show that it is able to detect and infer proteins that would otherwise not be confidently inferred.
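The idea of using correlated quantification patterns to support peptide-to-protein matches can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify; it assumes each peptide has an abundance profile across the same ordered set of samples, and it scores a shared peptide against each candidate protein by its mean Pearson correlation with that protein's unique peptides. The names `pearson` and `score_shared_peptide` are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def score_shared_peptide(
    shared_profile: Sequence[float],
    unique_profiles: Dict[str, List[Sequence[float]]],
) -> Dict[str, float]:
    """Map each candidate protein to the mean correlation between the shared
    peptide's profile and the profiles of that protein's unique peptides.

    Proteins with no unique peptides are skipped, since this sketch has
    no reference pattern to compare against for them.
    """
    scores = {}
    for protein, profiles in unique_profiles.items():
        if profiles:
            total = sum(pearson(shared_profile, p) for p in profiles)
            scores[protein] = total / len(profiles)
    return scores
```

A shared peptide whose profile tracks the unique peptides of one candidate protein but not another gives a hypothesis about which protein it more likely derives from; any cutoff on the correlation would need to be chosen and validated separately.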
Subjects
Peptide Mapping/methods, Software, Amino Acid Sequence, Humans, Molecular Sequence Data, Proteomics, Tandem Mass Spectrometry
ABSTRACT
International cancer registries make real-world genomic and clinical data available, but their joint analysis remains a challenge. AACR Project GENIE, an international cancer registry collecting data from 19 cancer centers, makes data from >130,000 patients publicly available through the cBioPortal for Cancer Genomics (https://genie.cbioportal.org). For 25,000 patients, additional real-world longitudinal clinical data, including treatment and outcome data, are being collected by the AACR Project GENIE Biopharma Collaborative using the PRISSMM data curation model. Several thousand of these cases are now also available in cBioPortal. We have significantly enhanced the functionalities of cBioPortal to support the visualization and analysis of this rich clinico-genomic linked dataset, as well as datasets generated by other centers and consortia. Examples of these enhancements include (i) visualization of the longitudinal clinical and genomic data at the patient level, including timelines for diagnoses, treatments, and outcomes; (ii) the ability to select samples based on treatment status, facilitating a comparison of molecular and clinical attributes between samples before and after a specific treatment; and (iii) survival analysis estimates based on individual treatment regimens received. Together, these features provide cBioPortal users with a toolkit to interactively investigate complex clinico-genomic data to generate hypotheses and make discoveries about the impact of specific genomic variants on prognosis and therapeutic sensitivities in cancer. SIGNIFICANCE: Enhanced cBioPortal features allow clinicians and researchers to effectively investigate longitudinal clinico-genomic data from patients with cancer, which will improve exploration of data from the AACR Project GENIE Biopharma Collaborative and similar datasets.
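Survival estimates of the kind described in (iii) are commonly computed with the Kaplan-Meier product-limit estimator. The abstract does not name the method, so treating it as Kaplan-Meier is an assumption here, and the sketch below is an illustration rather than cBioPortal's code. It assumes one follow-up duration per patient and a flag that is true when the event was observed and false when the patient was censored; the function name `kaplan_meier` is hypothetical.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(
    durations: Sequence[float], events: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time."""
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            removed += 1
            i += 1
        if deaths:
            # Product-limit step: multiply by the fraction surviving past t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after t.
        at_risk -= removed
    return curve
```

To compare treatment regimens, the same function would be applied separately to the patients in each regimen group and the resulting curves compared.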
Subjects
Genomics, Neoplasms, Humans, Neoplasms/genetics, Neoplasms/therapy, Precision Medicine
ABSTRACT
PURPOSE: Interpretation of genomic variants in tumor samples still presents a challenge in research and the clinical setting. A major issue is that information for variant interpretation is fragmented across disparate databases, and aggregation of information from these requires building extensive infrastructure. To this end, we have developed Genome Nexus, a one-stop shop for variant annotation with a user-friendly interface for cancer researchers and clinicians. METHODS: Genome Nexus (1) aggregates variant information from sources that are relevant to cancer research and clinical applications, (2) allows high-performance programmatic access to the aggregated data via a unified application programming interface, (3) provides a reference page for individual cancer variants, (4) provides user-friendly tools for annotating variants in patients, and (5) is freely available under an open source license and can be installed in a private cloud or local environment and integrated with local institutional resources. RESULTS: Genome Nexus is available at https://www.genomenexus.org. It displays annotations from more than a dozen resources including those that provide variant effect information (Variant Effect Predictor), protein sequence annotation (UniProt, Pfam, and dbPTM), functional consequence prediction (PolyPhen-2, Mutation Assessor, and SIFT), population prevalences (gnomAD, dbSNP, and ExAC), cancer population prevalences (Cancer Hotspots and SignalDB), and clinical actionability (OncoKB, CIViC, and ClinVar). We describe several use cases that demonstrate the utility of Genome Nexus to clinicians, researchers, and bioinformaticians. We cover single-variant annotation, cohort analysis, and programmatic use of the application programming interface. Genome Nexus is unique in providing a user-friendly interface specific to cancer that allows high-performance annotation of any variant including unknown ones. CONCLUSION: Interpretation of cancer genomic variants is improved tremendously by having an integrated resource for annotations. Genome Nexus is freely available under an open source license.