ABSTRACT
New technologies in genomics and proteomics have influenced the emergence of proteogenomics, a field at the confluence of genomics, transcriptomics, and proteomics. First-generation proteogenomic toolkits employ peptide mass spectrometry to identify novel protein-coding regions. We extend first-generation proteogenomic tools to achieve greater accuracy and enable the analysis of large, complex genomes. We apply our pipeline to Zea mays, which has a genome comparable in size to the human genome. Our pipeline begins with the comparison of mass spectra to a putative translation of the genome. We select novel peptides, those that match a region of the genome that was not previously known to be protein coding, for grouping into refinement events. We present a novel, probabilistic framework for evaluating the accuracy of each event. Our calculated event probability, or eventProb, considers the number of supporting peptides and spectra, and the quality of each supporting peptide-spectrum match. Our pipeline predicts 165 novel protein-coding genes and proposes updated models for 741 additional genes.
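The abstract does not give the formula behind eventProb, so the following is only an illustrative sketch of how several peptide-spectrum matches (PSMs) might be pooled into one event-level score. It uses a noisy-OR combination, which is a swapped-in stand-in technique and not the authors' method. The function name `event_probability`, the independence assumption, and the example probabilities are all hypothetical.

```python
from math import prod


def event_probability(psm_probabilities):
    """Noisy-OR combination of per-PSM correctness probabilities.

    ASSUMPTION (not from the abstract): supporting matches are treated as
    independent, so the event is supported if at least one match is correct.
    """
    if not all(0.0 <= p <= 1.0 for p in psm_probabilities):
        raise ValueError("each PSM probability must lie in [0, 1]")
    # Probability that every supporting match is wrong, then the complement.
    return 1.0 - prod(1.0 - p for p in psm_probabilities)


# Hypothetical per-match probabilities for one refinement event.
single = event_probability([0.9])
triple = event_probability([0.9, 0.8, 0.5])
print(f"one PSM: {single:.3f}, three PSMs: {triple:.3f}")
```

Under this toy model, adding supporting matches can only raise the score, and higher-quality matches raise it faster, which mirrors the two inputs the abstract names (amount of support and match quality) without claiming to reproduce the published calculation.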
Subjects
Genomics; Proteomics; Zea mays/genetics; Genome, Plant; Humans; Mass Spectrometry; Open Reading Frames
ABSTRACT
Database search algorithms are the primary workhorses for the identification of tandem mass spectra. However, these methods are limited to the identification of spectra for which peptides are present in the database, preventing the identification of peptides from mutated or alternatively spliced sequences. A variety of methods has been developed to search a spectrum against a sequence allowing for variations. Some tools determine the sequence of the homologous protein in the related species but do not report the peptide in the target organism. Other tools consider variations, including modifications and mutations, in reconstructing the target sequence. However, these tools will not work if the template (homologous peptide) is missing from the database, and they do not attempt to reconstruct the entire target protein sequence. De novo identification of peptide sequences is another possibility, because it does not require a protein database. However, the lack of a database reduces accuracy. We present a novel proteogenomic approach, GenoMS, that draws on the strengths of database and de novo peptide identification methods. Protein sequence templates (i.e. proteins or genomic sequences that are similar to the target protein) are identified using the database search tool InsPecT. The templates are then used to recruit, align, and de novo sequence regions of the target protein that have diverged from the database or are missing. We used GenoMS to reconstruct the full sequence of an antibody by using spectra acquired from multiple digests using different proteases. Antibodies are a prime example of proteins that confound standard database identification techniques. The mature antibody genes result from large-scale genome rearrangements with flexible fusion boundaries and somatic hypermutation. Using GenoMS we automatically reconstruct the complete sequences of two immunoglobulin chains with accuracy greater than 98% using a diverged protein database. Using the genome as the template, we achieve accuracy exceeding 97%.
Subjects
Databases, Protein; Proteomics/methods; Sequence Analysis, Protein/methods; Templates, Genetic; Algorithms; Amino Acid Sequence; Animals; Immunoglobulins/biosynthesis; Immunoglobulins/chemistry; Markov Chains; Mice; Receptors, Immunologic/chemistry; Receptors, Immunologic/metabolism; Sequence Alignment; Tandem Mass Spectrometry
ABSTRACT
A mouse hybridoma antibody directed against a member of the tumour necrosis factor (TNF)-superfamily, lymphotoxin-alpha (LT-α), was isolated from stored mouse ascites and purified to homogeneity. After more than a decade of storage the genetic material was not available for cloning; however, biochemical assays with the ascites showed this antibody against LT-α (LT-3F12) to be a preclinical candidate for the treatment of several inflammatory pathologies. We have successfully rescued the LT-3F12 antibody by performing MS analysis, primary amino acid sequence determination by template proteogenomics, and synthesis of the corresponding recombinant DNA by reverse engineering. The resurrected antibody was expressed, purified, and shown to have the desired specificity and binding properties in a panel of immuno-biochemical tests. The work described herein demonstrates the powerful combination of high-throughput informatic proteomic de novo sequencing with reverse engineering to reestablish monoclonal antibody-expressing cells from an archived protein sample, exemplifying the development of novel therapeutics from cryptic protein sources.
Subjects
Antibodies, Anti-Idiotypic/metabolism; Antibodies, Monoclonal/metabolism; Genetic Engineering; Genomics; Lymphotoxin-alpha/metabolism; Proteomics; Recombinant Proteins/metabolism; Amino Acid Sequence; Animals; Antibodies, Anti-Idiotypic/genetics; Antibodies, Anti-Idiotypic/immunology; Antibodies, Monoclonal/genetics; Antibodies, Monoclonal/immunology; Cells, Cultured; Endothelium, Vascular/cytology; Endothelium, Vascular/metabolism; Hybridomas; Lymphotoxin-alpha/genetics; Lymphotoxin-alpha/immunology; Mice; Molecular Sequence Data; Recombinant Proteins/genetics; Recombinant Proteins/immunology; Sequence Homology, Amino Acid; Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization; Umbilical Veins/cytology; Umbilical Veins/metabolism
ABSTRACT
Gene annotation underpins genome science. Most often protein coding sequence is inferred from the genome based on transcript evidence and computational predictions. While generally correct, gene models suffer from errors in reading frame, exon border definition, and exon identification. To ascertain the error rate of Arabidopsis thaliana gene models, we isolated proteins from a sample of Arabidopsis tissues and determined the amino acid sequences of 144,079 distinct peptides by tandem mass spectrometry. The peptides corresponded to 1 or more of 3 different translations of the genome: a 6-frame translation, an exon splice-graph, and the currently annotated proteome. The majority of the peptides (126,055) resided in existing gene models (12,769 confirmed proteins), comprising 40% of annotated genes. Surprisingly, 18,024 novel peptides were found that do not correspond to annotated genes. Using the gene finding program AUGUSTUS and 5,426 novel peptides that occurred in clusters, we discovered 778 new protein-coding genes and refined the annotation of an additional 695 gene models. The remaining 13,449 novel peptides provide high quality annotation (>99% correct) for thousands of additional genes. Our observation that 18,024 of 144,079 peptides did not match current gene models suggests that 13% of the Arabidopsis proteome was incomplete due to approximately equal numbers of missing and incorrect gene models.
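As an illustration of the 6-frame translation mentioned above, the following minimal sketch translates a DNA string in all three forward and three reverse-complement reading frames using the standard genetic code. It is a generic textbook construction, not the study's pipeline; the function names and the frame labels (`+1` to `-3`) are assumptions chosen for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map hypothetical frame labels '+1'..'+3' and '-1'..'-3' to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translation("ATGGCCTAA").items():
    print(label, protein)
```

A peptide identified by mass spectrometry can then be searched as a substring of each frame. A practical search database would typically also split each frame at stop codons, which this sketch leaves out.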
Subjects
Arabidopsis Proteins/genetics; Arabidopsis/genetics; Genome, Plant/genetics; Proteome/genetics; Proteomics; Software; Models, Genetic; Proteomics/methods
ABSTRACT
BACKGROUND: Proteins are known to be dynamic in nature, changing from one conformation to another while performing vital cellular tasks. It is important to understand these movements in order to better understand protein function. At the same time, experimental techniques provide us with only single snapshots of the whole ensemble of available conformations. Computational protein morphing provides a visualization of a protein structure transitioning from one conformation to another by producing a series of intermediate conformations. RESULTS: We present a novel, efficient morphing algorithm, Morph-Pro, based on linear interpolation. We also show that, apart from visualization, morphing can be used to provide plausible intermediate structures. We test this by using the intermediate structures of a c-Jun N-terminal kinase (JNK1) conformational change in a virtual docking experiment. The structures are shown to dock with higher scores to known JNK1-binding ligands than structures solved using X-ray crystallography. This experiment demonstrates the potential applications of the intermediate structures in modeling or virtual screening efforts. CONCLUSIONS: Visualization of protein conformational changes is important for characterization of protein function. Furthermore, the intermediate structures produced by our algorithm are good approximations to true structures. We believe there is great potential for these computationally predicted structures in protein-ligand docking experiments and virtual screening. The Morph-Pro web server can be accessed at http://morph-pro.bioinf.spbau.ru.
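The abstract says only that Morph-Pro is based on linear interpolation and does not specify what is interpolated. The sketch below therefore shows the simplest possible reading: straight-line interpolation of Cartesian atom coordinates between two conformations, assumed to be already superimposed and listed in the same atom order. The function name `morph` and the toy coordinates are hypothetical, and this should not be taken as the Morph-Pro algorithm itself.

```python
def morph(start, end, n_intermediates):
    """Linearly interpolate between two conformations.

    ASSUMPTIONS (not from the abstract): each conformation is a list of
    (x, y, z) tuples, the structures are pre-aligned, and atoms correspond
    one-to-one by index. Returns only the intermediate frames.
    """
    if len(start) != len(end):
        raise ValueError("conformations must contain the same number of atoms")
    frames = []
    for k in range(1, n_intermediates + 1):
        t = k / (n_intermediates + 1)  # interpolation fraction, 0 < t < 1
        frames.append([
            tuple(a + t * (b - a) for a, b in zip(p, q))
            for p, q in zip(start, end)
        ])
    return frames


# Toy two-atom "structures", purely for illustration.
conformation_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
conformation_b = [(2.0, 2.0, 2.0), (1.0, 4.0, 0.0)]
for frame in morph(conformation_a, conformation_b, 3):
    print(frame)
```

Naive Cartesian interpolation can shorten bond lengths along the path, so a practical method would need some way to keep intermediate geometry physically reasonable. How Morph-Pro handles this is not stated in the abstract.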