1.
Bioinformatics ; 40(Supplement_1): i257-i265, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940141

ABSTRACT

MOTIVATION: Tandem mass spectrometry (MS/MS) is a crucial technology for large-scale proteomic analysis. Protein database search and spectral library search are commonly used for peptide identification from MS/MS spectra; both, however, may face challenges due to experimental variations between replicated spectra and similar fragmentation patterns among distinct peptides. To address these challenges, we present SpecEncoder, a deep metric learning approach that transforms MS/MS spectra into robust and sensitive embedding vectors in a latent space. The SpecEncoder model can also embed predicted MS/MS spectra of peptides, enabling a hybrid search approach that combines spectral library and protein database searches for peptide identification. RESULTS: We evaluated SpecEncoder on three large human proteomics datasets, and the results showed a consistent improvement in peptide identification. For spectral library search, SpecEncoder identifies 1%-2% more unique peptides (and PSMs) than SpectraST. For protein database search, it identifies 6%-15% more unique peptides than MSGF+ enhanced by Percolator. Furthermore, SpecEncoder identified 6%-12% additional unique peptides when utilizing a combined library of experimental and predicted spectra. SpecEncoder can also identify more peptides when compared to deep-learning enhanced methods (MSFragger boosted by MSBooster). These results demonstrate SpecEncoder's potential to enhance peptide identification for proteomic data analyses. AVAILABILITY AND IMPLEMENTATION: The source code and scripts for SpecEncoder and peptide identification are available on GitHub at https://github.com/lkytal/SpecEncoder. Contact: hatang@iu.edu.


Subjects
Databases, Protein; Peptides; Proteomics; Tandem Mass Spectrometry; Proteomics/methods; Peptides/chemistry; Humans; Tandem Mass Spectrometry/methods; Deep Learning; Software
2.
Anal Chem ; 96(6): 2351-2359, 2024 Feb 13.
Article in English | MEDLINE | ID: mdl-38308813

ABSTRACT

The accurate prediction of suitable chiral stationary phases (CSPs) for resolving the enantiomers of a given compound poses a significant challenge in chiral chromatography. Previous attempts at developing machine learning models for structure-based CSP prediction have primarily relied on 1D SMILES strings [the simplified molecular-input line-entry system (SMILES) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings] or 2D graphical representations of molecular structures and have met with only limited success. In this study, we apply the recently developed 3D molecular conformation representation learning algorithm, which uses rapid conformational analysis and point clouds of atom positions in the 3D space, enabling efficient chemical structure-based machine learning. By harnessing the power of the rapid 3D molecular representation learning and a data set comprising over 300,000 chromatographic enantioseparation records sourced from the literature, our models afford notable improvements for the chemical structure-based choice of appropriate CSP for enantioseparation, paving the way for more efficient and informed decision-making in the field of chiral chromatography.

3.
Bioinformatics ; 39(6)2023 06 01.
Article in English | MEDLINE | ID: mdl-37252828

ABSTRACT

MOTIVATION: Tandem mass spectrometry is an essential technology for characterizing chemical compounds at high sensitivity and throughput, and is commonly adopted in many fields. However, computational methods for automated compound identification from MS/MS spectra are still limited, especially for novel compounds that have not been previously characterized. In recent years, in silico methods were proposed to predict the MS/MS spectra of compounds, which can then be used to expand the reference spectral libraries for compound identification. However, these methods did not consider the compounds' 3D conformations, and thus neglected critical structural information. RESULTS: We present the 3D Molecular Network for Mass Spectra Prediction (3DMolMS), a deep neural network model to predict the MS/MS spectra of compounds from their 3D conformations. We evaluated the model on the experimental spectra collected in several spectral libraries. The results showed that the spectra predicted by 3DMolMS had average cosine similarities of 0.691 and 0.478 with the experimental MS/MS spectra acquired in positive and negative ion modes, respectively. Furthermore, the 3DMolMS model can be generalized to the prediction of MS/MS spectra acquired by different labs on different instruments through minor fine-tuning on a small set of spectra. Finally, we demonstrate that the molecular representation learned by 3DMolMS from MS/MS spectra prediction can be adapted to enhance the prediction of chemical properties such as the elution time in liquid chromatography and the collisional cross section measured by ion mobility spectrometry, both of which are often used to improve compound identification. AVAILABILITY AND IMPLEMENTATION: The code of 3DMolMS is available at https://github.com/JosieHong/3DMolMS and the web service is at https://spectrumprediction.gnps2.org.


Subjects
Tandem Mass Spectrometry; Tandem Mass Spectrometry/methods; Chromatography, Liquid/methods; Molecular Conformation
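
The abstract above reports agreement between predicted and experimental spectra as a cosine similarity. A minimal sketch of that metric follows, assuming spectra are given as lists of (m/z, intensity) pairs and are compared after summing intensities into fixed-width m/z bins. The bin width of 1.0 and the function names are illustrative assumptions, not the settings or code used by 3DMolMS.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (bin_width is an assumed setting)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Cosine of the angle between two binned intensity vectors; 0.0 if either is empty."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    # Only bins present in both spectra contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy spectra as (m/z, intensity) pairs.
predicted = [(100.2, 50.0), (150.7, 100.0)]
experimental = [(100.4, 40.0), (150.9, 80.0), (200.1, 10.0)]
score = cosine_similarity(predicted, experimental)
```

A score of 1.0 means the two binned spectra have identical relative intensities; a score of 0.0 means they share no occupied bin.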
4.
J Proteome Res ; 22(5): 1501-1509, 2023 05 05.
Article in English | MEDLINE | ID: mdl-36802412

ABSTRACT

Liquid chromatography coupled with tandem mass spectrometry is commonly adopted in large-scale glycoproteomic studies involving hundreds of disease and control samples. The software for glycopeptide identification in such data (e.g., the commercial software Byonic) analyzes each data set individually and does not exploit the redundant spectra of glycopeptides present in related data sets. Herein, we present a novel concurrent approach for glycopeptide identification in multiple related glycoproteomic data sets by using spectral clustering and spectral library searching. The evaluation on two large-scale glycoproteomic data sets showed that the concurrent approach can identify 105%-224% more spectra as glycopeptides compared to glycopeptide identification on individual data sets using Byonic alone. The improved glycopeptide identification also enabled the discovery of several potential biomarkers of protein glycosylation in hepatocellular carcinoma patients.


Subjects
Liver Neoplasms; Tandem Mass Spectrometry; Humans; Tandem Mass Spectrometry/methods; Glycopeptides/analysis; Chromatography, Liquid; Software
5.
Biochem Biophys Res Commun ; 671: 10-17, 2023 09 03.
Article in English | MEDLINE | ID: mdl-37290279

ABSTRACT

α-Amylase plays a crucial role in regulating metabolism and health by hydrolyzing starch and glycogen. Despite comprehensive studies of this classic enzyme spanning over a century, the function of its carboxyl terminal domain (CTD), which has eight conserved β-strands, is still not fully understood. Amy63, identified from a marine bacterium, was reported as a novel multifunctional enzyme with amylase, agarase and carrageenase activities. In this study, the crystal structure of Amy63 was determined at 1.8 Å resolution, revealing high conservation with some other amylases. Interestingly, the independent amylase activity of the carboxyl terminal domain of Amy63 (Amy63_CTD) was newly discovered by the plate-based assay and mass spectrometry. To date, Amy63_CTD alone could be regarded as the smallest amylase subunit. Moreover, significant amylase activity of Amy63_CTD was measured over a wide range of temperature and pH, with optimal activity at 60 °C and pH 7.5. Small-angle X-ray scattering (SAXS) data showed that a high-order oligomeric assembly gradually formed with increasing concentration of Amy63_CTD, implying a novel catalytic mechanism as revealed by the assembly structure. Therefore, the discovery of the novel independent amylase activity of Amy63_CTD suggests a possible missing step or a new perspective in the complex catalytic process of Amy63 and other related α-amylases. This work may shed light on the design of nanozymes to process marine polysaccharides efficiently.


Subjects
Amylases; alpha-Amylases; Scattering, Small Angle; X-Ray Diffraction; alpha-Amylases/chemistry; alpha-Amylases/metabolism; Starch/metabolism; Hydrogen-Ion Concentration
6.
Anal Chem ; 94(28): 10003-10010, 2022 07 19.
Article in English | MEDLINE | ID: mdl-35776110

ABSTRACT

Glycosylation is a post-translational modification involved in many important biological functions. Aberrant alteration of glycan structure is implicated in the malfunction of cells and has potential significance in the medical diagnosis of complex diseases such as cancer. Liquid chromatography tandem mass spectrometry (LC-MS/MS) has been commonly applied to the analysis of complex glycomic samples. However, the characterization of isomeric glycans from their MS/MS spectra in complex biological samples remains challenging. In this paper, we present a novel reciprocal best-hit glycan-spectrum matching (RB-GSM) approach toward characterizing N-glycans. In this method, the MS/MS spectra in the input data set are evaluated against all glycans with the matched precursor mass using customized scoring functions, where a glycan-spectrum matching (GSM) is considered to be true if it is a reciprocal best-hit, that is, it receives the highest score among not only the GSMs between the respective spectrum and all matched glycans, but also the GSMs between the respective glycan and all matched MS/MS spectra in the input data set. We evaluated this RB-GSM approach on N-glycan identification using MS/MS spectra acquired from glycan standards as well as those released from the model glycoprotein fetuin, immunoglobulin G, and human serum samples, which showed that RB-GSM is capable of distinguishing isomeric glycans.


Subjects
Polysaccharides; Tandem Mass Spectrometry; Chromatography, Liquid/methods; Glycosylation; Humans; Isomerism; Polysaccharides/chemistry; Tandem Mass Spectrometry/methods
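
The reciprocal best-hit rule described in the abstract above can be sketched as follows. This is a minimal illustration, assuming the scores for all candidate (spectrum, glycan) pairs that passed the precursor-mass filter are already available in a dictionary. The scoring function itself, the identifiers, and the tie-breaking rule (the first pair seen wins) are assumptions, not details from the paper.

```python
def reciprocal_best_hits(scores):
    """Return the (spectrum, glycan) pairs that are each other's top-scoring match.

    scores maps (spectrum_id, glycan_id) -> score for candidate pairs only.
    On ties, the pair encountered first is kept (an arbitrary choice for this sketch).
    """
    best_for_spectrum = {}
    best_for_glycan = {}
    for (spectrum, glycan), score in scores.items():
        current = best_for_spectrum.get(spectrum)
        if current is None or score > scores[(spectrum, current)]:
            best_for_spectrum[spectrum] = glycan
        current = best_for_glycan.get(glycan)
        if current is None or score > scores[(current, glycan)]:
            best_for_glycan[glycan] = spectrum
    return sorted(
        (spectrum, glycan)
        for spectrum, glycan in best_for_spectrum.items()
        if best_for_glycan[glycan] == spectrum
    )


# Hypothetical scores for three spectra and two isomeric glycans.
scores = {
    ("s1", "gA"): 0.90,
    ("s1", "gB"): 0.70,
    ("s2", "gA"): 0.80,
    ("s2", "gB"): 0.60,
    ("s3", "gB"): 0.65,
}
matches = reciprocal_best_hits(scores)
```

In this toy input only the pair (s1, gA) is accepted. Glycan gB scores best against spectrum s1, but s1 prefers gA, so no match involving gB is reported, which is the conservative behavior the reciprocal criterion is meant to give.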
7.
Bioinformatics ; 37(Suppl_1): i161-i168, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252973

ABSTRACT

MOTIVATION: The availability of human genomic data, together with the enhanced capacity to process them, is leading to transformative technological advances in biomedical science and engineering. However, the public dissemination of such data has been difficult due to privacy concerns. Specifically, it has been shown that the presence of a human subject in a case group can be inferred from the shared summary statistics of the group, e.g. the allele frequencies, or even the presence/absence of genetic variants (e.g. shared by the Beacon project) in the group. These methods rely on the availability of the target's genome, i.e. the DNA profile of a target human subject, and thus are often referred to as membership inference methods. RESULTS: In this article, we demonstrate that haplotypes, i.e. the sequences of single nucleotide variations (SNVs) showing strong genetic linkages in human genome databases, may be inferred from the summary of genomic data without using a target's genome. Furthermore, novel haplotypes that did not appear in the database may be reconstructed solely from the allele frequencies from genomic datasets. These reconstructed haplotypes can be used for a haplotype-based membership inference algorithm to identify target subjects in a case group with greater power than existing methods based on SNVs. AVAILABILITY AND IMPLEMENTATION: The implementation of the membership inference algorithms is available at https://github.com/diybu/Haplotype-based-membership-inferences.


Subjects
Genome, Human; Genomics; Algorithms; Gene Frequency; Haplotypes; Humans
8.
Mol Cell ; 54(1): 30-42, 2014 Apr 10.
Article in English | MEDLINE | ID: mdl-24657166

ABSTRACT

In Arabidopsis, multisubunit RNA polymerases IV and V orchestrate RNA-directed DNA methylation (RdDM) and transcriptional silencing, but what identifies the loci to be silenced is unclear. We show that heritable silent locus identity at a specific subset of RdDM targets requires HISTONE DEACETYLASE 6 (HDA6) acting upstream of Pol IV recruitment and siRNA biogenesis. At these loci, epigenetic memory conferring silent locus identity is erased in hda6 mutants such that restoration of HDA6 activity cannot restore siRNA biogenesis or silencing. Silent locus identity is similarly lost in mutants for the cytosine maintenance methyltransferase, MET1. By contrast, pol IV or pol V mutants disrupt silencing without erasing silent locus identity, allowing restoration of Pol IV or Pol V function to restore silencing. Collectively, these observations indicate that silent locus specification and silencing are separable steps that together account for epigenetic inheritance of the silenced state.


Subjects
Arabidopsis Proteins/genetics; Arabidopsis/genetics; DNA-Directed RNA Polymerases/genetics; Epigenesis, Genetic; Gene Expression Regulation, Enzymologic; Gene Expression Regulation, Plant; Histone Deacetylases/genetics; RNA Interference; Arabidopsis/enzymology; Arabidopsis Proteins/metabolism; Cytosine/metabolism; DNA (Cytosine-5-)-Methyltransferases/genetics; DNA (Cytosine-5-)-Methyltransferases/metabolism; DNA Methylation; DNA Transposable Elements; DNA-Directed RNA Polymerases/metabolism; Genetic Loci; Genotype; Heredity; Histone Deacetylases/metabolism; Mutation; Phenotype; RNA, Small Interfering/biosynthesis
9.
J Proteome Res ; 20(6): 3345-3352, 2021 06 04.
Article in English | MEDLINE | ID: mdl-34010560

ABSTRACT

Glycosylation is one of the most common post-translational modifications (PTMs), occurring in a large variety of proteins with important biological functions in humans and other higher organisms. Liquid chromatography tandem mass spectrometry (LC-MS/MS) has been routinely used to characterize site-specific protein glycosylation at high throughput in complex glycoproteomic samples. Recently, electron transfer/high-energy collision dissociation (EThcD) was introduced for glycopeptide identification, which offers rich structural information on glycopeptides with the fragment ions from the cleavages of both the glycan and the peptide backbone. Herein, we present the software GlycoHybridSeq for automated interpretation of EThcD-MS/MS spectra from glycoproteomic data using a customized scoring function, which enables the functionalities of identifying glycopeptides, characterizing glycosylation sites, and distinguishing some isomeric glycans. We evaluate GlycoHybridSeq on glycoproteomic data collected for cancer biomarker discovery. The results showed that it achieved comparable or better performance than that of Byonic and MSFragger. GlycoHybridSeq is released as open-source software and is ready to be used in large-scale glycoproteomic data analyses.


Subjects
Glycopeptides; Tandem Mass Spectrometry; Chromatography, Liquid; Electrons; Glycosylation; Humans
10.
Bioinformatics ; 36(19): 4838-4845, 2020 12 08.
Article in English | MEDLINE | ID: mdl-32311007

ABSTRACT

MOTIVATION: Third generation sequencing techniques, such as the Single Molecule Real Time technique from PacBio and the MinION technique from Oxford Nanopore, can generate long, error-prone sequencing reads, which pose new challenges for fragment assembly algorithms. In this paper, we study the overlap detection problem for error-prone reads, which is the first and most critical step in de novo fragment assembly. We observe that none of the state-of-the-art methods achieves ideal accuracy for overlap detection (both precision and recall are relatively low) due to the high sequencing error rates, especially when the overlap lengths between reads are relatively short (e.g. <2000 bases). This limitation appears inherent to these algorithms due to their usage of q-gram-based seeds under the seed-extension framework. RESULTS: We propose smooth q-gram, a variant of q-gram that captures q-gram pairs within small edit distances, and design a novel algorithm for detecting overlapping reads using smooth q-gram-based seeds. We implemented the algorithm and tested it on both PacBio and Nanopore sequencing datasets. Our benchmarking results demonstrated that our algorithm outperforms the existing q-gram-based overlap detection algorithms, especially for reads with relatively short overlapping lengths. AVAILABILITY AND IMPLEMENTATION: The source code of our implementation in C++ is available at https://github.com/FIGOGO/smoothq. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
High-Throughput Nucleotide Sequencing; Nanopores; Algorithms; Sequence Analysis, DNA; Software
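
To illustrate the limitation of exact q-gram seeds that motivates the abstract above, the sketch below implements the conventional baseline only: it indexes the q-grams of one read and reports the positions where another read shares an identical q-gram. It does not implement the smooth q-gram itself. The reads, the value of q, and the function names are hypothetical.

```python
from collections import defaultdict


def qgram_index(read, q):
    """Map each length-q substring of read to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(read) - q + 1):
        index[read[i:i + q]].append(i)
    return index


def shared_seeds(read_a, read_b, q):
    """Return (pos_in_a, pos_in_b) for every exact q-gram shared by the two reads."""
    index_a = qgram_index(read_a, q)
    seeds = []
    for j in range(len(read_b) - q + 1):
        for i in index_a.get(read_b[j:j + q], ()):
            seeds.append((i, j))
    return seeds


read_a = "ACGTACGGA"
clean_overlap = "TACGGATT"   # suffix of read_a followed by new bases
noisy_overlap = "TACCGATT"   # the same read with one substitution (G -> C)

exact = shared_seeds(read_a, clean_overlap, q=4)
after_error = shared_seeds(read_a, noisy_overlap, q=4)
```

With q = 4, the error-free read yields three seeds on one diagonal, whereas a single substitution in this short overlap removes every exact seed. Seeds that tolerate a small edit distance, as the abstract proposes, are intended to recover such cases.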
11.
Bioinformatics ; 36(Suppl_1): i128-i135, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657380

ABSTRACT

MOTIVATION: The generalized linear mixed model (GLMM) is an extension of the generalized linear model (GLM) in which the linear predictor takes random effects into account. Given its power of precisely modeling the mixed effects from multiple sources of random variations, the method has been widely used in biomedical computation, for instance in the genome-wide association studies (GWASs) that aim to detect genetic variance significantly associated with phenotypes such as human diseases. Collaborative GWAS on large cohorts of patients across multiple institutions is often impeded by the privacy concerns of sharing personal genomic and other health data. To address such concerns, we present in this paper a privacy-preserving Expectation-Maximization (EM) algorithm to build GLMM collaboratively when input data are distributed to multiple participating parties and cannot be transferred to a central server. We assume that the data are horizontally partitioned among participating parties: i.e. each party holds a subset of records (including observational values of fixed effect variables and their corresponding outcome), and for all records, the outcome is regulated by the same set of known fixed effects and random effects. RESULTS: Our collaborative EM algorithm is mathematically equivalent to the original EM algorithm commonly used in GLMM construction. The algorithm also runs efficiently when tested on simulated and real human genomic data, and thus can be practically used for privacy-preserving GLMM construction. We implemented the algorithm for collaborative GLMM (cGLMM) construction in R. The data communication was implemented using the rsocket package. AVAILABILITY AND IMPLEMENTATION: The software is released in open source at https://github.com/huthvincent/cGLMM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genome-Wide Association Study; Privacy; Genomics; Humans; Linear Models; Software
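
The horizontal partitioning assumed in the abstract above means each party holds complete records for a subset of individuals. A minimal sketch of why this setting suits collaborative fitting follows: for an ordinary least-squares line, each party can share only a few summed statistics, and the combined sums give exactly the same fit as pooling the raw records. This is a deliberately simplified stand-in (plain linear regression, no random effects, no EM iterations, no cryptographic protection), not the cGLMM algorithm, and all names and data are hypothetical.

```python
def local_stats(records):
    """Per-party summary of (x, y) records: n, sum x, sum y, sum x^2, sum x*y."""
    n = len(records)
    sx = sum(x for x, _ in records)
    sy = sum(y for _, y in records)
    sxx = sum(x * x for x, _ in records)
    sxy = sum(x * y for x, y in records)
    return (n, sx, sy, sxx, sxy)


def combine(stats_list):
    """Element-wise sum of the parties' summaries; no raw record is exchanged."""
    return tuple(sum(values) for values in zip(*stats_list))


def fit_line(stats):
    """Closed-form least-squares slope and intercept from the summed statistics."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Two hypothetical parties, each holding different individuals (rows).
party_a = [(1.0, 3.0), (2.0, 5.0)]
party_b = [(3.0, 7.0), (4.0, 9.0)]

collaborative = fit_line(combine([local_stats(party_a), local_stats(party_b)]))
pooled = fit_line(local_stats(party_a + party_b))
```

Because the statistics are sums over records, the collaborative and pooled fits coincide. This mirrors, in a much simpler model, the abstract's statement that the collaborative algorithm is mathematically equivalent to the original one.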
12.
Mol Cell Proteomics ; 18(8 suppl 1): S183-S192, 2019 08 09.
Article in English | MEDLINE | ID: mdl-31142575

ABSTRACT

Matching metagenomic and/or metatranscriptomic data, currently often under-used, can be a useful reference for metaproteomic tandem mass spectra (MS/MS) data analysis. Here we developed a software pipeline for identification of peptides and proteins from metaproteomic MS/MS data using proteins derived from matching metagenomic (and metatranscriptomic) data as the search database, based on two novel approaches: Graph2Pro (published) and Var2Pep (new). Graph2Pro retains and uses uncertainties of metagenome assembly for reference-based MS/MS data analysis. Var2Pep considers the variations found in metagenomic/metatranscriptomic sequencing reads that are not retained in the assemblies (contigs). The new software pipeline provides a one-stop application of both tools, and it supports the use of metagenome assemblies from commonly used assemblers including MegaHit and metaSPAdes. When tested on two collections of multi-omic microbiome data sets, our pipeline significantly improved the identification rate of the metaproteomic MS/MS spectra, by about twofold compared to conventional contig- or read-based approaches (Var2Pep alone identified 5.6% to 24.1% more unique peptides, depending on the data set). We also showed that identified variant peptides are important for functional profiling of microbiomes. All results suggested that it is important to take the assembly uncertainties and genomic variants into consideration to facilitate metaproteomic MS/MS data interpretation.


Subjects
Algorithms; Microbiota/genetics; Proteogenomics/methods; Seawater/microbiology; Wastewater/microbiology; Databases, Protein; Genetic Variation; Peptides/genetics; Tandem Mass Spectrometry
13.
Proteomics ; 20(21-22): e2000002, 2020 11.
Article in English | MEDLINE | ID: mdl-32415809

ABSTRACT

With the accumulation of MS/MS spectra collected in spectral libraries, the spectral library searching approach emerges as an important approach for peptide identification in proteomics, complementary to the commonly used protein database searching approach, in particular for the proteomic analyses of well-studied model organisms, such as human. Existing spectral library searching algorithms compare a query MS/MS spectrum with each spectrum in the library with matched precursor mass and charge state, which may become computationally intensive with the rapidly growing library size. Here, the software msSLASH, which implements a fast spectral library searching algorithm based on the Locality-Sensitive Hashing (LSH) technique, is presented. The algorithm first converts the library and query spectra into bit-strings using LSH functions, and then computes the similarity between the spectra with highly similar bit-strings. Using spectral library searching of large real-world MS/MS spectra datasets, it is demonstrated that the algorithm significantly reduced the number of spectral comparisons and, as a result, achieved a 2-9X speedup in comparison with the existing spectral library searching algorithm SpectraST. The spectral searching algorithm is implemented in C/C++, and is ready to be used in proteomic data analyses.


Subjects
Proteomics; Tandem Mass Spectrometry; Algorithms; Databases, Protein; Humans; Peptide Library; Software
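
The idea of converting spectra into bit-strings with LSH functions, as described in the abstract above, can be sketched with random-hyperplane hashing, one common LSH family for cosine similarity. The abstract does not state which LSH functions msSLASH uses, so the hashing scheme, the 16-bit signature length, and the toy intensity vectors below are all assumptions made for illustration.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """Bit-string (as an int): one bit per hyperplane, set when the projection is >= 0."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(v * w for v, w in zip(vector, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming(sig_a, sig_b):
    """Number of differing bits between two signatures."""
    return bin(sig_a ^ sig_b).count("1")


# Hypothetical binned intensity vectors (8 m/z bins).
planes = make_hyperplanes(num_bits=16, dim=8)
library_spectrum = [0.0, 5.0, 0.0, 9.0, 1.0, 0.0, 3.0, 0.0]
query_same_shape = [2.0 * x for x in library_spectrum]   # same peaks, twice as intense
query_opposite = [-x for x in library_spectrum]          # artificial worst case

close = hamming(signature(library_spectrum, planes), signature(query_same_shape, planes))
far = hamming(signature(library_spectrum, planes), signature(query_opposite, planes))
```

Vectors pointing in similar directions tend to fall on the same side of most hyperplanes, so a small Hamming distance between signatures can serve as a cheap filter before the full similarity is computed for the remaining candidates.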
14.
Anal Chem ; 92(6): 4275-4283, 2020 03 17.
Article in English | MEDLINE | ID: mdl-32053352

ABSTRACT

The ability to predict tandem mass (MS/MS) spectra from peptide sequences can significantly enhance our understanding of the peptide fragmentation process and could improve peptide identification in proteomics. However, current approaches for predicting high-energy collisional dissociation (HCD) spectra are limited to predicting the intensities of expected ion types, that is, the a/b/c/x/y/z ions and their neutral loss derivatives (referred to as backbone ions). In practice, backbone ions only account for <70% of total ion intensities in HCD spectra, indicating that many intense ions are ignored by current predictors. In this paper, we present a deep learning approach that can predict the complete spectra (both backbone and nonbackbone ions) directly from peptide sequences. We make no assumptions about which kinds of ions to predict but instead predict the intensities for all possible m/z. Training this model needs neither annotations of fragment ions nor any prior knowledge of the fragmentation rules. Our analyses show that the predicted 2+ and 3+ HCD spectra are highly similar to the experimental spectra, with average full-spectrum cosine similarities of 0.820 (±0.088) and 0.786 (±0.085), respectively, very close to the similarities between the experimental replicated spectra. In contrast, the best-performing backbone-only models can only achieve average similarities below 0.75 and 0.70 for 2+ and 3+ spectra, respectively. Furthermore, we developed a multitask learning (MTL) approach for predicting spectra with insufficient training samples, which allows our model to make accurate predictions for electron transfer dissociation (ETD) spectra and HCD spectra of less abundant charges (1+ and 4+).


Subjects
Neural Networks, Computer; Peptides/analysis; Tandem Mass Spectrometry
15.
Genes Dev ; 26(16): 1825-36, 2012 Aug 15.
Article in English | MEDLINE | ID: mdl-22855789

ABSTRACT

Multisubunit RNA polymerases IV and V (Pols IV and V) mediate RNA-directed DNA methylation and transcriptional silencing of retrotransposons and heterochromatic repeats in plants. We identified genomic sites of Pol V occupancy in parallel with siRNA deep sequencing and methylcytosine mapping, comparing wild-type plants with mutants defective for Pol IV, Pol V, or both Pols IV and V. Approximately 60% of Pol V-associated regions encompass regions of 24-nucleotide (nt) siRNA complementarity and cytosine methylation, consistent with cytosine methylation being guided by base-pairing of Pol IV-dependent siRNAs with Pol V transcripts. However, 27% of Pol V peaks do not overlap sites of 24-nt siRNA biogenesis or cytosine methylation, indicating that Pol V alone does not specify sites of cytosine methylation. Surprisingly, the number of methylated CHH motifs, a hallmark of RNA-directed de novo methylation, is similar in wild-type plants and Pol IV or Pol V mutants. In the mutants, methylation is lost at 50%-60% of the CHH sites that are methylated in the wild type but is gained at new CHH positions, primarily in pericentromeric regions. These results indicate that Pol IV and Pol V are not required for cytosine methyltransferase activity but shape the epigenome by guiding CHH methylation to specific genomic sites.


Subjects
Arabidopsis Proteins; Arabidopsis; Cytosine/metabolism; DNA Methylation; DNA-Directed RNA Polymerases; Genome, Plant; RNA, Small Interfering/metabolism; Amino Acid Motifs; Arabidopsis/genetics; Arabidopsis/metabolism; Arabidopsis Proteins/genetics; Arabidopsis Proteins/metabolism; DNA-Directed RNA Polymerases/genetics; DNA-Directed RNA Polymerases/metabolism; Gene Expression Regulation, Plant; Mutation; RNA, Small Interfering/genetics
16.
J Proteome Res ; 18(1): 147-158, 2019 01 04.
Article in English | MEDLINE | ID: mdl-30511858

ABSTRACT

Large-scale proteomics projects often generate massive and highly redundant tandem mass spectra. Spectral clustering algorithms can reduce the redundancy in these data sets and thus speed up database searching for peptide identification, a major bottleneck for proteomic data analysis. The key challenge of spectral clustering is to reduce the redundancy in the MS/MS spectra data while retaining sufficient sensitivity to identify peptides from the clustered spectra. We present the software msCRUSH, which implements a novel spectral clustering algorithm based on the locality sensitive hashing technique. When tested on a large-scale proteomic data set consisting of 23.6 million spectra (including 14.4 million spectra of charge 2+), msCRUSH runs 6.9-11.3 times faster than the state-of-the-art spectral clustering software, PRIDE Cluster, while achieving higher clustering sensitivity and comparable accuracy. Using the consensus spectra reported by msCRUSH, commonly used spectra search engines MSGF+ and Mascot can identify 3 and 1% more unique peptides, respectively, compared with the identification results from the raw MS/MS spectra at the same false discovery rate (1% FDR) of peptide level. msCRUSH is implemented in C++ and is released as open-source software.


Subjects
Cluster Analysis; Software; Tandem Mass Spectrometry/methods; Algorithms; Peptides/analysis; Proteomics/methods; Time Factors
17.
BMC Genomics ; 20(Suppl 12): 1002, 2019 Dec 30.
Article in English | MEDLINE | ID: mdl-31888455

ABSTRACT

BACKGROUND: Bacterial cells during many replication cycles accumulate spontaneous mutations, which result in the birth of novel clones. As a result of this clonal expansion, an evolving bacterial population has different clonal composition over time, as revealed in the long-term evolution experiments (LTEEs). Accurately inferring the haplotypes of novel clones as well as the clonal frequencies and the clonal evolutionary history in a bacterial population is useful for the characterization of the evolutionary pressure on multiple correlated mutations instead of that on individual mutations. RESULTS: In this paper, we study the computational problem of reconstructing the haplotypes of bacterial clones from the variant allele frequencies observed from an evolving bacterial population at multiple time points. We formalize the problem using a maximum likelihood function, which is defined under the assumption that mutations occur spontaneously, and thus the likelihood of a mutation occurring in a specific clone is proportional to the frequency of the clone in the population when the mutation occurs. We develop a series of heuristic algorithms to address the maximum likelihood inference, and show through simulation experiments that the algorithms are fast and achieve near optimal accuracy that is practically plausible under the maximum likelihood framework. We also validate our method using experimental data obtained from a recent study on long-term evolution of Escherichia coli. CONCLUSION: We developed efficient algorithms to reconstruct the clonal evolution history from time course genomic sequencing data. Our algorithm can also incorporate clonal sequencing data to improve the reconstruction results when they are available. Based on the evaluation on both simulated and experimental sequencing data, our algorithms can achieve satisfactory results on the genome sequencing data from long-term evolution experiments. 
AVAILABILITY: The program (ClonalTREE) is available as open-source software on GitHub at https://github.com/COL-IU/ClonalTREE.


Subjects
Algorithms; Bacteria/genetics; Clonal Evolution/genetics; Genomics/methods; Base Sequence; Gene Frequency; Genome, Bacterial/genetics; Haplotypes; High-Throughput Nucleotide Sequencing; Likelihood Functions; Mutation; Software
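
The likelihood assumption stated in the abstract above (a new mutation arises in a given clone with probability proportional to that clone's frequency at the time) can be sketched as a score for a candidate clonal tree. The code below is a simplified reading of that one sentence, not the ClonalTREE implementation: the data layout, the notion of an "origin time" index per clone, and the toy frequencies are all assumptions.

```python
import math


def parent_probability(frequencies, parent, t):
    """Probability that a mutation at time index t arose in `parent`:
    that clone's frequency divided by the total frequency of all clones at t."""
    total = sum(series[t] for series in frequencies.values())
    return frequencies[parent][t] / total if total > 0 else 0.0


def tree_log_likelihood(parents, origin_times, frequencies):
    """Sum of log parent probabilities over all non-root clones of a candidate tree.

    parents[c] is the proposed parent of clone c; origin_times[c] is the (assumed)
    time index at which the founding mutation of c occurred.
    """
    log_likelihood = 0.0
    for clone, parent in parents.items():
        p = parent_probability(frequencies, parent, origin_times[clone])
        if p == 0.0:
            return float("-inf")  # the proposed parent did not exist at that time
        log_likelihood += math.log(p)
    return log_likelihood


# Hypothetical clone frequencies at three time points.
frequencies = {
    "ancestor": [1.0, 0.6, 0.3],
    "A": [0.0, 0.4, 0.5],
    "B": [0.0, 0.0, 0.2],
}
origin_times = {"A": 0, "B": 1}

b_from_ancestor = tree_log_likelihood({"A": "ancestor", "B": "ancestor"}, origin_times, frequencies)
b_from_a = tree_log_likelihood({"A": "ancestor", "B": "A"}, origin_times, frequencies)
```

Under these toy numbers the tree in which clone B descends from the ancestor scores higher, because the ancestor was more abundant than clone A when B's mutation is assumed to have occurred. A search over candidate trees would compare such scores.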
18.
Mol Biol Evol ; 35(10): 2560-2571, 2018 10 01.
Article in English | MEDLINE | ID: mdl-30099533

ABSTRACT

Transposable elements (TEs) contribute to a large fraction of the expansion of many eukaryotic genomes due to the capability of TEs duplicating themselves through transposition. A first step to understanding the roles of TEs in a eukaryotic genome is to characterize the population-wide variation of TE insertions in the species. Here, we present a maximum-likelihood (ML) method for estimating allele frequencies and detecting selection on TE insertions in a diploid population, based on the genotypes at TE insertion sites detected in multiple individuals sampled from the population using paired-end (PE) sequencing reads. Tests of the method on simulated data show that it can accurately estimate the allele frequencies of TE insertions even when the PE sequencing is conducted at a relatively low coverage (=5X). The method can also detect TE insertions under strong selection, and the detection ability increases with sample size in a population, although a substantial fraction of actual TE insertions under selection may be undetected. Application of the ML method to genomic sequencing data collected from a natural Daphnia pulex population shows that, on the one hand, most (>90%) TE insertions present in the reference D. pulex genome are either fixed or nearly fixed (with allele frequencies >0.95); on the other hand, among the nonreference TE insertions (i.e., those detected in some individuals in the population but absent from the reference genome), the majority (>70%) are still at low frequencies (<0.1). Finally, we detected a substantial fraction (∼9%) of nonreference TE insertions under selection.


Subjects
DNA Transposable Elements , Genetic Techniques , Insertional Mutagenesis , Algorithms , Animals , Daphnia , Gene Frequency , Likelihood Functions , Genetic Selection , Whole Genome Sequencing
19.
Anal Chem ; 91(18): 11794-11802, 2019 09 17.
Article in English | MEDLINE | ID: mdl-31356052

ABSTRACT

Glycosylation is an important post-translational modification of proteins. Many diseases, such as cancer, have been shown to be related to aberrant glycosylation. High-throughput quantitative methods have gained attention recently in the study of glycomics. With the development of high-resolution mass spectrometry, the sensitivity of detection in glycomics has largely improved; however, most of the commonly used MS-based techniques are focused on relative quantitative analysis, which can hardly provide direct comparative glycomic quantitation results. In this study, we developed a novel multiplex glycomic analysis method on an LC-ESI-MS platform. Reduced glycans were labeled with stable isotopes during the permethylation procedure, with the use of iodomethane reagents CH2DI, CHD2I, CD3I, 13CH3I, 13CH2DI, 13CHD2I, 13CD3I, and CH3I. Up to 8-plex glycomic profiling was possible in a single analysis by LC-MS, and a 100k mass resolution was sufficient to allow a baseline resolution of the mass differences among the 8-plex labeled glycans. The major advantages of this method are that it overcomes quantitative fluctuations caused by nanoESI, it facilitates a level of comparative quantitative glycomic analysis that accurately reflects the quantitative information in samples, and it dramatically shortens analysis time. Quantitation validation was tested on glycans released from bovine fetuin and model glycoprotein mixtures (RNase B, bovine fetuin, and IgG) with good linearity (R² = 0.9884) and a dynamic range from 0.1 to 10. The 8-plex strategy was successfully applied to a comparative glycomic study of cancer cell lines. The results demonstrate that different distributions of sialylated glycans are related to the metastatic properties of cell lines and provide important clues for a better understanding of breast cancer brain metastasis.
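
To show why high mass resolution matters for the eight iodomethane reagents named in this abstract, the following sketch computes the mass shift each isotopic methyl group adds relative to unlabeled CH3 and the smallest gap between neighboring labels. It uses standard monoisotopic masses for H, D, 12C and 13C; the function names and the `n_sites` parameter (the number of methyl groups added to a glycan) are hypothetical and are not taken from the paper.

```python
# Monoisotopic masses in daltons (standard reference values).
H, D = 1.00782503, 2.01410178
C12, C13 = 12.0, 13.00335484

# Reagent -> (number of 13C atoms, number of D atoms) in the methyl group.
REAGENTS = {
    "CH3I": (0, 0), "CH2DI": (0, 1), "CHD2I": (0, 2), "CD3I": (0, 3),
    "13CH3I": (1, 0), "13CH2DI": (1, 1), "13CHD2I": (1, 2), "13CD3I": (1, 3),
}


def methyl_shift(n_13c, n_d):
    """Mass added per methyl group relative to unlabeled CH3."""
    return n_13c * (C13 - C12) + n_d * (D - H)


def smallest_gap(n_sites):
    """Smallest mass difference between any two of the 8 labeled forms,
    assuming every one of the n_sites methyl groups carries the same label."""
    shifts = sorted(n_sites * methyl_shift(*iso) for iso in REAGENTS.values())
    return min(b - a for a, b in zip(shifts, shifts[1:]))


if __name__ == "__main__":
    for name, iso in REAGENTS.items():
        print(f"{name:8s} +{methyl_shift(*iso):.5f} Da per methyl group")
    print(f"smallest gap per methyl group: {smallest_gap(1):.5f} Da")
```

The closest pairs are those that swap one deuterium for one 13C (for example CH2DI versus 13CH3I), which differ by only about 0.003 Da per methyl group; the gap grows in proportion to the number of methylation sites.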


Subjects
Liquid Chromatography/methods , Glycomics/methods , Iodinated Hydrocarbons/chemistry , Polysaccharides/analysis , Tandem Mass Spectrometry/methods , Brain Neoplasms/chemistry , Brain Neoplasms/pathology , Brain Neoplasms/secondary , Breast Neoplasms/chemistry , Breast Neoplasms/pathology , Carbon Isotopes , Tumor Cell Line , Female , Glycoproteins/chemistry , Humans , Methylation , Polysaccharides/chemistry , Reproducibility of Results , Electrospray Ionization Mass Spectrometry/instrumentation
20.
Bioinformatics ; 34(10): 1672-1681, 2018 05 15.
Article in English | MEDLINE | ID: mdl-29267878

ABSTRACT

Motivation: Rapid advancement in high throughput genome and transcriptome sequencing (HTS) and mass spectrometry (MS) technologies has enabled the acquisition of the genomic, transcriptomic and proteomic data from the same tissue sample. We introduce a computational framework, ProTIE, to integratively analyze all three types of omics data for a complete molecular profile of a tissue sample. Our framework features MiStrVar, a novel algorithmic method to identify micro structural variants (microSVs) on genomic HTS data. Coupled with deFuse, a popular gene fusion detection method we developed earlier, MiStrVar can accurately profile structurally aberrant transcripts in tumors. Given the breakpoints obtained by MiStrVar and deFuse, our framework can then identify all relevant peptides that span the breakpoint junctions and match them with unique proteomic signatures. Observing structural aberrations in all three types of omics data validates their presence in the tumor samples. Results: We have applied our framework to all The Cancer Genome Atlas (TCGA) breast cancer Whole Genome Sequencing (WGS) and/or RNA-Seq datasets, spanning all four major subtypes, for which proteomics data from Clinical Proteomic Tumor Analysis Consortium (CPTAC) have been released. A recent study on this dataset focusing on SNVs has reported many that lead to novel peptides. Complementing and significantly broadening this study, we detected 244 novel peptides from 432 candidate genomic or transcriptomic sequence aberrations. Many of the fusions and microSVs we discovered have not been reported in the literature. Interestingly, the vast majority of these translated aberrations, fusions in particular, were private, demonstrating the extensive inter-genomic heterogeneity present in breast cancer. Many of these aberrations also have matching out-of-frame downstream peptides, potentially indicating novel protein sequence and structure. 
Availability and implementation: MiStrVar is available for download at https://bitbucket.org/compbio/mistrvar, and ProTIE is available at https://bitbucket.org/compbio/protie. Contact: cenksahi@indiana.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Breast Neoplasms/genetics , Gene Fusion , Neoplasm Proteins/genetics , Proteogenomics/methods , Software , Female , Gene Expression Profiling/methods , Neoplastic Gene Expression Regulation , Humans , Mass Spectrometry/methods , Neoplasm Proteins/analysis , RNA Sequence Analysis/methods