1.
Genome Res ; 34(6): 914-924, 2024 Jul 23.
Article in English | MEDLINE | ID: mdl-38886068

ABSTRACT

Metagenomic long-read sequencing is gaining popularity for various applications, including pathogen detection and microbiome studies. To analyze the large data created in those studies, software tools need to taxonomically classify the sequenced molecules and estimate the relative abundances of organisms in the sequenced sample. Because of the exponential growth of reference genome databases, the current taxonomic classification methods have large computational requirements. This issue motivated us to develop a new data structure for fast and memory-efficient querying of long reads. Here, we present Taxor as a new tool for long-read metagenomic classification using a hierarchical interleaved XOR filter data structure for indexing and querying large reference genome sets. Taxor implements several k-mer-based approaches, such as syncmers, for pseudoalignment to classify reads and an expectation-maximization algorithm for metagenomic profiling. Our results show that Taxor outperforms state-of-the-art tools regarding precision while having a similar recall for long-read taxonomic classification. Most notably, Taxor reduces the memory requirements and index size by >50% and is among the fastest tools regarding query times. This enables real-time metagenomics analysis with large reference databases on a small laptop in the field.


Subjects
Algorithms , Metagenomics , Software , Metagenomics/methods , Humans , High-Throughput Nucleotide Sequencing/methods , Metagenome , DNA Sequence Analysis/methods
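
Illustrative sketch for the Taxor abstract above, which names syncmers as one of its k-mer-based approaches. The code below shows closed-syncmer selection in its simplest form. It is an assumption-laden toy, not Taxor's implementation: the function name, the parameters k and s, and the use of lexicographic order (real tools typically order s-mers by a hash) are all chosen here for illustration.

```python
def closed_syncmers(seq, k, s):
    """Return (position, k-mer) pairs for the closed syncmers of seq.

    A k-mer counts as a closed syncmer when its smallest s-mer
    (lexicographically smallest here, leftmost on ties) starts at
    offset 0 or at offset k - s inside the k-mer.
    """
    if not 0 < s <= k:
        raise ValueError("need 0 < s <= k")
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        min_offset = smers.index(min(smers))  # leftmost minimum
        if min_offset in (0, k - s):
            selected.append((i, kmer))
    return selected


picked = closed_syncmers("CATGC", k=4, s=2)
print(picked)  # [(1, 'ATGC')]
```

Whether a k-mer is selected depends only on the k-mer itself, so the same k-mer is chosen in a reference and in a read, which is what makes this kind of subsampling usable for index lookups.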
2.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36545804

ABSTRACT

Monoclonal antibodies are biotechnologically produced proteins with various applications in research, therapeutics and diagnostics. Their ability to recognize and bind to specific molecule structures makes them essential research tools and therapeutic agents. Sequence information of antibodies is helpful for understanding antibody-antigen interactions and ensuring their affinity and specificity. De novo protein sequencing based on mass spectrometry is a valuable method to obtain the amino acid sequence of peptides and proteins without a priori knowledge. In this study, we evaluated six recently developed de novo peptide sequencing algorithms (Novor, pNovo 3, DeepNovo, SMSNet, PointNovo and Casanovo), which were not specifically designed for antibody data. We validated their ability to identify and assemble antibody sequences on three multi-enzymatic data sets. The deep learning-based tools Casanovo and PointNovo showed an increased peptide recall across different enzymes and data sets compared with spectrum-graph-based approaches. We evaluated different error types of de novo peptide sequencing tools and their performance for different numbers of missing cleavage sites, noisy spectra and peptides of various lengths. We achieved a sequence coverage of 97.69-99.53% on the light chains of three different antibody data sets using the de Bruijn assembler ALPS and the predictions from Casanovo. However, low sequence coverage and accuracy on the heavy chains demonstrate that complete de novo protein sequencing remains a challenging issue in proteomics that requires improved de novo error correction, alternative digestion strategies and hybrid approaches such as homology search to achieve high accuracy on long protein sequences.


Subjects
Monoclonal Antibodies , Peptides , Amino Acid Sequence , Monoclonal Antibodies/genetics , Peptides/genetics , Peptides/chemistry , Algorithms , Protein Sequence Analysis/methods
3.
Mol Cell Proteomics ; 22(3): 100509, 2023 03.
Article in English | MEDLINE | ID: mdl-36791992

ABSTRACT

Lysosomes, the main degradative organelles of mammalian cells, play a key role in the regulation of metabolism. It is becoming more and more apparent that they are highly active, diverse, and involved in a large variety of processes. The essential role of lysosomes is exemplified by the detrimental consequences of their malfunction, which can result in lysosomal storage disorders, neurodegenerative diseases, and cancer. Using lysosome enrichment and mass spectrometry, we investigated the lysosomal proteomes of HEK293, HeLa, HuH-7, SH-SY5Y, MEF, and NIH3T3 cells. We provide evidence on a large scale for cell type-specific differences of lysosomes, showing that levels of distinct lysosomal proteins are highly variable within one cell type, while expression of others is highly conserved across several cell lines. Using differentially stable isotope-labeled cells and bimodal distribution analysis, we furthermore identify a high confidence population of lysosomal proteins for each cell line. Multi-cell line correlation of these data reveals potential novel lysosomal proteins, and we confirm lysosomal localization for six candidates. All data are available via ProteomeXchange with identifier PXD020600.


Subjects
Neuroblastoma , Proteome , Mice , Animals , Humans , Proteome/metabolism , HEK293 Cells , NIH 3T3 Cells , Neuroblastoma/metabolism , Lysosomes/metabolism , Mammals/metabolism
4.
Nucleic Acids Res ; 51(W1): W331-W337, 2023 07 05.
Article in English | MEDLINE | ID: mdl-37167010

ABSTRACT

The mpox virus (MPXV) is mutating at an exceptional rate for a DNA virus and its global spread is concerning, making genomic surveillance a necessity. With MpoxRadar, we provide an interactive dashboard to track virus variants at the mutation level worldwide. MpoxRadar allows users to select among different genomes as reference for comparison. The occurrence of mutation profiles based on the selected reference is indicated on an interactive world map that shows the respective geographic sampling site in customizable time ranges to easily follow the frequency or trend of defined mutations. Furthermore, the user can filter for specific mutations, genes, countries, genome types, and sequencing protocols and download the filtered data directly from MpoxRadar. On the server, we automatically download all MPXV genomes and metadata from the National Center for Biotechnology Information (NCBI) on a daily basis, align them to the different reference genomes, and generate mutation profiles, which are stored and linked to the available metainformation in a database. This makes MpoxRadar a practical tool for the genomic surveillance of MPXV, supporting users with limited computational resources. MpoxRadar is open-source and freely accessible at https://MpoxRadar.net.


Subjects
Viral Genome , Genomics , Monkeypox virus , Software , Factual Databases , Monkeypox virus/genetics
5.
BMC Bioinformatics ; 25(1): 228, 2024 Jul 02.
Article in English | MEDLINE | ID: mdl-38956506

ABSTRACT

BACKGROUND: Fungi play a key role in several important ecological functions, ranging from organic matter decomposition to symbiotic associations with plants. Moreover, fungi naturally inhabit the human body and can be beneficial when administered as probiotics. In mycology, the internal transcribed spacer (ITS) region was adopted as the universal marker for classifying fungi. Hence, an accurate and robust method for ITS classification is not only desired for the purpose of better diversity estimation, but it can also help us gain a deeper insight into the dynamics of environmental communities and ultimately comprehend whether the abundance of certain species correlates with health and disease. Although many methods have been proposed for taxonomic classification, to the best of our knowledge, none of them fully explore the taxonomic tree hierarchy when building their models. This, in turn, leads to lower generalization power and a higher risk of committing classification errors. RESULTS: Here we introduce HiTaC, a robust hierarchical machine learning model for accurate ITS classification, which requires a small amount of data for training and can handle imbalanced datasets. HiTaC was thoroughly evaluated with the established TAXXI benchmark and could correctly classify fungal ITS sequences of varying lengths and a range of identity differences between the training and test data. HiTaC outperforms state-of-the-art methods when trained over noisy data, consistently achieving higher F1-score and sensitivity across different taxonomic ranks, improving sensitivity by 6.9 percentage points over top methods in the noisiest dataset available on TAXXI. CONCLUSIONS: HiTaC is publicly available at the Python package index, BIOCONDA and Docker Hub. It is released under the new BSD license, allowing free use in academia and industry. Source code and documentation, which includes installation and usage instructions, are available at https://gitlab.com/dacs-hpi/hitac .


Subjects
Fungi , Machine Learning , Fungi/genetics , Fungi/classification , Ribosomal Spacer DNA/genetics , Software
6.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34297793

ABSTRACT

Novel pathogens evolve quickly and may emerge rapidly, causing dangerous outbreaks or even global pandemics. Next-generation sequencing is the state of the art in open-view pathogen detection, and one of the few methods available at the earliest stages of an epidemic, even when the biological threat is unknown. Analyzing the samples as the sequencer is running can greatly reduce the turnaround time, but existing tools rely on close matches to lists of known pathogens and perform poorly on novel species. Machine learning approaches can predict if single reads originate from more distant, unknown pathogens but require relatively long input sequences and processed data from a finished sequencing run. Incomplete sequences contain less information, leading to a trade-off between sequencing time and detection accuracy. Using a workflow for real-time pathogenic potential prediction, we investigate which subsequences already allow accurate inference. We train deep neural networks to classify Illumina and Nanopore reads and integrate the models with HiLive2, a real-time Illumina mapper. This approach outperforms alternatives based on machine learning and sequence alignment on simulated and real data, including SARS-CoV-2 sequencing runs. After just 50 Illumina cycles, we observe an 80-fold sensitivity increase compared to real-time mapping. The first 250 bp of Nanopore reads, corresponding to 0.5 s of sequencing time, are enough to yield predictions more accurate than mapping the finished long reads. The approach could also be used for screening synthetic sequences against biosecurity threats.


Subjects
COVID-19/genetics , SARS-CoV-2/isolation & purification , COVID-19/virology , Deep Learning , High-Throughput Nucleotide Sequencing , Humans , Nanopores , Computer Neural Networks , SARS-CoV-2/genetics , SARS-CoV-2/pathogenicity , Sequence Alignment
7.
Bioinformatics ; 38(Suppl_2): ii168-ii174, 2022 09 16.
Article in English | MEDLINE | ID: mdl-36124807

ABSTRACT

BACKGROUND: Emerging pathogens are a growing threat, but large data collections and approaches for predicting the risk associated with novel agents are limited to bacteria and viruses. Pathogenic fungi, which also pose a constant threat to public health, remain understudied. Relevant data remain comparatively scarce and scattered among many different sources, hindering the development of sequencing-based detection workflows for novel fungal pathogens. No prediction method working for agents across all three groups is available, even though the cause of an infection is often difficult to identify from symptoms alone. RESULTS: We present a curated collection of fungal host range data, comprising records on human, animal and plant pathogens, as well as other plant-associated fungi, linked to publicly available genomes. We show that it can be used to predict the pathogenic potential of novel fungal species directly from DNA sequences with either sequence homology or deep learning. We develop learned, numerical representations of the collected genomes and visualize the landscape of fungal pathogenicity. Finally, we train multi-class models predicting if next-generation sequencing reads originate from novel fungal, bacterial or viral threats. CONCLUSIONS: The neural networks trained using our data collection enable accurate detection of novel fungal pathogens. A curated set of over 1400 genomes with host and pathogenicity metadata supports training of machine-learning models and sequence comparison, not limited to the pathogen detection task. AVAILABILITY AND IMPLEMENTATION: The data, models and code are hosted at https://zenodo.org/record/5846345, https://zenodo.org/record/5711877 and https://gitlab.com/dacs-hpi/deepac. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
DNA , Fungi , Animals , Bacteria/genetics , Data Collection , Fungi/genetics , Humans , Machine Learning , Computer Neural Networks
8.
Bioinformatics ; 38(Suppl 1): i153-i160, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758774

ABSTRACT

MOTIVATION: Nanopore sequencers allow targeted sequencing of interesting nucleotide sequences by rejecting other sequences from individual pores. This feature facilitates the enrichment of low-abundance sequences by depleting overrepresented ones in silico. Existing tools for adaptive sampling either apply signal alignment, which cannot handle human-sized reference sequences, or apply read mapping in sequence space relying on fast graphics processing unit (GPU) base callers for real-time read rejection. Using nanopore long-read mapping tools is also not optimal when mapping shorter reads as usually analyzed in adaptive sampling applications. RESULTS: Here, we present a new approach for nanopore adaptive sampling that combines fast CPU and GPU base calling with read classification based on Interleaved Bloom Filters. ReadBouncer improves the potential enrichment of low-abundance sequences by its high read classification sensitivity and specificity, outperforming existing tools in the field. It robustly removes even reads belonging to large reference sequences while running on commodity hardware without GPUs, making adaptive sampling accessible for in-field researchers. ReadBouncer also provides a user-friendly interface and installer files for end-users without a bioinformatics background. AVAILABILITY AND IMPLEMENTATION: The C++ source code is available at https://gitlab.com/dacs-hpi/readbouncer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Nanopore Sequencing , Nanopores , High-Throughput Nucleotide Sequencing , Humans , DNA Sequence Analysis , Software
9.
Bioinformatics ; 38(Suppl_2): ii113-ii119, 2022 09 16.
Article in English | MEDLINE | ID: mdl-36124784

ABSTRACT

MOTIVATION: While it has been well established that drugs affect and help patients differently, personalized drug response predictions remain challenging. Solutions based on single omics measurements have been proposed, and networks provide means to incorporate molecular interactions into reasoning. However, how to integrate the wealth of information contained in multiple omics layers still poses a complex problem. RESULTS: We present DrDimont, Drug response prediction from Differential analysis of multi-omics networks. It allows for comparative conclusions between two conditions and translates them into differential drug response predictions. DrDimont focuses on molecular interactions. It establishes condition-specific networks from correlation within an omics layer that are then reduced and combined into heterogeneous, multi-omics molecular networks. A novel semi-local, path-based integration step ensures integrative conclusions. Differential predictions are derived from comparing the condition-specific integrated networks. DrDimont's predictions are explainable, i.e. molecular differences that are the source of high differential drug scores can be retrieved. We predict differential drug response in breast cancer using transcriptomics, proteomics, phosphosite and metabolomics measurements and contrast estrogen receptor positive and receptor negative patients. DrDimont performs better than drug prediction based on differential protein expression or PageRank when evaluating it on ground truth data from cancer cell lines. We find proteomic and phosphosite layers to carry most information for distinguishing drug response. AVAILABILITY AND IMPLEMENTATION: DrDimont is available on CRAN: https://cran.r-project.org/package=DrDimont. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Breast Neoplasms , Software , Breast Neoplasms/drug therapy , Female , Humans , Proteomics , Estrogen Receptors , Transcriptome
10.
Bioinformatics ; 38(17): 4223-4225, 2022 09 02.
Article in English | MEDLINE | ID: mdl-35799354

ABSTRACT

SUMMARY: The ongoing pandemic caused by SARS-CoV-2 emphasizes the importance of genomic surveillance to understand the evolution of the virus, to monitor the viral population, and to plan epidemiological responses. Detailed analysis, easy visualization and intuitive filtering of the latest viral sequences are powerful for this purpose. We present CovRadar, a tool for genomic surveillance of the SARS-CoV-2 Spike protein. CovRadar consists of an analytical pipeline and a web application that enable the analysis and visualization of hundreds of thousands of sequences. First, CovRadar extracts the regions of interest using local alignment, then builds a multiple sequence alignment, infers variants and consensus and finally presents the results in an interactive app, making accessing and reporting simple, flexible and fast. AVAILABILITY AND IMPLEMENTATION: CovRadar is freely accessible at https://covradar.net, its open-source code is available at https://gitlab.com/dacs-hpi/covradar. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , Genomics , Mutation
11.
J Proteome Res ; 21(4): 899-909, 2022 04 01.
Article in English | MEDLINE | ID: mdl-35086334

ABSTRACT

In liquid-chromatography-tandem-mass-spectrometry-based proteomics, information about the presence and stoichiometry of protein modifications is not readily available. To overcome this problem, we developed multiFLEX-LF, a computational tool that builds upon FLEXIQuant, which detects modified peptide precursors and quantifies their modification extent by monitoring the differences between observed and expected intensities of the unmodified precursors. multiFLEX-LF relies on robust linear regression to calculate the modification extent of a given precursor relative to a within-study reference. multiFLEX-LF can analyze entire label-free discovery proteomics data sets in a precursor-centric manner without preselecting a protein of interest. To analyze modification dynamics and coregulated modifications, we hierarchically clustered the precursors of all proteins based on their computed relative modification scores. We applied multiFLEX-LF to a data-independent-acquisition-based data set acquired using the anaphase-promoting complex/cyclosome (APC/C) isolated at various time points during mitosis. The clustering of the precursors allows for identifying varying modification dynamics and ordering the modification events. Overall, multiFLEX-LF enables the fast identification of potentially differentially modified peptide precursors and the quantification of their differential modification extent in large data sets using a personal computer. Additionally, multiFLEX-LF can drive the large-scale investigation of the modification dynamics of peptide precursors in time-series and case-control studies. multiFLEX-LF is available at https://gitlab.com/SteenOmicsLab/multiflex-lf.


Subjects
Proteins , Proteomics , Liquid Chromatography , Mass Spectrometry , Peptides
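
Illustrative sketch for the multiFLEX-LF abstract above, which describes comparing observed precursor intensities with those expected from a robust regression against a within-study reference. The code below is a minimal stand-in, not the tool's method: it swaps in a median-of-ratios slope through the origin as the outlier-resistant estimator, and the function names, the score definition (observed divided by expected) and the intensity values are all hypothetical.

```python
from statistics import median


def robust_slope(reference, sample):
    """Outlier-resistant slope through the origin: the median of the
    per-precursor ratios sample / reference (reference must be > 0)."""
    return median(s / r for r, s in zip(reference, sample))


def relative_scores(reference, sample):
    """Observed intensity divided by the intensity expected from the fit.

    Values near 1 suggest an unmodified precursor; values well below 1
    suggest that part of the precursor pool carries a modification.
    """
    slope = robust_slope(reference, sample)
    return [s / (slope * r) for r, s in zip(reference, sample)]


# Hypothetical intensities for five precursors of one protein.
reference = [100, 200, 400, 800, 1000]
sample = [50, 100, 200, 100, 500]
print(relative_scores(reference, sample))  # [1.0, 1.0, 1.0, 0.25, 1.0]
```

The fourth precursor falls far below the fitted line, which is the kind of signal the abstract describes as indicating a differentially modified precursor. A median-based slope is used here only because a single depleted precursor does not shift it.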
12.
Brief Bioinform ; 21(5): 1596-1608, 2020 09 25.
Article in English | MEDLINE | ID: mdl-32978619

ABSTRACT

Bacterial proteins dubbed virulence factors (VFs) are a highly diverse group of sequences, whose only obvious commonality is the very property of being, more or less directly, involved in virulence. It is therefore tempting to speculate whether their prediction, based on direct sequence similarity (seqsim) to known VFs, could be enhanced or even replaced by using machine-learning methods. Specifically, when trained on a large and diverse set of VFs, such may be able to detect putative, non-trivial characteristics shared by otherwise unrelated VF families and therefore better predict novel VFs with insignificant similarity to each individual family. We therefore first reassess the performance of dimer-based Support Vector Machines, as used in the widely used MP3 method, in light of seqsim-only and seqsim/dimer-hybrid classifiers. We then repeat the analysis with a novel, considerably more diverse data set, also addressing the important problem of negative data selection. Finally, we move on to the real-world use case of proteome-wide VF prediction, outlining different approaches to estimating specificity in this scenario. We find that direct seqsim is of unparalleled importance and therefore should always be exploited. Further, we observe strikingly low correlations between different feature and classifier types when ranking proteins by VF likeness. We therefore propose a 'best of each world' approach to prioritize proteins for experimental testing, focussing on the top predictions of each classifier. Further, classifiers for individual VF families should be developed.


Subjects
Bacteria/pathogenicity , Bacterial Proteins/metabolism , Support Vector Machine , Virulence Factors/metabolism , Algorithms , Amino Acid Sequence , Bacterial Proteins/chemistry , Datasets as Topic , Dimerization , Proteome , Virulence Factors/chemistry
13.
J Proteome Res ; 20(4): 2083-2088, 2021 04 02.
Article in English | MEDLINE | ID: mdl-33661648

ABSTRACT

The study of microbiomes has gained in importance over the past few years and has led to the emergence of the fields of metagenomics, metatranscriptomics, and metaproteomics. While initially focused on the study of biodiversity within these communities, the emphasis has increasingly shifted to the study of (changes in) the complete set of functions available in these communities. A key tool to study this functional complement of a microbiome is Gene Ontology (GO) term analysis. However, comparing large sets of GO terms is not an easy task due to the deeply branched nature of GO, which limits the utility of exact term matching. To solve this problem, we here present MegaGO, a user-friendly tool that relies on semantic similarity between GO terms to compute the functional similarity between multiple data sets. MegaGO is high performing: Each set can contain thousands of GO terms, and results are calculated in a matter of seconds. MegaGO is available as a web application at https://megago.ugent.be and is installable via pip as a standalone command line tool and reusable software library. All code is open source under the MIT license and is available at https://github.com/MEGA-GO/.


Subjects
Microbiota , Software , Computational Biology , Gene Ontology , Metagenomics , Semantics
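
Illustrative sketch for the MegaGO abstract above, which relies on semantic similarity between GO terms rather than exact term matching. The code below shows one common way to do this: Lin similarity between terms combined with a best-match average between term sets. Which measure MegaGO actually uses is not stated in the abstract, so this choice is an assumption, and the tiny ontology, the term names and the term frequencies are invented for the example.

```python
import math

# Hypothetical toy ontology: term -> parents, and term -> annotation frequency.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "transport": ["root"],
    "lipid_metabolism": ["metabolism"],
    "sugar_metabolism": ["metabolism"],
}
FREQ = {"root": 1.0, "metabolism": 0.5, "transport": 0.5,
        "lipid_metabolism": 0.25, "sugar_metabolism": 0.25}


def ancestors(term):
    """The term itself plus every term reachable via parent links."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    return math.log(1.0 / FREQ[term])


def lin(a, b):
    """Lin similarity: 2 * IC(most informative common ancestor) / (IC(a) + IC(b))."""
    shared = ancestors(a) & ancestors(b)
    mica_ic = max(information_content(t) for t in shared)
    denom = information_content(a) + information_content(b)
    return 2 * mica_ic / denom if denom > 0 else 1.0


def best_match_average(set_a, set_b):
    """Average, over all terms in both sets, of each term's best match in the other set."""
    a_to_b = [max(lin(a, b) for b in set_b) for a in set_a]
    b_to_a = [max(lin(a, b) for a in set_a) for b in set_b]
    return (sum(a_to_b) + sum(b_to_a)) / (len(a_to_b) + len(b_to_a))


print(best_match_average(["lipid_metabolism"], ["sugar_metabolism"]))
```

Two sibling terms that never match exactly still receive a non-zero score through their shared parent, which is the property that makes such measures useful for deeply branched ontologies.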
14.
Bioinformatics ; 36(1): 81-89, 2020 01 01.
Article in English | MEDLINE | ID: mdl-31298694

ABSTRACT

MOTIVATION: We expect novel pathogens to arise due to their fast-paced evolution, and new species to be discovered thanks to advances in DNA sequencing and metagenomics. Moreover, recent developments in synthetic biology raise concerns that some strains of bacteria could be modified for malicious purposes. Traditional approaches to open-view pathogen detection depend on databases of known organisms, which limits their performance on unknown, unrecognized and unmapped sequences. In contrast, machine learning methods can infer pathogenic phenotypes from single NGS reads, even though the biological context is unavailable. RESULTS: We present DeePaC, a Deep Learning Approach to Pathogenicity Classification. It includes a flexible framework allowing easy evaluation of neural architectures with reverse-complement parameter sharing. We show that convolutional neural networks and LSTMs outperform the state-of-the-art based on both sequence homology and machine learning. Combining a deep learning approach with integrating the predictions for both mates in a read pair results in cutting the error rate almost in half in comparison to the previous state-of-the-art. AVAILABILITY AND IMPLEMENTATION: The code and the models are available at: https://gitlab.com/rki_bioinformatics/DeePaC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computer Neural Networks , DNA , Deep Learning , DNA Sequence Analysis
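
Illustrative sketch for the DeePaC abstract above, which mentions reverse-complement parameter sharing and integrating the predictions for both mates of a read pair. The code below shows a simpler, related idea under stated assumptions: a score is made strand-independent by averaging it over a read and its reverse complement, and the two mates are combined by averaging. The scoring function is a hypothetical stand-in for a trained network, and averaging is assumed here rather than taken from the abstract.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read):
    """Reverse complement of a DNA read written with A, C, G, T."""
    return read.translate(_COMPLEMENT)[::-1]


def toy_score(read):
    """Hypothetical stand-in for a trained model: fraction of 'A' bases."""
    return read.count("A") / len(read)


def strand_invariant_score(read, score=toy_score):
    """Average over both strands so the read's orientation does not matter."""
    return 0.5 * (score(read) + score(reverse_complement(read)))


def pair_score(mate1, mate2, score=toy_score):
    """Combine a read pair by averaging the two per-mate scores."""
    return 0.5 * (strand_invariant_score(mate1, score)
                  + strand_invariant_score(mate2, score))


print(strand_invariant_score("AACG"))  # 0.25
print(strand_invariant_score("CGTT"))  # 0.25, the same read seen from the other strand
print(pair_score("AAAA", "CCCC"))      # 0.25
```

Parameter sharing inside the network layers achieves the same invariance without running the model twice; output averaging is shown only because it fits in a few lines.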
15.
Bioinformatics ; 36(Suppl_1): i12-i20, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657362

ABSTRACT

MOTIVATION: The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus far, and even though many can theoretically handle large amounts of references, time/memory requirements are prohibitive in practice. As a result, many studies that require sequence classification use often outdated and almost never truly up-to-date indices. RESULTS: Motivated by those limitations, we created ganon, a k-mer-based read classification tool that uses Interleaved Bloom Filters in conjunction with a taxonomic clustering and a k-mer counting/filtering scheme. Ganon provides an efficient method for indexing references, keeping them updated. It requires <55 min to index the complete RefSeq of bacteria, archaea, fungi and viruses. The tool can further keep these indices up-to-date in a fraction of the time necessary to create them. Ganon makes it possible to query against very large reference sets and therefore it classifies significantly more reads and identifies more species than similar methods. When classifying a high-complexity CAMI challenge dataset against complete genomes from RefSeq, ganon shows strongly increased precision with equal or better sensitivity compared with state-of-the-art tools. With the same dataset against the complete RefSeq, ganon improved the F1-score by 65% at the genus level. It supports taxonomy- and assembly-level classification, multiple indices and hierarchical classification. AVAILABILITY AND IMPLEMENTATION: The software is open-source and available at: https://gitlab.com/rki_bioinformatics/ganon. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Metagenomics , Archaea , DNA Sequence Analysis , Software
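
Illustrative sketch for the ganon abstract above, which builds on Interleaved Bloom Filters combined with k-mer counting. The code below is a toy version of the data structure, not ganon's implementation: one Bloom filter per reference bin, stored so that a single hash position yields one bit per bin. The class and function names, the filter sizes and the choice of BLAKE2 as hash function are assumptions made for the example.

```python
import hashlib


class InterleavedBloomFilter:
    """Toy interleaved Bloom filter: rows[p] is an integer whose bit b is
    set when some k-mer inserted into bin b hashed to position p."""

    def __init__(self, n_bins, n_rows, n_hashes=2):
        self.n_bins = n_bins
        self.n_rows = n_rows
        self.n_hashes = n_hashes
        self.rows = [0] * n_rows

    def _positions(self, kmer):
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(kmer.encode(), digest_size=8,
                                     salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.n_rows

    def insert(self, kmer, bin_id):
        for pos in self._positions(kmer):
            self.rows[pos] |= 1 << bin_id

    def query(self, kmer):
        """Bitmask of bins that may contain kmer (false positives possible,
        false negatives not)."""
        mask = (1 << self.n_bins) - 1
        for pos in self._positions(kmer):
            mask &= self.rows[pos]
        return mask


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_hits(ibf, read, k):
    """Number of the read's k-mers reported as present, per bin."""
    counts = [0] * ibf.n_bins
    for kmer in kmers(read, k):
        mask = ibf.query(kmer)
        for b in range(ibf.n_bins):
            if (mask >> b) & 1:
                counts[b] += 1
    return counts


# Hypothetical references, one per bin.
references = ["ACGTACGTGGTTAACC", "TTGGCCAATTGGCCAA"]
ibf = InterleavedBloomFilter(n_bins=len(references), n_rows=1024)
for bin_id, ref in enumerate(references):
    for kmer in kmers(ref, 4):
        ibf.insert(kmer, bin_id)

print(count_hits(ibf, "GTACGTGG", 4))  # bin 0 matches all 5 k-mers of the read
```

A single pass over the read's k-mers answers the membership question for all bins at once, and a read is then assigned to bins whose hit count passes a threshold, which is where a k-mer counting/filtering scheme comes in.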
16.
Mol Cell Proteomics ; 18(9): 1756-1771, 2019 09.
Article in English | MEDLINE | ID: mdl-31221721

ABSTRACT

Epithelial-mesenchymal transition (EMT) is driven by complex signaling events that induce dramatic biochemical and morphological changes whereby epithelial cells are converted into cancer cells. However, the underlying molecular mechanisms remain elusive. Here, we used a mass spectrometry-based quantitative proteomics approach to systematically analyze the post-translational biochemical changes that drive differentiation of human mammary epithelial (HMLE) cells into mesenchymal cells. We identified 314 proteins out of more than 6,000 unique proteins and 871 phosphopeptides out of more than 7,000 unique phosphopeptides as differentially regulated. We found that the phosphoproteome is more unstable and prone to changes during EMT compared with the proteome, and that multiple alterations at the proteome level are not thoroughly represented by transcriptional data, highlighting the necessity of proteome-level analysis. We discovered cell state-specific signaling pathways, such as Hippo, sphingolipid signaling, and unfolded protein response (UPR), by modeling the networks of regulated proteins and potential kinase-substrate groups. We identified two novel factors for EMT whose expression increased on EMT induction: DnaJ heat shock protein family (Hsp40) member B4 (DNAJB4) and cluster of differentiation 81 (CD81). Suppression of DNAJB4 or CD81 in mesenchymal breast cancer cells resulted in decreased cell migration in vitro and led to reduced primary tumor growth, extravasation, and lung metastasis in vivo. Overall, we performed the global proteomic and phosphoproteomic analyses of EMT, and identified and validated new mRNA and/or protein level modulators of EMT. This work also provides a unique platform and resource for future studies focusing on metastasis and drug resistance.


Subjects
Breast Neoplasms/pathology , Epithelial-Mesenchymal Transition/physiology , HSP40 Heat-Shock Proteins/metabolism , Phosphoproteins/metabolism , Tetraspanin 28/metabolism , Animals , Breast Neoplasms/metabolism , Breast Neoplasms/mortality , Tumor Cell Line , Cell Movement/physiology , Epithelial-Mesenchymal Transition/genetics , Female , HSP40 Heat-Shock Proteins/genetics , Humans , Kaplan-Meier Estimate , Experimental Mammary Neoplasms/pathology , Nude Mice , Reproducibility of Results , Tetraspanin 28/genetics
17.
Euro Surveill ; 26(2)2021 Jan.
Article in English | MEDLINE | ID: mdl-33446303

ABSTRACT

INTRODUCTION: Improving the surveillance of tuberculosis (TB) is especially important for multidrug-resistant (MDR) and extensively drug-resistant (XDR) TB. The large amount of publicly available whole genome sequencing (WGS) data for TB gives us the chance to re-use data and to perform additional analyses at a large scale. AIM: We assessed the usefulness of raw WGS data of global MDR/XDR Mycobacterium tuberculosis isolates available from public repositories to improve TB surveillance. METHODS: We extracted raw WGS data and the related metadata of M. tuberculosis isolates available from the Sequence Read Archive. We compared this public dataset with WGS data and metadata of 131 MDR- and XDR M. tuberculosis isolates from Germany in 2012 and 2013. RESULTS: We aggregated a dataset that included 1,081 MDR and 250 XDR isolates among which we identified 133 molecular clusters. In 16 clusters, the isolates were from at least two different countries. For example, Cluster 2 included 56 MDR/XDR isolates from Moldova, Georgia and Germany. When comparing the WGS data from Germany with the public dataset, we found that 11 clusters contained at least one isolate from Germany and at least one isolate from another country. We could, therefore, connect TB cases despite missing epidemiological information. CONCLUSION: We demonstrated the added value of using WGS raw data from public repositories to contribute to TB surveillance. Comparing the German with the public dataset, we identified potential international transmission events. Thus, using this approach might support the interpretation of national surveillance results in an international context.


Subjects
Mycobacterium tuberculosis , Multidrug-Resistant Tuberculosis , Antitubercular Agents/therapeutic use , Multiple Bacterial Drug Resistance/genetics , Georgia , Germany/epidemiology , Humans , Mycobacterium tuberculosis/genetics , Multidrug-Resistant Tuberculosis/diagnosis , Multidrug-Resistant Tuberculosis/drug therapy , Multidrug-Resistant Tuberculosis/epidemiology , Whole Genome Sequencing
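
Illustrative sketch for the TB surveillance abstract above, which groups isolates into molecular clusters and then looks for clusters spanning more than one country. The abstract does not state how clusters were defined, so the code below assumes one widely used approach: single-linkage clustering on pairwise SNP distance with a fixed cut-off. The isolate names, sequences, country prefixes and the cut-off value are all hypothetical.

```python
from itertools import combinations


def snp_distance(a, b):
    """Number of differing positions between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def single_linkage_clusters(isolates, max_snps):
    """Group isolates connected by chains of pairs within max_snps SNPs.

    Returns sorted clusters with at least two members.
    """
    parent = {name: name for name in isolates}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(isolates, 2):
        if snp_distance(isolates[a], isolates[b]) <= max_snps:
            parent[find(a)] = find(b)

    groups = {}
    for name in isolates:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical isolates; the prefix before "-" stands for the country of origin.
isolates = {
    "DE-1": "ACGTACGT",
    "MD-1": "ACGTACGA",
    "GE-1": "ACGTACCA",
    "DE-2": "TTTTACGT",
}
clusters = single_linkage_clusters(isolates, max_snps=1)
print(clusters)  # [['DE-1', 'GE-1', 'MD-1']]

# Clusters whose members come from at least two countries.
international = [c for c in clusters if len({n.split("-")[0] for n in c}) >= 2]
print(international)  # [['DE-1', 'GE-1', 'MD-1']]
```

Single linkage joins DE-1 and GE-1 through MD-1 even though they differ by two SNPs directly, which illustrates how a cluster can connect cases without any epidemiological link being recorded.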
18.
J Proteome Res ; 19(6): 2501-2510, 2020 06 05.
Article in English | MEDLINE | ID: mdl-32362126

ABSTRACT

Untargeted accurate strain-level classification of a priori unidentified organisms using tandem mass spectrometry is a challenging task. Reference databases often lack taxonomic depth, limiting peptide assignments to the species level. However, the extension with detailed strain information increases runtime and decreases statistical power. In addition, larger databases contain a higher number of similar proteomes. We present TaxIt, an iterative workflow to address the increasing search space required for MS/MS-based strain-level classification of samples with unknown taxonomic origin. TaxIt first applies reference sequence data for initial identification of species candidates, followed by automated acquisition of relevant strain sequences for low-level classification. Furthermore, proteome similarities resulting in ambiguous taxonomic assignments are addressed with an abundance weighting strategy to increase the confidence in candidate taxa. To benchmark the performance of our method, we apply our iterative workflow to several samples of bacterial and viral origin. In comparison to noniterative approaches using unique peptides or advanced abundance correction, TaxIt identifies microbial strains correctly in all examples presented (with one tie), thereby demonstrating the potential for untargeted and deeper taxonomic classification. TaxIt makes extensive use of public, unrestricted, and continuously growing sequence resources such as the NCBI databases and is available under an open-source BSD license at https://gitlab.com/rki_bioinformatics/TaxIt.


Subjects
Proteomics , Tandem Mass Spectrometry , Databases, Protein , Peptides , Proteome , Software
19.
J Proteome Res ; 19(8): 3562-3566, 2020 08 07.
Article in English | MEDLINE | ID: mdl-32431147

ABSTRACT

Although metaproteomics, the study of the collective proteome of microbial communities, has become increasingly powerful and popular over the past few years, the field has lagged behind in the availability of user-friendly, end-to-end pipelines for data analysis. We therefore describe the connection of two commonly used metaproteomics data processing tools, MetaProteomeAnalyzer and PeptideShaker, to Unipept for downstream analysis. Through these connections, direct end-to-end pipelines are built from database searching to taxonomic and functional annotation.


Subjects
Data Analysis , Microbiota , Proteome , Proteomics , Software
20.
Brief Bioinform ; 19(5): 954-970, 2018 09 28.
Article in English | MEDLINE | ID: mdl-28369237

ABSTRACT

While peptide identifications in mass spectrometry (MS)-based shotgun proteomics are mostly obtained using database search methods, high-resolution spectrum data from modern MS instruments now offer the prospect of improving the performance of computational de novo peptide sequencing. The major benefit of de novo sequencing is that it does not require a reference database to deduce full-length or partial tag-based peptide sequences directly from experimental tandem mass spectrometry spectra. Although various algorithms have been developed for automated de novo sequencing, the prediction accuracy of proposed solutions has rarely been evaluated in independent benchmarking studies. The main objective of this work is to provide a detailed evaluation of the performance of de novo sequencing algorithms on high-resolution data. For this purpose, we processed four experimental data sets acquired on different instrument types in collision-induced dissociation and higher-energy collisional dissociation (HCD) fragmentation modes using the software packages Novor, PEAKS and PepNovo. Moreover, the accuracy of these algorithms is also tested on ground truth data based on simulated spectra generated from peak intensity prediction software. We found that Novor shows the overall best performance compared with PEAKS and PepNovo with respect to the accuracy of correct full peptide, tag-based and single-residue predictions. In addition, Novor outpaced the commercial competitor PEAKS in running time, with speedups by factors of around 12-17. Despite around 35% prediction accuracy for complete peptide sequences on HCD data sets, taken as a whole, the evaluated algorithms perform moderately on experimental data but show a significantly better performance on simulated data (up to 84% accuracy). Further, we describe the most frequently occurring de novo sequencing errors and evaluate the influence of missing fragment ion peaks and spectral noise on the accuracy. Finally, we discuss the potential of de novo sequencing to become more widely used in the field.
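The distinction between full-peptide and single-residue accuracy can be illustrated with a minimal sketch. It is a simplification under stated assumptions and not the evaluation procedure of the study: residues are compared position by position as plain characters, whereas a real benchmark may be more lenient, for example toward leucine and isoleucine, which have identical mass. The function names and the example peptides are hypothetical.

```python
def peptide_accuracy(pairs):
    """Fraction of predictions identical to their reference peptide."""
    if not pairs:
        return 0.0
    return sum(pred == ref for pred, ref in pairs) / len(pairs)


def residue_accuracy(pairs):
    """Fraction of reference residues reproduced at the same position.

    Assumption: strict position-wise character comparison, with no mass
    tolerance and no special handling of isobaric residues.
    """
    matched = 0
    total = 0
    for pred, ref in pairs:
        total += len(ref)
        matched += sum(p == r for p, r in zip(pred, ref))
    return matched / total if total else 0.0


# Invented (predicted, reference) pairs for illustration only.
pairs = [
    ("PEPTIDE", "PEPTIDE"),  # fully correct
    ("PEPTLDE", "PEPTIDE"),  # one wrong residue
    ("ACDK", "ACDEK"),       # one residue missing
]

print(f"peptide accuracy: {peptide_accuracy(pairs):.3f}")
print(f"residue accuracy: {residue_accuracy(pairs):.3f}")
```

The example shows why the two measures diverge: a single wrong or missing residue makes the whole peptide count as incorrect, while most of its residues still count as correct.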


Subjects
Algorithms , Proteomics/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Animals , Computational Biology/methods , Computer Simulation , Databases, Protein/statistics & numerical data , Humans , Mice , Peptides/chemistry , Proteomics/statistics & numerical data , Pyrococcus furiosus/genetics , Saccharomyces cerevisiae/genetics , Sequence Analysis, Protein/statistics & numerical data , Sequence Tagged Sites , Software , Tandem Mass Spectrometry/methods , Tandem Mass Spectrometry/statistics & numerical data