1 - 20 of 61
1.
bioRxiv ; 2024 Mar 25.
Article En | MEDLINE | ID: mdl-38585907

The biological process of RNA translation is fundamental to cellular life and has wide-ranging implications for human disease. Yet, accurately delineating the variation in RNA translation represents a significant challenge. Here, we develop RiboTIE, a transformer model-based approach to map global RNA translation. We find that RiboTIE offers unparalleled precision and sensitivity for ribosome profiling data. Application of RiboTIE to normal brain and medulloblastoma cancer samples enables high-resolution insights into disease regulation of RNA translation.

2.
NAR Genom Bioinform ; 5(1): lqad021, 2023 Mar.
Article En | MEDLINE | ID: mdl-36879896

The correct mapping of the proteome is an important step towards advancing our understanding of biological systems and cellular mechanisms. Methods that provide better mappings can fuel important processes such as drug discovery and disease understanding. Currently, true determination of translation initiation sites is primarily achieved by in vivo experiments. Here, we propose TIS Transformer, a deep learning model for the determination of translation start sites solely utilizing the information embedded in the transcript nucleotide sequence. The method is built upon deep learning techniques first designed for natural language processing. We prove this approach to be best suited for learning the semantics of translation, outperforming previous approaches by a large margin. We demonstrate that limitations in the model performance are primarily due to the presence of low-quality annotations against which the model is evaluated. Advantages of the method are its ability to detect key features of the translation process and multiple coding sequences on a transcript. These include micropeptides encoded by short Open Reading Frames, either alongside a canonical coding sequence or within long non-coding RNAs. To demonstrate the use of our methods, we applied TIS Transformer to remap the full human proteome.

5.
Bioinformatics ; 38(3): 597-603, 2022 01 12.
Article En | MEDLINE | ID: mdl-34718418

MOTIVATION: The adoption of current single-cell DNA methylation sequencing protocols is hindered by incomplete coverage, outlining the need for effective imputation techniques. The task of imputing single-cell (methylation) data requires models to build an understanding of underlying biological processes. RESULTS: We adapt the transformer neural network architecture to operate on methylation matrices through combining axial attention with sliding window self-attention. The obtained CpG Transformer displays state-of-the-art performances on a wide range of scBS-seq and scRRBS-seq datasets. Furthermore, we demonstrate the interpretability of CpG Transformer and illustrate its rapid transfer learning properties, allowing practitioners to train models on new datasets with a limited computational and time budget. AVAILABILITY AND IMPLEMENTATION: CpG Transformer is freely available at https://github.com/gdewael/cpg-transformer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


DNA Methylation , Epigenome , Base Sequence , Sequence Analysis, DNA/methods , Neural Networks, Computer
6.
Front Genet ; 12: 728900, 2021.
Article En | MEDLINE | ID: mdl-34759956

Transcriptome and ribosome sequencing have revealed the existence of many non-canonical transcripts, mainly containing splice variants, ncRNA, sORFs and altORFs. However, identification and characterization of products that may be translated out of these remains a challenge. Addressing this, we here report on 552 non-canonical proteins and splice variants in the model organism C. elegans using tandem mass spectrometry. Aided by sequencing-based prediction, we generated a custom proteome database tailored to search for non-canonical translation products of C. elegans. Using this database, we mined available mass spectrometric resources of C. elegans, from which 51 novel, non-canonical proteins could be identified. Furthermore, we utilized diverse proteomic and peptidomic strategies to detect 40 novel non-canonical proteins in C. elegans by LC-TIMS-MS/MS, of which 6 were common with our meta-analysis of existing resources. Together, this permits us to provide a resource with detailed annotation of 467 splice variants and 85 novel proteins mapped onto UTRs, non-coding regions and alternative open reading frames of the C. elegans genome.

7.
Front Cell Dev Biol ; 9: 720570, 2021.
Article En | MEDLINE | ID: mdl-34604223

Bioactive peptides exhibit key roles in a wide variety of complex processes, such as regulation of body weight, learning, aging, and innate immune response. Next to the classical bioactive peptides, emerging from larger precursor proteins by specific proteolytic processing, a new class of peptides originating from small open reading frames (sORFs) have been recognized as important biological regulators. But their intrinsic properties, specific expression pattern and location on presumed non-coding regions have hindered the full characterization of the repertoire of bioactive peptides, despite their predominant role in various pathways. Although the development of peptidomics has offered the opportunity to study these peptides in vivo, it remains challenging to identify the full peptidome as the lack of cleavage enzyme specification and large search space complicates conventional database search approaches. In this study, we introduce a proteogenomics methodology using a new type of mass spectrometry instrument and the implementation of machine learning tools toward improved identification of potential bioactive peptides in the mouse brain. The application of trapped ion mobility spectrometry (tims) coupled to a time-of-flight mass analyzer (TOF) offers improved sensitivity, an enhanced peptide coverage, reduction in chemical noise and the reduced occurrence of chimeric spectra. Subsequent machine learning tools MS2PIP, predicting fragment ion intensities and DeepLC, predicting retention times, improve the database searching based on a large and comprehensive custom database containing both sORFs and alternative ORFs. Finally, the identification of peptides is further enhanced by applying the post-processing semi-supervised learning tool Percolator. 
Applying this workflow, the first peptidomics workflow combined with spectral intensity and retention time predictions, we identified a total of 167 predicted sORF-encoded peptides, of which 48 originate from presumed non-coding locations, next to 401 peptides from known neuropeptide precursors, linked to 66 annotated bioactive neuropeptides from within 22 different families. Additional PEAKS analysis expanded the pool of SEPs on presumed non-coding locations to 84, while an additional 204 peptides completed the list of peptides from neuropeptide precursors. Altogether, this study provides insights into a new robust pipeline that fuses technological advancements from different fields, ensuring an improved coverage of the neuropeptidome in the mouse brain.

8.
Mol Cell Proteomics ; 20: 100076, 2021.
Article En | MEDLINE | ID: mdl-33823297

Proteogenomics approaches often struggle with the distinction between true and false peptide-to-spectrum matches as the database size enlarges. However, features extracted from tandem mass spectrometry intensity predictors can enhance the peptide identification rate and can provide extra confidence for peptide-to-spectrum matching in a proteogenomics context. To that end, features from the spectral intensity pattern predictors MS2PIP and Prosit were combined with the canonical scores from MaxQuant in the Percolator postprocessing tool for protein sequence databases constructed out of ribosome profiling and nanopore RNA-Seq analyses. The presented results provide evidence that this approach enhances both the identification rate as well as the validation stringency in a proteogenomic setting.


Proteogenomics/methods , Databases, Protein , HCT116 Cells , Humans , Machine Learning , RNA-Seq , Ribosomes
9.
Brief Bioinform ; 22(5)2021 09 02.
Article En | MEDLINE | ID: mdl-33834200

The effectiveness of deep learning methods can be largely attributed to the automated extraction of relevant features from raw data. In the field of functional genomics, this generally concerns the automatic selection of relevant nucleotide motifs from DNA sequences. To benefit from automated learning methods, new strategies are required that unveil the decision-making process of trained models. In this paper, we present a new approach that has been successful in gathering insights on the transcription process in Escherichia coli. This work builds upon a transformer-based neural network framework designed for prokaryotic genome annotation purposes. We find that the majority of subunits (attention heads) of the model are specialized towards identifying transcription factors and are able to successfully characterize both their binding sites and consensus sequences, uncovering both well-known and potentially novel elements involved in the initiation of the transcription process. With the specialization of the attention heads occurring automatically, we believe transformer models to be of high interest towards the creation of explainable neural networks in this field.


Deep Learning , Escherichia coli/genetics , Genome, Bacterial , Genomics/methods , Transcription Initiation Site , Base Sequence , Binding Sites , DNA, Bacterial/genetics , DNA, Bacterial/metabolism , Escherichia coli/metabolism , Promoter Regions, Genetic/genetics , Transcription Factors/genetics , Transcription Factors/metabolism
10.
Nat Cancer ; 2(6): 611-628, 2021 06.
Article En | MEDLINE | ID: mdl-35121941

Post-transcriptional modifications of RNA constitute an emerging regulatory layer of gene expression. The demethylase fat mass- and obesity-associated protein (FTO), an eraser of N6-methyladenosine (m6A), has been shown to play a role in cancer, but its contribution to tumor progression and the underlying mechanisms remain unclear. Here, we report widespread FTO downregulation in epithelial cancers associated with increased invasion, metastasis and worse clinical outcome. Both in vitro and in vivo, FTO silencing promotes cancer growth, cell motility and invasion. In human-derived tumor xenografts (PDXs), FTO pharmacological inhibition favors tumorigenesis. Mechanistically, we demonstrate that FTO depletion elicits an epithelial-to-mesenchymal transition (EMT) program through increased m6A and altered 3'-end processing of key mRNAs along the Wnt signaling cascade. Accordingly, FTO knockdown acts via EMT to sensitize mouse xenografts to Wnt inhibition. We thus identify FTO as a key regulator, across epithelial cancers, of Wnt-triggered EMT and tumor progression and reveal a therapeutically exploitable vulnerability of FTO-low tumors.


Neoplasms, Glandular and Epithelial , RNA , Alpha-Ketoglutarate-Dependent Dioxygenase FTO/genetics , Animals , Down-Regulation/genetics , Epithelial-Mesenchymal Transition/genetics , Humans , Mice
11.
Nat Commun ; 11(1): 4956, 2020 10 02.
Article En | MEDLINE | ID: mdl-33009383

Tet-enzyme-mediated 5-hydroxymethylation of cytosines in DNA plays a crucial role in mouse embryonic stem cells (ESCs). In RNA also, 5-hydroxymethylcytosine (5hmC) has recently been evidenced, but its physiological roles are still largely unknown. Here we show the contribution and function of this mark in mouse ESCs and differentiating embryoid bodies. Transcriptome-wide mapping in ESCs reveals hundreds of messenger RNAs marked by 5hmC at sites characterized by a defined unique consensus sequence and particular features. During differentiation a large number of transcripts, including many encoding key pluripotency-related factors (such as Eed and Jarid2), show decreased cytosine hydroxymethylation. Using Tet-knockout ESCs, we find Tet enzymes to be partly responsible for deposition of 5hmC in mRNA. A transcriptome-wide search further reveals mRNA targets to which Tet1 and Tet2 bind, at sites showing a topology similar to that of 5hmC sites. Tet-mediated RNA hydroxymethylation is found to reduce the stability of crucial pluripotency-promoting transcripts. We propose that RNA cytosine 5-hydroxymethylation by Tets is a mark of transcriptome flexibility, inextricably linked to the balance between pluripotency and lineage commitment.


5-Methylcytosine/analogs & derivatives , Cell Differentiation , DNA-Binding Proteins/metabolism , Mouse Embryonic Stem Cells/cytology , Mouse Embryonic Stem Cells/metabolism , Proto-Oncogene Proteins/metabolism , RNA/metabolism , 5-Methylcytosine/metabolism , Animals , Antibody Specificity/immunology , Base Sequence , Dioxygenases , Embryoid Bodies/metabolism , Mice , Models, Biological , Pluripotent Stem Cells/metabolism , Protein Binding , RNA Stability/genetics , RNA, Messenger/genetics , RNA, Messenger/metabolism , Transcriptome/genetics
12.
Exp Cell Res ; 391(1): 111923, 2020 06 01.
Article En | MEDLINE | ID: mdl-32135166

Growing evidence illustrates the shortcomings in the current understanding of the full complexity of the proteome. Previously overlooked small open reading frames (sORFs) and their encoded microproteins have filled important gaps, exerting their function as biologically relevant regulators. The characterization of the full small proteome has potential applications in many fields. Continuous development of techniques and tools has led to improved sORF discovery, where these can originate from bioinformatics analyses, from sequencing routines or from proteomics approaches. In this mini review, we discuss the ongoing trends in the three fields and suggest some strategies for further characterization of high-potential candidates.


Computational Biology/statistics & numerical data , Neural Networks, Computer , Open Reading Frames , Protein Biosynthesis , Proteome/genetics , Ribosomes/genetics , Animals , Computational Biology/methods , High-Throughput Nucleotide Sequencing , Humans , Plants/genetics , Protein Sorting Signals/genetics , Proteome/classification , Proteome/metabolism , Ribosomes/classification , Ribosomes/metabolism , Software
13.
Nat Commun ; 11(1): 1312, 2020 03 11.
Article En | MEDLINE | ID: mdl-32161263

The emergence of small open reading frame (sORF)-encoded peptides (SEPs) is rapidly expanding the known proteome at the lower end of the size distribution. Here, we show that the mitochondrial proteome, particularly the respiratory chain, is enriched for small proteins. Using a prediction and validation pipeline for SEPs, we report the discovery of 16 endogenous nuclear encoded, mitochondrial-localized SEPs (mito-SEPs). Through functional prediction, proteomics, metabolomics and metabolic flux modeling, we demonstrate that BRAWNIN, a 71 a.a. peptide encoded by C12orf73, is essential for respiratory chain complex III (CIII) assembly. In human cells, BRAWNIN is induced by the energy-sensing AMPK pathway, and its depletion impairs mitochondrial ATP production. In zebrafish, Brawnin deletion causes complete CIII loss, resulting in severe growth retardation, lactic acidosis and early death. Our findings demonstrate that BRAWNIN is essential for vertebrate oxidative phosphorylation. We propose that mito-SEPs are an untapped resource for essential regulators of oxidative metabolism.


Electron Transport Complex III/metabolism , Mitochondria/metabolism , Mitochondrial Proteins/metabolism , Oxidative Phosphorylation , Peptides/metabolism , Zebrafish Proteins/metabolism , Acidosis, Lactic/genetics , Animals , Animals, Genetically Modified , Disease Models, Animal , Female , Gene Knockdown Techniques , Growth Disorders/genetics , Humans , Male , Metabolomics , Mitochondrial Proteins/genetics , Models, Animal , Models, Biological , Open Reading Frames/genetics , Peptides/genetics , Proteomics , Zebrafish/genetics , Zebrafish/growth & development , Zebrafish Proteins/genetics
14.
Neurosci Res ; 151: 31-37, 2020 Feb.
Article En | MEDLINE | ID: mdl-30862443

Brain-derived peptides function as signaling molecules in the brain and regulate various physiological and behavioral processes. The low abundance and atypical fragmentation of these brain-derived peptides make detection using traditional proteomic methods challenging. In this study, we introduce and validate a new methodology for the discovery of novel peptides derived from mammalian brain. This methodology combines ribosome profiling and mass spectrometry-based peptidomics. Using this framework, we have identified a novel peptide in mouse whole brain whose expression is highest in the basal ganglia, hypothalamus and amygdala. Although its functional role is unknown, it has been previously detected in peripheral tissue as a component of the mRNA decapping complex. Continued discovery and studies of novel regulating peptides in mammalian brain may also provide insight into brain disorders.


Neuropeptides/isolation & purification , Proteomics/methods , Animals , Brain/metabolism , Male , Mass Spectrometry , Mice , Mice, Inbred C57BL , Neuropeptides/analysis , Peptides , Ribosomes , Sequence Analysis, Protein
15.
PLoS One ; 14(9): e0215185, 2019.
Article En | MEDLINE | ID: mdl-31545805

Neuropeptides are a class of bioactive peptides shown to be involved in various physiological processes, including metabolism, development, and reproduction. Although neuropeptide candidates have been predicted from genomic and transcriptomic data, comprehensive characterization of neuropeptide repertoires remains a challenge owing to their small size and variable sequences. De novo prediction of neuropeptides from genome or transcriptome data is difficult and usually only efficient for those peptides that have identified orthologs in other animal species. Recent peptidomics technology has enabled systematic structural identification of neuropeptides by using the combination of liquid chromatography and tandem mass spectrometry. However, reliable identification of naturally occurring peptides using a conventional tandem mass spectrometry approach, scanning spectra against a protein database, remains difficult because a large search space must be scanned due to the absence of a cleavage enzyme specification. We developed a pipeline consisting of in silico prediction of candidate neuropeptides followed by peptide-spectrum matching. This approach enables highly sensitive and reliable neuropeptide identification, as the search space for peptide-spectrum matching is highly reduced. Nematostella vectensis is a basal eumetazoan with one of the most ancient nervous systems. We scanned the Nematostella protein database for sequences displaying structural hallmarks typical of eumetazoan neuropeptide precursors, including amino- and carboxyterminal motifs and associated modifications. Peptide-spectrum matching was performed against a dataset of peptides that are cleaved in silico from these putative peptide precursors. 
The dozens of newly identified neuropeptides display structural similarities to bilaterian neuropeptides including tachykinin, myoinhibitory peptide, and neuromedin-U/pyrokinin, suggesting these neuropeptides occurred in the eumetazoan ancestor of all animal species.
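
The in silico cleavage step described in this abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that precursors are cut at runs of two or more basic residues (K/R) and that a C-terminal glycine marks amidation; these rules, the function name `cleave_precursor` and the example sequence are hypothetical and are not taken from the authors' pipeline.

```python
import re

# Assumed cleavage rule: cut at runs of two or more basic residues (K or R).
DIBASIC = re.compile(r"[KR]{2,}")

def cleave_precursor(precursor, min_len=3):
    """Split a precursor at dibasic sites and return candidate peptides."""
    peptides = []
    for fragment in DIBASIC.split(precursor):
        if len(fragment) < min_len:
            continue  # drop fragments too short to be useful candidates
        # Assumed rule: a C-terminal glycine is converted to an amide.
        if fragment.endswith("G"):
            fragment = fragment[:-1] + "-NH2"
        peptides.append(fragment)
    return peptides

# Made-up precursor sequence, not a Nematostella protein.
candidates = cleave_precursor("MSLAQFGKRAPWLGRRTTEFSQKKDE")
print(candidates)  # ['MSLAQF-NH2', 'APWL-NH2', 'TTEFSQ']
```

Matching spectra only against such a candidate list, rather than against every possible subsequence of every protein, is what shrinks the search space.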


Evolution, Molecular , Neuropeptides/genetics , Sea Anemones/chemistry , Sea Anemones/genetics , Tandem Mass Spectrometry , Amino Acid Sequence , Animals , Computational Biology/methods , Conserved Sequence , Databases, Genetic , Gene Expression , Neuropeptides/chemistry , Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization
16.
Genes (Basel) ; 10(9)2019 09 05.
Article En | MEDLINE | ID: mdl-31492022

The increasing availability of high throughput proteomics data provides us with opportunities as well as posing new ethical challenges regarding data privacy and re-identifiability of participants. Moreover, the fact that proteomics represents a level between the genotype and the phenotype further exacerbates the situation, introducing dilemmas related to publicly available data, anonymization, ownership of information and incidental findings. In this paper, we try to differentiate proteomics from genomics data and cover the ethical challenges related to proteomics data sharing. Finally, we give an overview of the proposed solutions and the outlook for future studies.


Genetic Privacy/standards , Precision Medicine/ethics , Proteomics/ethics , Humans , Informed Consent/standards , Precision Medicine/methods , Proteomics/methods
17.
Article En | MEDLINE | ID: mdl-31238262

On average a human cell type expresses around 10,000 different protein coding genes synthesizing all the different molecular forms of the protein product (proteoforms) found in a cell. In a typical shotgun bottom-up proteomic approach, the proteins are enzymatically cleaved, producing several hundred thousand different peptides that are analyzed with liquid chromatography-tandem mass spectrometry (LC-MS/MS). One of the major consequences of this high sample complexity is that coelution of peptides cannot be avoided. Moreover, low abundant peptides are difficult to identify as they have a lower chance of being selected for fragmentation due to ion-suppression effects and the semi-stochastic nature of the precursor selection in data-dependent shotgun proteomic analysis, where peptides are selected for fragmentation analysis one-by-one as they elute from the column. In the current study we explore a simple novel approach that has the potential to counter some of the effects of peptide coelution and to improve the number of peptide identifications in a bottom-up proteomic analysis. In this method, peptides from a HeLa cell digest were eluted from the reversed phase column using three different elution solvents (acetonitrile, methanol and acetone) in three replicate reversed phase LC-MS/MS shotgun proteomic analyses. Results were compared with three technical replicates using the same solvent, which is common practice in proteomic analysis. In total, we see an increase of up to 10% in unique protein and up to 30% in unique peptide identifications from the combined analysis using different elution solvents when compared to the combined identifications from the three replicates of the same solvent. In addition, the overlap of unique peptide identifications common in all three LC-MS analyses in our approach is only 23% compared to 50% in the replicates using the same solvent. 
The method presented here thus provides an easy to implement method to significantly reduce the effects of coelution and ion suppression of peptides and improve protein coverage in shotgun proteomics. Data are available via ProteomeXchange with identifier PXD011908.
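
The overlap figures quoted in this abstract come down to set arithmetic over the peptide identifications of each run. The sketch below assumes the overlap is the number of peptides found in all runs divided by the number found in any run; the abstract does not state the denominator, so that definition, the function name `identification_overlap` and the toy peptide labels are assumptions.

```python
def identification_overlap(runs):
    """Return (union size, shared size, shared fraction) for a non-empty list of peptide sets."""
    union = set().union(*runs)        # peptides identified in at least one run
    shared = set.intersection(*runs)  # peptides identified in every run
    fraction = len(shared) / len(union) if union else 0.0
    return len(union), len(shared), fraction

# Toy identifications, invented for illustration only.
acetonitrile = {"PEPTIDEA", "PEPTIDEB", "PEPTIDEC"}
methanol = {"PEPTIDEA", "PEPTIDED"}
acetone = {"PEPTIDEA", "PEPTIDEB", "PEPTIDEE"}

total, shared, fraction = identification_overlap([acetonitrile, methanol, acetone])
print(total, shared, fraction)  # 5 1 0.2
```

A lower shared fraction at a larger union size indicates that the runs complement each other rather than re-identifying the same peptides.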


Chromatography, Liquid/methods , Proteome/chemistry , Proteomics/methods , Tandem Mass Spectrometry/methods , HeLa Cells , Humans , Peptides/chemistry
18.
Mol Cell Proteomics ; 18(8 suppl 1): S126-S140, 2019 08 09.
Article En | MEDLINE | ID: mdl-31040227

PROTEOFORMER is a pipeline that enables the automated processing of data derived from ribosome profiling (RIBO-seq, i.e. the sequencing of ribosome-protected mRNA fragments). As such, genome-wide ribosome occupancies lead to the delineation of data-specific translation product candidates and these can improve the mass spectrometry-based identification. Since its first publication, different upgrades, new features and extensions have been added to the PROTEOFORMER pipeline. Some of the most important upgrades include P-site offset calculation during mapping, comprehensive data pre-exploration, the introduction of two alternative proteoform calling strategies and extended pipeline output features. These novelties are illustrated by analyzing ribosome profiling data of human HCT116 and Jurkat cells. The different proteoform calling strategies are used alongside one another and in the end combined together with reference sequences from UniProt. Matching mass spectrometry data are searched against this extended search space with MaxQuant. Overall, besides annotated proteoforms, this pipeline leads to the identification and validation of different categories of new proteoforms, including translation products of up- and downstream open reading frames, 5' and 3' extended and truncated proteoforms, single amino acid variants, splice variants and translation products of so-called noncoding regions. Further, proof-of-concept is reported for the improvement of spectrum matching by including Prosit, a deep neural network strategy that adds extra fragmentation spectrum intensity features to the analysis. In the light of ribosome profiling-driven proteogenomics, it is shown that this allows validating the spectrum matches of newly identified proteoforms with elevated stringency. These updates and novel conclusions provide new insights and lessons for the ribosome profiling-based proteogenomic research field. 
More practical information on the pipeline, raw code, the user manual (README) and explanations on the different modes of availability can be found at the GitHub repository of PROTEOFORMER: https://github.com/Biobix/proteoformer.


Proteogenomics/methods , Ribosomes/metabolism , Chromatography, Liquid , HCT116 Cells , Humans , Jurkat Cells , Tandem Mass Spectrometry
19.
J Proteome Res ; 18(6): 2686-2692, 2019 06 07.
Article En | MEDLINE | ID: mdl-31081335

Mass-spectrometry-based proteomics enables the high-throughput identification and quantification of proteins, including sequence variants and post-translational modifications (PTMs) in biological samples. However, most workflows require that such variations be included in the search space used to analyze the data, and doing so remains challenging with most analysis tools. In order to facilitate the search for known sequence variants and PTMs, the Proteomics Standards Initiative (PSI) has designed and implemented the PSI extended FASTA format (PEFF). PEFF is based on the very popular FASTA format but adds a uniform mechanism for encoding substantially more metadata about the sequence collection as well as individual entries, including support for encoding known sequence variants, PTMs, and proteoforms. The format is very nearly backward compatible, and as such, existing FASTA parsers will require little or no changes to be able to read PEFF files as FASTA files, although without supporting any of the extra capabilities of PEFF. PEFF is defined by a full specification document, controlled vocabulary terms, a set of example files, software libraries, and a file validator. Popular software and resources are starting to support PEFF, including the sequence search engine Comet and the knowledge bases neXtProt and UniProtKB. Widespread implementation of PEFF is expected to further enable proteogenomics and top-down proteomics applications by providing a standardized mechanism for encoding protein sequences and their known variations. All the related documentation, including the detailed file format specification and example files, is available at http://www.psidev.info/peff.
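
The backward-compatibility claim can be made concrete with a minimal sketch of a plain FASTA reader that tolerates PEFF input. It assumes that PEFF file-level metadata sits in lines starting with `#` and that per-entry metadata appears in the description line as backslash-prefixed `Key=value` pairs; both points should be checked against the specification, and the function name `read_fasta`, the accessions and the sequences below are invented for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, ignoring blank lines and '#' lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumed PEFF file header block: skipped by a plain FASTA reader
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical PEFF-style input; keys and values are illustrative only.
example = [
    "# PEFF 1.0",
    ">db:ENTRY1 \\PName=Example protein \\Length=10",
    "MKTAYIAKQR",
    ">db:ENTRY2 \\PName=Another example",
    "MSLAQ",
    "FGKR",
]
records = list(read_fasta(example))
accession = records[0][0].split(" \\")[0]
print(accession, records[0][1])  # db:ENTRY1 MKTAYIAKQR
```

The reader treats everything after the accession as an opaque description, which is why it reads the file correctly while ignoring the variant and PTM annotations.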


Proteomics/standards , Humans , Information Storage and Retrieval , Mass Spectrometry , Software
20.
Nucleic Acids Res ; 47(6): e36, 2019 04 08.
Article En | MEDLINE | ID: mdl-30753697

Annotation of gene expression in prokaryotes often finds itself corrected due to small variations of the annotated gene regions observed between different (sub)-species. It has become apparent that traditional sequence alignment algorithms, used for the curation of genomes, are not able to map the full complexity of the genomic landscape. We present DeepRibo, a novel neural network utilizing features extracted from ribosome profiling information and binding site sequence patterns that proves to be a precise tool for the delineation and annotation of expressed genes in prokaryotes. The neural network combines recurrent memory cells and convolutional layers, adapting the information gained from both the high-throughput ribosome profiling data and the ribosome binding translation initiation sequence region into one model. DeepRibo is designed as a single model trained on a variety of ribosome profiling experiments, used for the identification of open reading frames in prokaryotes without a priori knowledge of the translational landscape. Through extensive validation of the model trained on various sets of data, multiple species sequence similarity, mass spectrometry and Edman degradation verified proteins, the effectiveness of DeepRibo is highlighted.


Algorithms , Molecular Sequence Annotation/methods , Prokaryotic Cells/metabolism , Protein Biosynthesis/physiology , Ribosomes/metabolism , Binding Sites , Computational Biology/methods , Datasets as Topic , High-Throughput Screening Assays/methods , Neural Networks, Computer , Open Reading Frames , Prokaryotic Cells/chemistry , Protein Processing, Post-Translational , Sequence Alignment/methods , Signal Transduction
...