1.
Nat Biotechnol ; 2024 Apr 15.
Article in English | MEDLINE | ID: mdl-38622344

ABSTRACT

Citizen science video games are designed primarily for users already inclined to contribute to science, which severely limits their accessibility for an estimated community of 3 billion gamers worldwide. We created Borderlands Science (BLS), a citizen science activity that is seamlessly integrated within a popular commercial video game played by tens of millions of gamers. This integration is facilitated by a novel game-first design of citizen science games, in which the game design aspect has the highest priority, and a suitable task is then mapped to the game design. BLS crowdsources a multiple alignment task of 1 million 16S ribosomal RNA sequences obtained from human microbiome studies. Since its initial release on 7 April 2020, over 4 million players have solved more than 135 million science puzzles, a task unsolvable by a single individual. Leveraging these results, we show that our multiple sequence alignment simultaneously improves microbial phylogeny estimations and UniFrac effect sizes compared to state-of-the-art computational methods. This achievement demonstrates that hyper-gamified scientific tasks attract massive crowds of contributors and offer invaluable resources to the scientific community.

2.
iScience ; 27(2): 109002, 2024 Feb 16.
Article in English | MEDLINE | ID: mdl-38362268

ABSTRACT

This study focuses on enhancing the prediction of regulatory functional sites in DNA and RNA sequences, a crucial aspect of gene regulation. Current methods, such as motif overrepresentation and machine learning, often lack specificity. To address this issue, the study leverages evolutionary information and introduces Graphylo, a deep-learning approach for predicting transcription factor binding sites in the human genome. Graphylo combines Convolutional Neural Networks for DNA sequences with Graph Convolutional Networks on phylogenetic trees, using information from placental mammals' genomes and evolutionary history. The research demonstrates that Graphylo consistently outperforms both single-species deep learning techniques and methods that incorporate inter-species conservation scores on a wide range of datasets. It achieves this by utilizing a species-based attention model for evolutionary insights and an integrated gradient approach for nucleotide-level model interpretability. This innovative approach offers a promising avenue for improving the accuracy of regulatory site prediction in genomics.

3.
Bioinformatics ; 40(2)2024 02 01.
Article in English | MEDLINE | ID: mdl-38291894

ABSTRACT

MOTIVATION: Up to 75% of the human genome encodes RNAs. The function of many non-coding RNAs relies on their ability to fold into 3D structures. Specifically, nucleotides inside secondary structure loops form non-canonical base pairs that help stabilize complex local 3D structures. These RNA 3D motifs can promote specific interactions with other molecules or serve as catalytic sites. RESULTS: We introduce PERFUMES, a computational pipeline to identify 3D motifs that can be associated with observable features. Given a set of RNA sequences with associated binary experimental measurements, PERFUMES searches for RNA 3D motifs using BayesPairing2 and extracts those that are over-represented in the set of positive sequences. It also conducts a thermodynamics analysis of the structural context that can support the interpretation of the predictions. We illustrate PERFUMES' usage on the SNRPA protein binding site, for which the tool retrieved both previously known binder motifs and new ones. AVAILABILITY AND IMPLEMENTATION: PERFUMES is an open-source Python package (https://jwgitlab.cs.mcgill.ca/arnaud_chol/perfumes).


Subject(s)
Perfume , Humans , Nucleic Acid Conformation , Nucleotide Motifs , Base Pairing , RNA/chemistry
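
The over-representation step mentioned in the PERFUMES abstract above can be illustrated with a small enrichment test. This is a minimal sketch, not the PERFUMES implementation: it assumes a one-sided hypergeometric test, and the names `hypergeom_sf` and `motif_enrichment` are hypothetical helpers introduced here for illustration.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """Return P(X >= k) for a hypergeometric draw.

    N: total number of sequences
    K: sequences that contain the motif
    n: sequences with a positive measurement
    k: positive sequences that contain the motif
    """
    upper = min(K, n)
    total = comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible terms contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def motif_enrichment(has_motif, is_positive):
    """One-sided p-value for a motif being over-represented among positives.

    Both arguments are equal-length lists of booleans, one entry per sequence
    (an assumed input layout, not the PERFUMES file format).
    """
    N = len(has_motif)
    K = sum(has_motif)
    n = sum(is_positive)
    k = sum(1 for m, p in zip(has_motif, is_positive) if m and p)
    return hypergeom_sf(k, N, K, n)
```

A small p-value suggests the motif occurs among positive sequences more often than chance alone would explain; in practice a multiple-testing correction across motifs would also be needed.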
4.
bioRxiv ; 2023 Nov 28.
Article in English | MEDLINE | ID: mdl-38116029

ABSTRACT

Polycomb Repressive Complex 2 (PRC2)-mediated histone H3K27 tri-methylation (H3K27me3) recruits canonical PRC1 (cPRC1) to maintain heterochromatin. In early development, polycomb-regulated genes are connected through long-range 3D interactions which resolve upon differentiation. Here, we report that polycomb looping is controlled by H3K27me3 spreading and regulates target gene silencing and cell fate specification. Using glioma-derived H3 Lys-27-Met (H3K27M) mutations as tools to restrict H3K27me3 deposition, we show that H3K27me3 confinement concentrates the chromatin pool of cPRC1, resulting in heightened 3D interactions mirroring chromatin architecture of pluripotency, and stringent gene repression that maintains cells in progenitor states to facilitate tumor development. Conversely, H3K27me3 spread in pluripotent stem cells, following neural differentiation or loss of the H3K36 methyltransferase NSD1, dilutes cPRC1 concentration and dissolves polycomb loops. These results identify the regulatory principles and disease implications of polycomb looping and nominate histone modification-guided distribution of reader complexes as an important mechanism for nuclear compartment organization. Highlights: The confinement of H3K27me3 at PRC2 nucleation sites without its spreading correlates with increased 3D chromatin interactions. The H3K27M oncohistone concentrates canonical PRC1 that anchors chromatin loop interactions in gliomas, silencing developmental programs. Stem and progenitor cells require factors promoting H3K27me3 confinement, including H3K36me2, to maintain cPRC1 loop architecture. The cPRC1-H3K27me3 interaction is a targetable driver of aberrant self-renewal in tumor cells.

5.
Bioinformatics ; 39(39 Suppl 1): i386-i393, 2023 06 30.
Article in English | MEDLINE | ID: mdl-37387127

ABSTRACT

MOTIVATION: Accurately assessing contacts between DNA fragments inside the nucleus with Hi-C experiments is crucial for understanding the role of 3D genome organization in gene regulation. This task is challenging, in part because Hi-C libraries require high sequencing depth to support high-resolution analyses. Most existing Hi-C data are collected with limited sequencing coverage, leading to poor chromatin interaction frequency estimation. Current computational approaches to enhance Hi-C signals focus on the analysis of individual Hi-C datasets of interest, without taking advantage of the facts that (i) several hundred Hi-C contact maps are publicly available and (ii) the vast majority of local spatial organizations are conserved across multiple cell types. RESULTS: Here, we present RefHiC-SR, an attention-based deep learning framework that uses a reference panel of Hi-C datasets to facilitate the enhancement of Hi-C data resolution of a given study sample. We compare RefHiC-SR against tools that do not use reference samples and find that RefHiC-SR outperforms other programs across different cell types and sequencing depths. It also enables high-accuracy mapping of structures such as loops and topologically associating domains. AVAILABILITY AND IMPLEMENTATION: https://github.com/BlanchetteLab/RefHiC.


Subject(s)
Cell Nucleus , Libraries , Chromatin/genetics
6.
Front Bioinform ; 3: 1285828, 2023.
Article in English | MEDLINE | ID: mdl-38455089

ABSTRACT

Hi-C is one of the most widely used approaches to study three-dimensional genome conformations. Contacts captured by a Hi-C experiment are represented in a contact frequency matrix. Due to the limited sequencing depth and other factors, Hi-C contact frequency matrices are only approximations of the true interaction frequencies and are further reported without any quantification of uncertainty. Hence, downstream analyses based on Hi-C contact maps (e.g., TAD and loop annotation) are themselves point estimations. Here, we present the Hi-C interaction frequency sampler (HiCSampler) that reliably infers the posterior distribution of the interaction frequency for a given Hi-C contact map by exploiting dependencies between neighboring loci. Posterior predictive checks demonstrate that HiCSampler can infer highly predictive chromosomal interaction frequency. Summary statistics calculated by HiCSampler provide a measurement of the uncertainty for Hi-C experiments, and samples inferred by HiCSampler are ready for use by most downstream analysis tools off the shelf and permit uncertainty measurements in these analyses without modifications.

7.
Nat Genet ; 54(12): 1865-1880, 2022 12.
Article in English | MEDLINE | ID: mdl-36471070

ABSTRACT

Canonical (H3.1/H3.2) and noncanonical (H3.3) histone 3 K27M-mutant gliomas have unique spatiotemporal distributions, partner alterations and molecular profiles. The contribution of the cell of origin to these differences has been challenging to uncouple from the oncogenic reprogramming induced by the mutation. Here, we perform an integrated analysis of 116 tumors, including single-cell transcriptome and chromatin accessibility, 3D chromatin architecture and epigenomic profiles, and show that K27M-mutant gliomas faithfully maintain chromatin configuration at developmental genes consistent with anatomically distinct oligodendrocyte precursor cells (OPCs). H3.3K27M thalamic gliomas map to prosomere 2-derived lineages. In turn, H3.1K27M ACVR1-mutant pontine gliomas uniformly mirror early ventral NKX6-1+/SHH-dependent brainstem OPCs, whereas H3.3K27M gliomas frequently resemble dorsal PAX3+/BMP-dependent progenitors. Our data suggest a context-specific vulnerability in H3.1K27M-mutant SHH-dependent ventral OPCs, which rely on acquisition of ACVR1 mutations to drive aberrant BMP signaling required for oncogenesis. The unifying action of K27M mutations is to restrict H3K27me3 at PRC2 landing sites, whereas other epigenetic changes are mainly contingent on the cell of origin chromatin state and cycling rate.


Subject(s)
Chromatin , Epigenomics , Cell Lineage/genetics , Brain
8.
BMC Cancer ; 22(1): 1297, 2022 Dec 12.
Article in English | MEDLINE | ID: mdl-36503484

ABSTRACT

BACKGROUND: Juvenile Pilocytic Astrocytomas (JPAs) are one of the most common pediatric brain tumors, and they are driven by aberrant activation of the mitogen-activated protein kinase (MAPK) signaling pathway. RAF-fusions are the most common genetic alterations identified in JPAs, with the prototypical KIAA1549-BRAF fusion leading to loss of BRAF's auto-inhibitory domain and subsequent constitutive kinase activation. JPAs are highly vascular and show pervasive immune infiltration, which can lead to low tumor cell purity in clinical samples. This can result in gene fusions that are difficult to detect with conventional omics approaches including RNA-Seq. METHODS: To this effect, we applied RNA-Seq as well as linked-read whole-genome sequencing and in situ Hi-C as new approaches to detect and characterize low-frequency gene fusions at the genomic, transcriptomic and spatial level. RESULTS: Integration of these datasets allowed the identification and detailed characterization of two novel BRAF fusion partners, PTPRZ1 and TOP2B, in addition to the canonical fusion with partner KIAA1549. Additionally, our Hi-C datasets enabled investigations of 3D genome architecture in JPAs which showed a high level of correlation in 3D compartment annotations between JPAs compared to other pediatric tumors, and high similarity to normal adult astrocytes. We detected interactions between BRAF and its fusion partners exclusively in tumor samples containing BRAF fusions. CONCLUSIONS: We demonstrate the power of integrating multi-omic datasets to identify low frequency fusions and characterize the JPA genome at high resolution. We suggest that linked-reads and Hi-C could be used in clinic for the detection and characterization of JPAs.


Subject(s)
Astrocytoma , Brain Neoplasms , Child , Adult , Humans , Multiomics , Proto-Oncogene Proteins B-raf/genetics , Oncogene Proteins, Fusion/genetics , Astrocytoma/pathology , Brain Neoplasms/pathology , Receptor-Like Protein Tyrosine Phosphatases, Class 5
9.
Nat Commun ; 13(1): 7426, 2022 Dec 02.
Article in English | MEDLINE | ID: mdl-36460680

ABSTRACT

Accurately annotating topological structures (e.g., loops and topologically associating domains) from Hi-C data is critical for understanding the role of 3D genome organization in gene regulation. This is a challenging task, especially at high resolution, in part due to the limited sequencing coverage of Hi-C data. Current approaches focus on the analysis of individual Hi-C data sets of interest, without taking advantage of the facts that (i) several hundred Hi-C contact maps are publicly available, and (ii) the vast majority of topological structures are conserved across multiple cell types. Here, we present RefHiC, an attention-based deep learning framework that uses a reference panel of Hi-C datasets to facilitate topological structure annotation from a given study sample. We compare RefHiC against tools that do not use reference samples and find that RefHiC outperforms other programs at both topologically associating domain and loop annotation across different cell types, species, and sequencing depths.

10.
Cell Rep ; 40(3): 111121, 2022 07 19.
Article in English | MEDLINE | ID: mdl-35858561

ABSTRACT

Leishmania are eukaryotic parasites that have retained the ability to produce extracellular vesicles (EVs) through evolution. To date, it has been unclear if different DNA entities could be associated with Leishmania EVs and whether these could constitute a mechanism of horizontal gene transfer (HGT). Herein, we investigate the DNA content of EVs derived from drug-resistant parasites, as well as the EVs' potential to act as shuttles for DNA transfer. Next-generation sequencing and PCR assays confirm the enrichment of amplicons carrying drug-resistance genes associated with EVs. Transfer assays of drug-resistant EVs highlight a significant impact on the phenotype of recipient parasites induced by the expression of the transferred DNA. Recipient parasites display an enhanced growth and better control of oxidative stress. We provide evidence that eukaryotic EVs function as efficient mediators in HGT, thereby facilitating the transmission of drug-resistance genes and increasing the fitness of cells when encountering stressful environments.


Subject(s)
Extracellular Vesicles , Leishmania , Parasites , Animals , Drug Resistance/genetics , Eukaryota , Extracellular Vesicles/metabolism , Leishmania/genetics , Leishmania/metabolism
11.
Bioinformatics ; 38(Suppl 1): i299-i306, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758792

ABSTRACT

MOTIVATION: The computational prediction of regulatory function associated with a genomic sequence is of the utmost importance in -omics studies, as it facilitates our understanding of the mechanisms underpinning the vast gene regulatory network. Prominent examples in this area include the binding prediction of transcription factors in DNA regulatory regions, and predicting RNA-protein interaction in the context of post-transcriptional gene expression. However, existing computational methods have suffered from high false-positive rates and have seldom used any evolutionary information, despite the vast amount of available orthologous data across multitudes of extant and ancestral genomes, which readily present an opportunity to improve the accuracy of existing computational methods. RESULTS: In this study, we present a novel probabilistic approach called PhyloPGM that leverages previously trained TFBS or RNA-RBP binding predictors by aggregating their predictions from various orthologous regions, in order to boost the overall prediction accuracy on human sequences. Throughout our experiments, PhyloPGM has shown significant improvement over baselines such as the sequence-based RNA-RBP binding predictor RNATracker and the sequence-based TFBS predictor FactorNet. PhyloPGM is simple in principle and easy to implement, yet yields impressive results. AVAILABILITY AND IMPLEMENTATION: The PhyloPGM package is available at https://github.com/BlanchetteLab/PhyloPGM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genomics , Regulatory Sequences, Nucleic Acid , DNA , Genomics/methods , Humans , RNA , Sequence Analysis, DNA/methods
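
The general idea in the PhyloPGM abstract above, combining a trained predictor's outputs across orthologous regions, can be sketched as follows. This is only an illustration under stated assumptions: it uses a weighted average of log-odds, which is a deliberately simple stand-in and not the probabilistic model PhyloPGM implements. The function names, the per-species score dictionary, and the optional weights are all hypothetical.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of a probability, clipped away from 0 and 1 for stability."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def aggregate_ortholog_scores(scores, weights=None):
    """Combine per-species binding probabilities into one score.

    scores:  dict mapping a species name to the predictor's probability
             for the orthologous region in that species (assumed layout).
    weights: optional dict of non-negative per-species weights; defaults
             to equal weights. Not derived from a phylogeny here.
    """
    if not scores:
        raise ValueError("at least one orthologous score is required")
    if weights is None:
        weights = {species: 1.0 for species in scores}
    total_weight = sum(weights[species] for species in scores)
    mean_logit = sum(
        weights[species] * logit(p) for species, p in scores.items()
    ) / total_weight
    # Map the averaged log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-mean_logit))
```

Averaging in log-odds space lets confident predictions in several species reinforce each other, while one discordant species is damped rather than decisive.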
12.
Genetics ; 221(3)2022 07 04.
Article in English | MEDLINE | ID: mdl-35552404

ABSTRACT

Sequences derived from the Long INterspersed Element-1 (L1) family of retrotransposons occupy at least 17% of the human genome, with 67 distinct subfamilies representing successive waves of expansion and extinction in mammalian lineages. L1s contribute extensively to gene regulation, but their molecular history is difficult to trace, because most are present only as truncated and highly mutated fossils. Consequently, L1 entries in current databases of repeat sequences are composed mainly of short diagnostic subsequences, rather than full functional progenitor sequences for each subfamily. Here, we have coupled 2 levels of sequence reconstruction (at the level of whole genomes and L1 subfamilies) to reconstruct progenitor sequences for all human L1 subfamilies that are more functionally and phylogenetically plausible than existing models. Most of the reconstructed sequences are at or near the canonical length of L1s and encode uninterrupted ORFs with expected protein domains. We also show that the presence or absence of binding sites for KRAB-C2H2 Zinc Finger Proteins, even in ancient-reconstructed progenitor L1s, mirrors binding observed in human ChIP-exo experiments, thus extending the arms race and domestication model. RepeatMasker searches of the modern human genome suggest that the new models may be able to assign subfamily resolution identities to previously ambiguous L1 instances. The reconstructed L1 sequences will be useful for genome annotation and functional study of both L1 evolution and L1 contributions to host regulatory networks.


Subject(s)
Long Interspersed Nucleotide Elements , Retroelements , Animals , Evolution, Molecular , Genome, Human , Humans , Mammals/genetics , Open Reading Frames , Phylogeny , Repetitive Sequences, Nucleic Acid , Retroelements/genetics
13.
Front Physiol ; 12: 683651, 2021.
Article in English | MEDLINE | ID: mdl-34381375

ABSTRACT

BACKGROUND: Angiopoietin-1 (Ang-1) is the main ligand of Tie-2 receptors. It promotes endothelial cell (EC) survival, migration, and differentiation. Little is known about the transcription factors (TFs) in ECs that are downstream from Tie-2 receptors. OBJECTIVE: The main objective of this study is to identify the roles of the ETS family of TFs in Ang-1 signaling and the angiogenic response. METHODS: In silico enrichment analyses that were designed to predict TF binding sites of the promoters of eighty-six Ang-1-upregulated genes showed significant enrichment of ETS1, ELK1, and ETV4 binding sites in ECs. Human umbilical vein endothelial cells (HUVECs) were exposed for different time periods to recombinant Ang-1 protein and mRNA levels of ETS1, ELK1, and ETV4 were measured with qPCR and intracellular localization of these transcription factors was assessed with immunofluorescence. Electrophoretic mobility shift assays and reporter assays were used to assess activation of ETS1, ELK1, and ETV4 in response to Ang-1 exposure. The functional roles of these TFs in Ang-1-induced endothelial cell survival, migration, differentiation, and gene regulation were evaluated by using a loss-of-function approach (transfection with siRNA oligos). RESULTS: Ang-1 exposure increased ETS1 mRNA levels but had no effect on ELK1 or ETV4 levels. Immunostaining revealed that in control ECs, ETS1 has nuclear localization whereas ELK1 and ETV4 are localized to the nucleus and the cytosol. Ang-1 exposure increased nuclear intensity of ETS1 protein and enhanced nuclear mobilization of ELK1 and ETV4. Selective siRNA knockdown of ETS1, ELK1, and ETV4 showed that these TFs are required for Ang-1-induced EC survival and differentiation of cells, while ETS1 and ETV4 are required for Ang-1-induced EC migration. Moreover, ETS1, ELK1, and ETV4 knockdown inhibited Ang-1-induced upregulation of thirteen, eight, and nine pro-angiogenesis genes, respectively. CONCLUSION: We conclude that ETS1, ELK1, and ETV4 transcription factors play significant angiogenic roles in Ang-1 signaling in ECs.

15.
Methods Mol Biol ; 2157: 127-157, 2021.
Article in English | MEDLINE | ID: mdl-32820402

ABSTRACT

Chromatin immunoprecipitation (ChIP) is used to probe the presence of proteins and/or their posttranslational modifications on genomic DNA. This method is often used alongside chromosome conformation capture approaches to obtain a better-rounded view of the functional relationship between chromatin architecture and its landscape. Since the inception of ChIP, its protocol has been modified to improve speed, sensitivity, and specificity. Combining ChIP with deep sequencing has recently improved its throughput and made genome-wide profiling possible. However, genome-wide analysis is not always the best option, particularly when many samples are required to study a given genomic region or when quantitative data is desired. We recently developed carbon copy-ChIP (2C-ChIP), a new form of the high-throughput ChIP analysis method ideally suited for these types of studies. 2C-ChIP applies ligation-mediated amplification (LMA) followed by deep sequencing to quantitatively detect specified genomic regions in ChIP samples. Here, we describe the generation of 2C-ChIP libraries and computational processing of the resulting sequencing data.


Subject(s)
Chromatin/metabolism , High-Throughput Nucleotide Sequencing/methods , Animals , Chromatin Immunoprecipitation , Epigenomics/methods , Humans , Protein Processing, Post-Translational , Sequence Analysis, DNA
16.
Bioinformatics ; 36(Suppl_2): i895-i902, 2020 12 30.
Article in English | MEDLINE | ID: mdl-33381838

ABSTRACT

MOTIVATION: The ability to develop robust machine-learning (ML) models is considered imperative to the adoption of ML techniques in biology and medicine. This challenge is particularly acute when data available for training is not independent and identically distributed (iid), in which case trained models are vulnerable to out-of-distribution generalization problems. Of particular interest are problems where data correspond to observations made on phylogenetically related samples (e.g. antibiotic resistance data). RESULTS: We introduce DendroNet, a new approach to train neural networks in the context of evolutionary data. DendroNet explicitly accounts for the relatedness of the training/testing data, while allowing the model to evolve along the branches of the phylogenetic tree, hence accommodating potential changes in the rules that relate genotypes to phenotypes. Using simulated data, we demonstrate that DendroNet produces models that can be significantly better than non-phylogenetically aware approaches. DendroNet also outperforms other approaches at two biological tasks of significant practical importance: antibiotic resistance prediction in bacteria and trophic level prediction in fungi. AVAILABILITY AND IMPLEMENTATION: https://github.com/BlanchetteLab/DendroNet.


Subject(s)
Machine Learning , Neural Networks, Computer , Phylogeny , Supervised Machine Learning
17.
Cell ; 183(6): 1617-1633.e22, 2020 12 10.
Article in English | MEDLINE | ID: mdl-33259802

ABSTRACT

Histone H3.3 glycine 34 to arginine/valine (G34R/V) mutations drive deadly gliomas and show exquisite regional and temporal specificity, suggesting a developmental context permissive to their effects. Here we show that 50% of G34R/V tumors (n = 95) bear activating PDGFRA mutations that display strong selection pressure at recurrence. Although considered gliomas, G34R/V tumors actually arise in GSX2/DLX-expressing interneuron progenitors, where G34R/V mutations impair neuronal differentiation. The lineage of origin may facilitate PDGFRA co-option through a chromatin loop connecting PDGFRA to GSX2 regulatory elements, promoting PDGFRA overexpression and mutation. At the single-cell level, G34R/V tumors harbor dual neuronal/astroglial identity and lack oligodendroglial programs, actively repressed by GSX2/DLX-mediated cell fate specification. G34R/V may become dispensable for tumor maintenance, whereas mutant-PDGFRA is potently oncogenic. Collectively, our results open novel research avenues in deadly tumors. G34R/V gliomas are neuronal malignancies where interneuron progenitors are stalled in differentiation by G34R/V mutations and malignant gliogenesis is promoted by co-option of a potentially targetable pathway, PDGFRA signaling.


Subject(s)
Brain Neoplasms/genetics , Carcinogenesis/genetics , Glioma/genetics , Histones/genetics , Interneurons/metabolism , Mutation/genetics , Neural Stem Cells/metabolism , Receptor, Platelet-Derived Growth Factor alpha/genetics , Animals , Astrocytes/metabolism , Astrocytes/pathology , Brain Neoplasms/pathology , Carcinogenesis/pathology , Cell Lineage , Cellular Reprogramming/genetics , Chromatin/metabolism , Embryo, Mammalian/metabolism , Epigenesis, Genetic , Gene Expression Regulation, Neoplastic , Gene Silencing , Glioma/pathology , Histones/metabolism , Lysine/metabolism , Mice, Inbred C57BL , Models, Biological , Neoplasm Grading , Oligodendroglia/metabolism , Promoter Regions, Genetic/genetics , Prosencephalon/embryology , Receptor, Platelet-Derived Growth Factor alpha/metabolism , Transcription, Genetic , Transcriptome/genetics
18.
Bioinformatics ; 36(Suppl_1): i353-i361, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657367

ABSTRACT

MOTIVATION: Accurate probabilistic models of sequence evolution are essential for a wide variety of bioinformatics tasks, including sequence alignment and phylogenetic inference. The ability to realistically simulate sequence evolution is also at the core of many benchmarking strategies. Yet, mutational processes have complex context dependencies that remain poorly modeled and understood. RESULTS: We introduce EvoLSTM, a recurrent neural network-based evolution simulator that captures mutational context dependencies. EvoLSTM uses a sequence-to-sequence long short-term memory model trained to predict mutation probabilities at each position of a given sequence, taking into consideration the 14 flanking nucleotides. EvoLSTM can realistically simulate mammalian and plant DNA sequence evolution and reveals unexpectedly strong long-range context dependencies in mutation probabilities. EvoLSTM brings modern machine-learning approaches to bear on sequence evolution. It will serve as a useful tool to study and simulate complex mutational processes. AVAILABILITY AND IMPLEMENTATION: Code and dataset are available at https://github.com/DongjoonLim/EvoLSTM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Machine Learning , Neural Networks, Computer , Benchmarking , Phylogeny , Sequence Alignment , Software
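
The EvoLSTM abstract above describes predicting mutation probabilities at each position from its flanking nucleotides. The sketch below shows only how such a context window might be extracted from a sequence. It is not taken from the EvoLSTM code: the helper name `flanking_context` is hypothetical, and splitting the 14 flanking nucleotides as 7 on each side, with `N` padding at the sequence ends, is an assumption made for illustration.

```python
def flanking_context(seq, pos, flank=7, pad="N"):
    """Return (left flank, centre base, right flank) for position pos.

    flank=7 gives 14 flanking nucleotides in total, assuming a symmetric
    window; positions near the ends are padded with the pad character.
    """
    if not 0 <= pos < len(seq):
        raise IndexError("pos is outside the sequence")
    padded = pad * flank + seq + pad * flank
    centre = pos + flank  # index of the same base in the padded string
    left = padded[centre - flank:centre]
    right = padded[centre + 1:centre + flank + 1]
    return left, padded[centre], right


# Example: the window around the first base of a short sequence.
left, base, right = flanking_context("ACGTACGTACGT", 0)
```

Each (left, base, right) triple would then be encoded and passed to a model; the encoding and the model itself are outside the scope of this sketch.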
19.
Bioinformatics ; 36(Suppl_1): i276-i284, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657407

ABSTRACT

MOTIVATION: RNA-protein interactions are key effectors of post-transcriptional regulation. Significant experimental and bioinformatics efforts have been expended on characterizing protein binding mechanisms on the molecular level, and on highlighting the sequence and structural traits of RNA that impact the binding specificity for different proteins. Yet our ability to predict these interactions in silico remains relatively poor. RESULTS: In this study, we introduce RPI-Net, a graph neural network approach for RNA-protein interaction prediction. RPI-Net learns and exploits a graph representation of RNA molecules, yielding significant performance gains over existing state-of-the-art approaches. We also introduce an approach to rectify an important type of sequence bias caused by the RNase T1 enzyme used in many CLIP-Seq experiments, and we show that correcting this bias is essential in order to learn meaningful predictors and properly evaluate their accuracy. Finally, we provide new approaches to interpret the trained models and extract simple, biologically interpretable representations of the learned sequence and structural motifs. AVAILABILITY AND IMPLEMENTATION: Source code can be accessed at https://www.github.com/HarveyYan/RNAonGraph. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Neural Networks, Computer , RNA , Protein Binding , Protein Structure, Secondary , RNA/metabolism , Software
20.
BMC Res Notes ; 13(1): 273, 2020 Jun 03.
Article in English | MEDLINE | ID: mdl-32493406

ABSTRACT

OBJECTIVE: Ligation-Mediated Amplification (LMA) is a versatile biochemical tool for amplifying selected DNA sequences. LMA has increased in popularity due to its integration within chromosome conformation capture (5C) and chromatin immunoprecipitation (2C-ChIP) methodologies. The output of either 5C or 2C-ChIP protocols is a single-read sequencing library of ligated primer pairs that may or may not be multiplexed. While many computational tools currently exist for read mapping and analysis, these tools neither fully support multiplexed libraries nor provide qualitative reporting on the LMA primers involved. Typically, the task of library demultiplexing or primer analysis is offloaded on to the user. Our aim was to develop an easy-to-use pipeline for processing (multiplexed) single-read sequencing data produced by sequence-specific LMA. RESULTS: Here, we describe the Ligation-mediated Amplified, Multiplexed Primer-pair Sequence (LAMPS) analysis pipeline. LAMPS facilitates the analysis of multiplexed LMA sequencing data and provides a thorough assessment of a library's reads for a variety of experimental parameters (e.g., primer-pair efficiency). The standardized output of LAMPS allows for easy integration with downstream analyses, such as data track visualization on a genome browser. LAMPS is made publicly available on GitHub: https://github.com/BlanchetteLab/LAMPS.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Nucleic Acid Amplification Techniques/methods , Sequence Analysis, DNA/methods , Chromatin Immunoprecipitation , Gene Library , Humans , Multiplex Polymerase Chain Reaction/methods , Quality Control
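
The demultiplexing and primer-pair counting task described in the LAMPS abstract above can be sketched in a few lines. This is not the LAMPS pipeline or its input format: the read layout (sample barcode, then forward primer, then reverse primer, matched exactly as prefixes) and every name below are assumptions made purely for illustration.

```python
from collections import Counter


def count_primer_pairs(reads, barcodes, fwd_primers, rev_primers):
    """Assign reads to samples and count ligated primer pairs per sample.

    reads:       iterable of read sequences (strings)
    barcodes:    dict mapping barcode sequence -> sample name
    fwd_primers: dict mapping forward primer name -> sequence
    rev_primers: dict mapping reverse primer name -> sequence
    Returns (per-sample Counter of (fwd, rev) pairs, number of unassigned reads).
    """
    counts = {sample: Counter() for sample in barcodes.values()}
    unassigned = 0
    for read in reads:
        # Step 1: demultiplex on an exact barcode prefix.
        match = next(
            ((bc, s) for bc, s in barcodes.items() if read.startswith(bc)), None
        )
        if match is None:
            unassigned += 1
            continue
        bc, sample = match
        insert = read[len(bc):]
        # Step 2: identify the forward primer, then the reverse primer after it.
        fwd = next((n for n, s in fwd_primers.items() if insert.startswith(s)), None)
        if fwd is None:
            unassigned += 1
            continue
        rest = insert[len(fwd_primers[fwd]):]
        rev = next((n for n, s in rev_primers.items() if rest.startswith(s)), None)
        if rev is None:
            unassigned += 1
            continue
        counts[sample][(fwd, rev)] += 1
    return counts, unassigned
```

Exact prefix matching keeps the example short; real sequencing data would need tolerance for mismatches, which is one reason a dedicated pipeline is useful.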