Your browser doesn't support javascript.
loading
Show: 20 | 50 | 100
Results 1 - 20 de 186
Filter
1.
Genome Res ; 2024 Sep 25.
Article in English | MEDLINE | ID: mdl-39322282

ABSTRACT

Most studies of genome organization have focused on intrachromosomal (cis) contacts because they harbor key features such as DNA loops and topologically associating domains. Interchromosomal (trans) contacts have received much less attention, and tools for interrogating potential biologically relevant trans structures are lacking. Here, we develop a computational framework that uses Hi-C data to identify sets of loci that jointly interact in trans This method, trans-C, initiates probabilistic random walks with restarts from a set of seed loci to traverse an input Hi-C contact network, thereby identifying sets of trans-contacting loci. We validate trans-C in three increasingly complex models of established trans contacts: the Plasmodium falciparum var genes, the mouse olfactory receptor "Greek islands", and the human RBM20 cardiac splicing factory. We then apply trans-C to systematically test the hypothesis that genes coregulated by the same trans-acting element (i.e., a transcription or splicing factor) colocalize in three dimensions to form "RNA factories" that maximize the efficiency and accuracy of RNA biogenesis. We find that many loci with multiple binding sites of the same DNA binding proteins interact with one another in trans, especially those bound by factors with intrinsically disordered domains. Similarly, clustered binding of a subset of RNA-binding proteins correlates with trans interaction of the encoding loci. Intriguingly, we observe that these trans-interacting loci are close to nuclear speckles. Our findings support the existence of trans interacting chromatin domains (TIDs) driven by RNA biogenesis. Trans-C provides an efficient computational framework for studying these and other types of trans interactions, empowering studies of a poorly understood aspect of genome architecture.

2.
bioRxiv ; 2024 Sep 01.
Article in English | MEDLINE | ID: mdl-39257739

ABSTRACT

The cell cycle governs the proliferation, differentiation, and regeneration of all eukaryotic cells. Profiling cell cycle dynamics is therefore central to basic and biomedical research spanning development, health, aging, and disease. However, current approaches to cell cycle profiling involve complex interventions that may confound experimental interpretation. To facilitate more efficient cell cycle annotation of microscopy data, we developed CellCycleNet, a machine learning (ML) workflow designed to simplify cell cycle staging with minimal experimenter intervention and cost. CellCycleNet accurately predicts cell cycle phase using only a fluorescent nuclear stain (DAPI) in fixed interphase cells. Using the Fucci2a cell cycle reporter system as ground truth, we collected two benchmarking image datasets and trained two ML models-a support vector machine (SVM) and a deep neural network-to classify nuclei as being in either the G1 or S/G2 phases of the cell cycle. Our results suggest that CellCycleNet outperforms state-of-the-art SVM models on each dataset individually. When trained on two image datasets simultaneously, CellCycleNet achieves the highest classification accuracy, with an improvement in AUROC of 0.08-0.09. The model also demonstrates excellent generalization across different microscopes, achieving an AUROC of 0.95. Overall, using features derived from 3D images, rather than 2D projections of those same images, significantly improves classification performance. We have released our image data, trained models, and software as a community resource.

3.
J Proteome Res ; 2024 Aug 30.
Article in English | MEDLINE | ID: mdl-39213590

ABSTRACT

A key parameter of any bottom-up proteomics mass spectrometry experiment is the identity of the enzyme that is used to digest proteins in the sample into peptides. The Casanovo de novo sequencing model was trained using data that was generated with trypsin digestion; consequently, the model prefers to predict peptides that end with the amino acids "K" or "R". This bias is desirable when Casanovo is used to analyze data that was also generated using trypsin but can be problematic if the data was generated using some other digestion enzyme. In this work, we modify Casanovo to take as input the identity of the digestion enzyme alongside each observed spectrum. We then train Casanovo with data generated by using several different enzymes, and we demonstrate that the resulting model successfully learns to capture enzyme-specific behavior. However, we find, surprisingly, that this new model does not yield a significant improvement in sequencing accuracy relative to a model trained without enzyme information but using the same training set. This observation may have important implications for future attempts to make use of experimental metadata in de novo sequencing models.

4.
Nat Commun ; 15(1): 6427, 2024 Jul 30.
Article in English | MEDLINE | ID: mdl-39080256

ABSTRACT

A fundamental challenge in mass spectrometry-based proteomics is the identification of the peptide that generated each acquired tandem mass spectrum. Approaches that leverage known peptide sequence databases cannot detect unexpected peptides and can be impractical or impossible to apply in some settings. Thus, the ability to assign peptide sequences to tandem mass spectra without prior information-de novo peptide sequencing-is valuable for tasks including antibody sequencing, immunopeptidomics, and metaproteomics. Although many methods have been developed to address this problem, it remains an outstanding challenge in part due to the difficulty of modeling the irregular data structure of tandem mass spectra. Here, we describe Casanovo, a machine learning model that uses a transformer neural network architecture to translate the sequence of peaks in a tandem mass spectrum into the sequence of amino acids that comprise the generating peptide. We train a Casanovo model from 30 million labeled spectra and demonstrate that the model outperforms several state-of-the-art methods on a cross-species benchmark dataset. We also develop a version of Casanovo that is fine-tuned for non-enzymatic peptides. Finally, we demonstrate that Casanovo's superior performance improves the analysis of immunopeptidomics and metaproteomics experiments and allows us to delve deeper into the dark proteome.


Subject(s)
Peptides , Proteomics , Tandem Mass Spectrometry , Peptides/chemistry , Peptides/metabolism , Tandem Mass Spectrometry/methods , Proteomics/methods , Neural Networks, Computer , Machine Learning , Humans , Amino Acid Sequence , Sequence Analysis, Protein/methods , Databases, Protein , Algorithms
5.
Genome Res ; 2024 Jun 07.
Article in English | MEDLINE | ID: mdl-38849157

ABSTRACT

Long-read DNA sequencing has recently emerged as a powerful tool for studying both genetic and epigenetic architectures at single-molecule and single-nucleotide resolution. Long-read epigenetic studies encompass both the direct identification of native cytosine methylation as well as the identification of exogenously placed DNA N6-methyladenine (DNA-m6A). However, detecting DNA-m6A modifications using single-molecule sequencing, as well as coprocessing single-molecule genetic and epigenetic architectures, is limited by computational demands and a lack of supporting tools. Here, we introduce fibertools, a state-of-the-art toolkit that features a semisupervised convolutional neural network for fast and accurate identification of m6A-marked bases using PacBio single-molecule long-read sequencing, as well as the coprocessing of long-read genetic and epigenetic data produced using either PacBio or Oxford Nanopore sequencing platforms. We demonstrate accurate DNA-m6A identification (>90% precision and recall) along >20 kilobase long DNA molecules with a ~1,000-fold improvement in speed. In addition, we demonstrate that fibertools can readily integrate genetic and epigenetic data at single-molecule resolution, including the seamless conversion between molecular and reference coordinate systems, allowing for accurate genetic and epigenetic analyses of long-read data within structurally and somatically variable genomic regions.

6.
Bioinformatics ; 40(Suppl 1): i410-i417, 2024 06 28.
Article in English | MEDLINE | ID: mdl-38940129

ABSTRACT

MOTIVATION: One of the core problems in the analysis of protein tandem mass spectrometry data is the peptide assignment problem: determining, for each observed spectrum, the peptide sequence that was responsible for generating the spectrum. Two primary classes of methods are used to solve this problem: database search and de novo peptide sequencing. State-of-the-art methods for de novo sequencing use machine learning methods, whereas most database search engines use hand-designed score functions to evaluate the quality of a match between an observed spectrum and a candidate peptide from the database. We hypothesized that machine learning models for de novo sequencing implicitly learn a score function that captures the relationship between peptides and spectra, and thus may be re-purposed as a score function for database search. Because this score function is trained from massive amounts of mass spectrometry data, it could potentially outperform existing, hand-designed database search tools. RESULTS: To test this hypothesis, we re-engineered Casanovo, which has been shown to provide state-of-the-art de novo sequencing capabilities, to assign scores to given peptide-spectrum pairs. We then evaluated the statistical power of this Casanovo score function, Casanovo-DB, to detect peptides on a benchmark of three mass spectrometry runs from three different species. In addition, we show that re-scoring with the Percolator post-processor benefits Casanovo-DB more than other score functions, further increasing the number of detected peptides.


Subject(s)
Databases, Protein , Peptides , Peptides/chemistry , Machine Learning , Mass Spectrometry/methods , Algorithms , Sequence Analysis, Protein/methods , Tandem Mass Spectrometry/methods
8.
J Proteome Res ; 23(6): 1907-1914, 2024 Jun 07.
Article in English | MEDLINE | ID: mdl-38687997

ABSTRACT

Traditional database search methods for the analysis of bottom-up proteomics tandem mass spectrometry (MS/MS) data are limited in their ability to detect peptides with post-translational modifications (PTMs). Recently, "open modification" database search strategies, in which the requirement that the mass of the database peptide closely matches the observed precursor mass is relaxed, have become popular as ways to find a wider variety of types of PTMs. Indeed, in one study, Kong et al. reported that the open modification search tool MSFragger can achieve higher statistical power to detect peptides than a traditional "narrow window" database search. We investigated this claim empirically and, in the process, uncovered a potential general problem with false discovery rate (FDR) control in the machine learning postprocessors Percolator and PeptideProphet. This problem might have contributed to Kong et al.'s report that their empirical results suggest that false discovery (FDR) control in the narrow window setting might generally be compromised. Indeed, reanalyzing the same data while using a more standard form of target-decoy competition-based FDR control, we found that, after accounting for chimeric spectra as well as for the inherent difference in the number of candidates in open and narrow searches, the data does not provide sufficient evidence that FDR control in proteomics MS/MS database search is inherently problematic.


Subject(s)
Databases, Protein , Protein Processing, Post-Translational , Proteomics , Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Proteomics/methods , Peptides/analysis , Peptides/chemistry , Machine Learning , Humans , Algorithms , Software
9.
iScience ; 27(5): 109570, 2024 May 17.
Article in English | MEDLINE | ID: mdl-38646172

ABSTRACT

The three-dimensional organization of genomes plays a crucial role in essential biological processes. The segregation of chromatin into A and B compartments highlights regions of activity and inactivity, providing a window into the genomic activities specific to each cell type. Yet, the steep costs associated with acquiring Hi-C data, necessary for studying this compartmentalization across various cell types, pose a significant barrier in studying cell type specific genome organization. To address this, we present a prediction tool called compartment prediction using recurrent neural networks (CoRNN), which predicts compartmentalization of 3D genome using histone modification enrichment. CoRNN demonstrates robust cross-cell-type prediction of A/B compartments with an average AuROC of 90.9%. Cell-type-specific predictions align well with known functional elements, with H3K27ac and H3K36me3 identified as highly predictive histone marks. We further investigate our mispredictions and found that they are located in regions with ambiguous compartmental status. Furthermore, our model's generalizability is validated by predicting compartments in independent tissue samples, which underscores its broad applicability.

10.
bioRxiv ; 2024 Apr 14.
Article in English | MEDLINE | ID: mdl-38645064

ABSTRACT

Over the past 15 years, a variety of next-generation sequencing assays have been developed for measuring the 3D conformation of DNA in the nucleus. Each of these assays gives, for a particular cell or tissue type, a distinct picture of 3D chromatin architecture. Accordingly, making sense of the relationship between genome structure and function requires teasing apart two closely related questions: how does chromatin 3D structure change from one cell type to the next, and how do different measurements of that structure differ from one another, even when the two assays are carried out in the same cell type? In this work, we assemble a collection of chromatin 3D datasets-each represented as a 2D contact map- spanning multiple assay types and cell types. We then build a machine learning model that predicts missing contact maps in this collection. We use the model to systematically explore how genome 3D architecture changes, at the level of compartments, domains, and loops, between cell type and between assay types.

11.
bioRxiv ; 2024 Sep 13.
Article in English | MEDLINE | ID: mdl-38496477

ABSTRACT

The emergence of single-cell time-series datasets enables modeling of changes in various types of cellular profiles over time. However, due to the disruptive nature of single-cell measurements, it is impossible to capture the full temporal trajectory of a particular cell. Furthermore, single-cell profiles can be collected at mismatched time points across different conditions (e.g., sex, batch, disease) and data modalities (e.g., scRNA-seq, scATAC-seq), which makes modeling challenging. Here we propose a joint modeling framework, Sunbear, for integrating multi-condition and multi-modal single-cell profiles across time. Sunbear can be used to impute single-cell temporal profile changes, align multi-dataset and multi-modal profiles across time, and extrapolate single-cell profiles in a missing modality. We applied Sunbear to reveal sex-biased transcription during mouse embryonic development and predict dynamic relationships between epigenetic priming and transcription for cells in which multi-modal profiles are unavailable. Sunbear thus enables the projection of single-cell time-series snapshots to multi-modal and multi-condition views of cellular trajectories.

12.
Nature ; 626(8001): 1084-1093, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38355799

ABSTRACT

The house mouse (Mus musculus) is an exceptional model system, combining genetic tractability with close evolutionary affinity to humans1,2. Mouse gestation lasts only 3 weeks, during which the genome orchestrates the astonishing transformation of a single-cell zygote into a free-living pup composed of more than 500 million cells. Here, to establish a global framework for exploring mammalian development, we applied optimized single-cell combinatorial indexing3 to profile the transcriptional states of 12.4 million nuclei from 83 embryos, precisely staged at 2- to 6-hour intervals spanning late gastrulation (embryonic day 8) to birth (postnatal day 0). From these data, we annotate hundreds of cell types and explore the ontogenesis of the posterior embryo during somitogenesis and of kidney, mesenchyme, retina and early neurons. We leverage the temporal resolution and sampling depth of these whole-embryo snapshots, together with published data4-8 from earlier timepoints, to construct a rooted tree of cell-type relationships that spans the entirety of prenatal development, from zygote to birth. Throughout this tree, we systematically nominate genes encoding transcription factors and other proteins as candidate drivers of the in vivo differentiation of hundreds of cell types. Remarkably, the most marked temporal shifts in cell states are observed within one hour of birth and presumably underlie the massive physiological adaptations that must accompany the successful transition of a mammalian fetus to life outside the womb.


Subject(s)
Animals, Newborn , Embryo, Mammalian , Embryonic Development , Gastrula , Single-Cell Analysis , Time-Lapse Imaging , Animals , Female , Mice , Pregnancy , Animals, Newborn/embryology , Animals, Newborn/genetics , Cell Differentiation/genetics , Embryo, Mammalian/cytology , Embryo, Mammalian/embryology , Embryonic Development/genetics , Gastrula/cytology , Gastrula/embryology , Gastrulation/genetics , Kidney/cytology , Kidney/embryology , Mesoderm/cytology , Mesoderm/enzymology , Neurons/cytology , Neurons/metabolism , Retina/cytology , Retina/embryology , Somites/cytology , Somites/embryology , Time Factors , Transcription Factors/genetics , Transcription, Genetic , Organ Specificity/genetics
13.
Proteomics ; 24(8): e2300084, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38380501

ABSTRACT

Assigning statistical confidence estimates to discoveries produced by a tandem mass spectrometry proteomics experiment is critical to enabling principled interpretation of the results and assessing the cost/benefit ratio of experimental follow-up. The most common technique for computing such estimates is to use target-decoy competition (TDC), in which observed spectra are searched against a database of real (target) peptides and a database of shuffled or reversed (decoy) peptides. TDC procedures for estimating the false discovery rate (FDR) at a given score threshold have been developed for application at the level of spectra, peptides, or proteins. Although these techniques are relatively straightforward to implement, it is common in the literature to skip over the implementation details or even to make mistakes in how the TDC procedures are applied in practice. Here we present Crema, an open-source Python tool that implements several TDC methods of spectrum-, peptide- and protein-level FDR estimation. Crema is compatible with a variety of existing database search tools and provides a straightforward way to obtain robust FDR estimates.


Subject(s)
Algorithms , Peptides , Databases, Protein , Peptides/chemistry , Proteins/analysis , Proteomics/methods
14.
Lab Invest ; 104(1): 100282, 2024 01.
Article in English | MEDLINE | ID: mdl-37924947

ABSTRACT

Large-scale high-dimensional multiomics studies are essential to unravel molecular complexity in health and disease. We developed an integrated system for tissue sampling (CryoGrid), analytes preparation (PIXUL), and downstream multiomic analysis in a 96-well plate format (Matrix), MultiomicsTracks96, which we used to interrogate matched frozen and formalin-fixed paraffin-embedded (FFPE) mouse organs. Using this system, we generated 8-dimensional omics data sets encompassing 4 molecular layers of intracellular organization: epigenome (H3K27Ac, H3K4m3, RNA polymerase II, and 5mC levels), transcriptome (messenger RNA levels), epitranscriptome (m6A levels), and proteome (protein levels) in brain, heart, kidney, and liver. There was a high correlation between data from matched frozen and FFPE organs. The Segway genome segmentation algorithm applied to epigenomic profiles confirmed known organ-specific superenhancers in both FFPE and frozen samples. Linear regression analysis showed that proteomic profiles, known to be poorly correlated with transcriptomic data, can be more accurately predicted by the full suite of multiomics data, compared with using epigenomic, transcriptomic, or epitranscriptomic measurements individually.


Subject(s)
Formaldehyde , Proteomics , Mice , Animals , Fixatives , Tissue Fixation/methods , Proteomics/methods , Paraffin Embedding/methods
15.
bioRxiv ; 2023 Sep 22.
Article in English | MEDLINE | ID: mdl-37790381

ABSTRACT

Most studies of genome organization have focused on intra-chromosomal (cis) contacts because they harbor key features such as DNA loops and topologically associating domains. Inter-chromosomal (trans) contacts have received much less attention, and tools for interrogating potential biologically relevant trans structures are lacking. Here, we develop a computational framework to identify sets of loci that jointly interact in trans from Hi-C data. This method, trans-C, initiates probabilistic random walks with restarts from a set of seed loci to traverse an input Hi-C contact network, thereby identifying sets of trans-contacting loci. We validate trans-C in three increasingly complex models of established trans contacts: the Plasmodium falciparum var genes, the mouse olfactory receptor "Greek islands", and the human RBM20 cardiac splicing factory. We then apply trans-C to systematically test the hypothesis that genes co-regulated by the same trans-acting element (i.e., a transcription or splicing factor) co-localize in three dimensions to form "RNA factories" that maximize the efficiency and accuracy of RNA biogenesis. We find that many loci with multiple binding sites of the same transcription factor interact with one another in trans, especially those bound by transcription factors with intrinsically disordered domains. Similarly, clustered binding of a subset of RNA binding proteins correlates with trans interaction of the encoding loci. These findings support the existence of trans interacting chromatin domains (TIDs) driven by RNA biogenesis. Trans-C provides an efficient computational framework for studying these and other types of trans interactions, empowering studies of a poorly understood aspect of genome architecture.

16.
bioRxiv ; 2023 Oct 20.
Article in English | MEDLINE | ID: mdl-37905060

ABSTRACT

Cross-species comparison and prediction of gene expression profiles are important to understand regulatory changes during evolution and to transfer knowledge learned from model organisms to humans. Single-cell RNA-seq (scRNA-seq) profiles enable us to capture gene expression profiles with respect to variations among individual cells; however, cross-species comparison of scRNA-seq profiles is challenging because of data sparsity, batch effects, and the lack of one-to-one cell matching across species. Moreover, single-cell profiles are challenging to obtain in certain biological contexts, limiting the scope of hypothesis generation. Here we developed Icebear, a neural network framework that decomposes single-cell measurements into factors representing cell identity, species, and batch factors. Icebear enables accurate prediction of single-cell gene expression profiles across species, thereby providing high-resolution cell type and disease profiles in under-characterized contexts. Icebear also facilitates direct cross-species comparison of single-cell expression profiles for conserved genes that are located on the X chromosome in eutherian mammals but on autosomes in chicken. This comparison, for the first time, revealed evolutionary and diverse adaptations of X-chromosome upregulation in mammals.

17.
Nat Commun ; 14(1): 5086, 2023 08 22.
Article in English | MEDLINE | ID: mdl-37607941

ABSTRACT

The complex life cycle of Plasmodium falciparum requires coordinated gene expression regulation to allow host cell invasion, transmission, and immune evasion. Increasing evidence now suggests a major role for epigenetic mechanisms in gene expression in the parasite. In eukaryotes, many lncRNAs have been identified to be pivotal regulators of genome structure and gene expression. To investigate the regulatory roles of lncRNAs in P. falciparum we explore the intergenic lncRNA distribution in nuclear and cytoplasmic subcellular locations. Using nascent RNA expression profiles, we identify a total of 1768 lncRNAs, of which 718 (~41%) are novels in P. falciparum. The subcellular localization and stage-specific expression of several putative lncRNAs are validated using RNA-FISH. Additionally, the genome-wide occupancy of several candidate nuclear lncRNAs is explored using ChIRP. The results reveal that lncRNA occupancy sites are focal and sequence-specific with a particular enrichment for several parasite-specific gene families, including those involved in pathogenesis and sexual differentiation. Genomic and phenotypic analysis of one specific lncRNA demonstrate its importance in sexual differentiation and reproduction. Our findings bring a new level of insight into the role of lncRNAs in pathogenicity, gene regulation and sexual differentiation, opening new avenues for targeted therapeutic strategies against the deadly malaria parasite.


Subject(s)
Malaria, Falciparum , Malaria , Parasites , RNA, Long Noncoding , Humans , Animals , Plasmodium falciparum/genetics , RNA, Long Noncoding/genetics , Malaria, Falciparum/genetics
18.
Bioinformatics ; 39(7)2023 07 01.
Article in English | MEDLINE | ID: mdl-37421399

ABSTRACT

MOTIVATION: Modality matching in single-cell omics data analysis-i.e. matching cells across datasets collected using different types of genomic assays-has become an important problem, because unifying perspectives across different technologies holds the promise of yielding biological and clinical discoveries. However, single-cell dataset sizes can now reach hundreds of thousands to millions of cells, which remain out of reach for most multimodal computational methods. RESULTS: We propose LSMMD-MA, a large-scale Python implementation of the MMD-MA method for multimodal data integration. In LSMMD-MA, we reformulate the MMD-MA optimization problem using linear algebra and solve it with KeOps, a CUDA framework for symbolic matrix computation in Python. We show that LSMMD-MA scales to a million cells in each modality, two orders of magnitude greater than existing implementations. AVAILABILITY AND IMPLEMENTATION: LSMMD-MA is freely available at https://github.com/google-research/large_scale_mmdma and archived at https://doi.org/10.5281/zenodo.8076311.


Subject(s)
Genome , Genomics , Genomics/methods , Research Design , Data Analysis , Single-Cell Analysis , Software
19.
Nat Commun ; 14(1): 3303, 2023 06 06.
Article in English | MEDLINE | ID: mdl-37280210

ABSTRACT

Nuclear compartments are prominent features of 3D chromatin organization, but sequencing depth limitations have impeded investigation at ultra fine-scale. CTCF loops are generally studied at a finer scale, but the impact of looping on proximal interactions remains enigmatic. Here, we critically examine nuclear compartments and CTCF loop-proximal interactions using a combination of in situ Hi-C at unparalleled depth, algorithm development, and biophysical modeling. Producing a large Hi-C map with 33 billion contacts in conjunction with an algorithm for performing principal component analysis on sparse, super massive matrices (POSSUMM), we resolve compartments to 500 bp. Our results demonstrate that essentially all active promoters and distal enhancers localize in the A compartment, even when flanking sequences do not. Furthermore, we find that the TSS and TTS of paused genes are often segregated into separate compartments. We then identify diffuse interactions that radiate from CTCF loop anchors, which correlate with strong enhancer-promoter interactions and proximal transcription. We also find that these diffuse interactions depend on CTCF's RNA binding domains. In this work, we demonstrate features of fine-scale chromatin organization consistent with a revised model in which compartments are more precise than commonly thought while CTCF loops are more protracted.


Subject(s)
Chromatin , Enhancer Elements, Genetic , Chromatin/genetics , CCCTC-Binding Factor/genetics , CCCTC-Binding Factor/metabolism , Enhancer Elements, Genetic/genetics , Cell Nucleus/genetics , Cell Nucleus/metabolism , Promoter Regions, Genetic
20.
bioRxiv ; 2023 Dec 11.
Article in English | MEDLINE | ID: mdl-37131601

ABSTRACT

Long-read DNA sequencing has recently emerged as a powerful tool for studying both genetic and epigenetic architectures at single-molecule and single-nucleotide resolution. Long-read epigenetic studies encompass both the direct identification of native cytosine methylation as well as the identification of exogenously placed DNA N6-methyladenine (DNA-m6A). However, detecting DNA-m6A modifications using single-molecule sequencing, as well as co-processing single-molecule genetic and epigenetic architectures, is limited by computational demands and a lack of supporting tools. Here, we introduce fibertools, a state-of-the-art toolkit that features a semi-supervised convolutional neural network for fast and accurate identification of m6A-marked bases using PacBio single-molecule long-read sequencing, as well as the co-processing of long-read genetic and epigenetic data produced using either PacBio or Oxford Nanopore sequencing platforms. We demonstrate accurate DNA-m6A identification (>90% precision and recall) along >20 kilobase long DNA molecules with a ~1,000-fold improvement in speed. In addition, we demonstrate that fibertools can readily integrate genetic and epigenetic data at single-molecule resolution, including the seamless conversion between molecular and reference coordinate systems, allowing for accurate genetic and epigenetic analyses of long-read data within structurally and somatically variable genomic regions.

SELECTION OF CITATIONS
SEARCH DETAIL