Search | Biblioteca Virtual em Saúde Fiocruz

1.

Fast and accurate correction of optical mapping data via spaced seeds.

Salmela, Leena; Mukherjee, Kingshuk; Puglisi, Simon J; Muggli, Martin D; Boucher, Christina.

Bioinformatics ; 36(3): 682-689, 2020 02 01.

Article in English | MEDLINE | ID: mdl-31504206

ABSTRACT

MOTIVATION: Optical mapping data is used in many core genomics applications, including structural variation detection, scaffolding assembled contigs and mis-assembly detection. However, the pervasiveness of spurious and deleted cut sites in the raw data, which are called Rmaps, makes their assembly and alignment challenging. Although there exists another method to error correct Rmap data, named cOMet, it is unable to scale to even moderately large genomes. The challenge faced in error correction is in determining pairs of Rmaps that originate from the same region of the same genome. RESULTS: We create an efficient method for determining pairs of Rmaps that contain significant overlaps between them. Our method relies on the novel and nontrivial adaptation and application of spaced seeds in the context of optical mapping, which allows for spurious and deleted cut sites to be accounted for. We apply our method to detecting and correcting these errors. The resulting error correction method, referred to as Elmeri, improves upon the results of state-of-the-art correction methods but in a fraction of the time. More specifically, cOMet required 9.9 CPU days to error correct Rmap data generated from the human genome, whereas Elmeri required less than 15 CPU hours and improved the quality of the Rmaps by more than four times compared to cOMet. AVAILABILITY AND IMPLEMENTATION: Elmeri is publicly available under GNU Affero General Public License at https://github.com/LeenaSalmela/Elmeri. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Genomics, Software, Algorithms, Human Genome, Humans, Restriction Mapping, DNA Sequence Analysis
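
The spaced-seed idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not Elmeri's algorithm: the function names, the mask "1101", the 1000 bp bin size and the toy fragment lengths are all assumptions made for illustration. Each Rmap is treated as a list of fragment lengths, lengths are quantized into bins, and a mask keeps only the "1" positions of each window, so a mismatch at a "0" position does not prevent two Rmaps from sharing a seed.

```python
from collections import defaultdict
from itertools import combinations


def quantize(fragments, bin_size=1000):
    """Round each fragment length (bp) to the nearest multiple of bin_size."""
    return [round(f / bin_size) for f in fragments]


def spaced_seeds(fragments, mask="1101", bin_size=1000):
    """Yield (offset, key) per window; key keeps only the '1' mask positions."""
    q = quantize(fragments, bin_size)
    width = len(mask)
    for i in range(len(q) - width + 1):
        key = tuple(q[i + j] for j, m in enumerate(mask) if m == "1")
        yield i, key


def candidate_pairs(rmaps, mask="1101", bin_size=1000):
    """Return pairs of Rmap names that share at least one spaced seed."""
    index = defaultdict(set)
    for name, fragments in rmaps.items():
        for _, key in spaced_seeds(fragments, mask, bin_size):
            index[key].add(name)
    pairs = set()
    for names in index.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs
```

In this simplified form the "0" positions only absorb a badly sized fragment. Handling spurious or deleted cut sites, which merge or split fragments, needs the more careful seed definition that the paper describes.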

2.

Aligning optical maps to de Bruijn graphs.

Mukherjee, Kingshuk; Alipanahi, Bahar; Kahveci, Tamer; Salmela, Leena; Boucher, Christina.

Bioinformatics ; 35(18): 3250-3256, 2019 09 15.

Article in English | MEDLINE | ID: mdl-30698651

ABSTRACT

MOTIVATION: Optical maps are high-resolution restriction maps (Rmaps) that give a unique numeric representation to a genome. Used in concert with sequence reads, they provide a useful tool for genome assembly and for discovering structural variations and rearrangements. Although they have been a regular feature of modern genome assembly projects, optical maps have been mainly used in a post-processing step and not in the genome assembly process itself. Several methods have been proposed for pairwise alignment of single molecule optical maps, called Rmaps, or for aligning optical maps to assembled reads. However, the problem of aligning an Rmap to a graph representing the sequence data of the same genome has not been studied before. Such an alignment provides a mapping between two sets of data, optical maps and sequence data, which will facilitate the usage of optical maps in the sequence assembly step itself. RESULTS: We define the problem of aligning an Rmap to a de Bruijn graph and present the first algorithm for solving this problem, which is based on a seed-and-extend approach. We demonstrate that our method is capable of aligning 73% of Rmaps generated from the Escherichia coli genome to the de Bruijn graph constructed from short reads generated from the same genome. We validate the alignments and show that our method achieves an accuracy of 99.6%. We also show that our method scales to larger genomes. In particular, we show that 76% of Rmaps can be aligned to the de Bruijn graph in the case of human data. AVAILABILITY AND IMPLEMENTATION: The software for aligning optical maps to de Bruijn graphs, omGraph, is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/omGraph. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms, Software, Genome, Restriction Mapping, DNA Sequence Analysis
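
To make the "numeric representation" in the abstract above concrete, the sketch below digests a sequence in silico into a list of fragment lengths, which is the kind of data an Rmap holds. It is illustrative only and is not part of omGraph: the function name is hypothetical, and cutting at the first base of each recognition site is a simplifying assumption.

```python
def in_silico_digest(sequence, site):
    """Return the fragment lengths from cutting `sequence` at each `site`.

    Simplifying assumption: the cut is placed at the first base of the
    recognition site, and zero-length fragments are dropped.
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]
```

Aligning such a fragment list to a de Bruijn graph then amounts to finding a path in the graph whose in silico digest resembles the observed fragment sizes.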

3.

Fast and accurate correction of optical mapping data via spaced seeds.

Salmela, Leena; Mukherjee, Kingshuk; Puglisi, Simon J; Muggli, Martin D; Boucher, Christina.

Bioinformatics ; 36(9): 2974, 2020 05 01.

Article in English | MEDLINE | ID: mdl-32187358

4.

Fast and efficient Rmap assembly using the Bi-labelled de Bruijn graph.

Mukherjee, Kingshuk; Rossi, Massimiliano; Salmela, Leena; Boucher, Christina.

Algorithms Mol Biol ; 16(1): 6, 2021 May 25.

Article in English | MEDLINE | ID: mdl-34034751

ABSTRACT

Genome wide optical maps are high resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data. There exists only one publicly-available non-proprietary method for assembly and one proprietary software that is available via an executable. Furthermore, the publicly-available method, by Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006), follows the overlap-layout-consensus (OLC) paradigm and is therefore unable to scale to relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics' Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph based method for Rmap assembly. We implement our approach, which we refer to as RMAPPER, and compare its performance against the assembler of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas testudineus). Our method was able to successfully run on all three genomes. The method of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006) only successfully ran on E. coli. Moreover, on the human genome RMAPPER was at least 130 times faster than Bionano Solve, used five times less memory and produced the highest genome fraction with zero mis-assemblies. Our software, RMAPPER, is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/Rmapper.

5.

Finding Overlapping Rmaps via Clustering.

Mukherjee, Kingshuk; Dole-Muinos, Daniel; Ajayi, Ayomide; Rossi, Massimiliano; Prosperi, Mattia; Boucher, Christina.

IEEE/ACM Trans Comput Biol Bioinform ; PP, 2021 Dec 10.

Article in English | MEDLINE | ID: mdl-34890332

ABSTRACT

Optical mapping has been largely automated, and first produces single molecule restriction maps, called Rmaps, which are assembled to generate genome wide optical maps. Since the location and orientation of each Rmap is unknown, the first problem in the analysis of this data is finding related Rmaps, i.e., pairs of Rmaps that share the same orientation and have significant overlap in their genomic location. Although heuristics for identifying related Rmaps exist, they all require quantization of the data, which leads to a loss in precision. In this paper, we propose a clustering method based on Gaussian mixture modelling, which we refer to as OMclust, that finds overlapping Rmaps without quantization. Using both simulated and real datasets, we show that OMclust substantially improves the precision (from 48.3% to 73.3%) over the state-of-the-art methods while also reducing CPU time and memory consumption. Further, we integrated OMclust into the error correction methods (Elmeri and cOMet) to demonstrate the increase in the performance of these methods. When OMclust was combined with cOMet to error correct Rmap data generated from human DNA, it was able to error correct close to 3x more Rmaps, and reduced the CPU time by more than 35x.

6.

Mobilization of Antibiotic Resistance: Are Current Approaches for Colocalizing Resistomes and Mobilomes Useful?

Slizovskiy, Ilya B; Mukherjee, Kingshuk; Dean, Christopher J; Boucher, Christina; Noyes, Noelle R.

Front Microbiol ; 11: 1376, 2020.

Article in English | MEDLINE | ID: mdl-32695079

ABSTRACT

Antimicrobial resistance (AMR) poses a global human and animal health threat, and predicting AMR persistence and transmission remains an intractable challenge. Shotgun metagenomic sequencing can help overcome this by enabling characterization of AMR genes within all bacterial taxa, most of which are uncultivatable in laboratory settings. Shotgun sequencing, therefore, provides a more comprehensive glance at AMR "potential" within samples, i.e., the "resistome." However, the risk inherent within a given resistome is predicated on the genomic context of various AMR genes, including their presence within mobile genetic elements (MGEs). Therefore, resistome risk stratification can be advanced if AMR profiles are considered in light of the flanking mobilizable genomic milieu (e.g., plasmids, integrative conjugative elements (ICEs), phages, and other MGEs). Because such mediators of horizontal gene transfer (HGT) are involved in uptake by pathogens, investigators are increasingly interested in characterizing that resistome fraction in genomic proximity to HGT mediators, i.e., the "mobilome"; we term this "colocalization." We explored the utility of common colocalization approaches using alignment- and assembly-based techniques, on clinical (human) and agricultural (cattle) fecal metagenomes, obtained from antimicrobial use trials. Ordination revealed that tulathromycin-treated cattle experienced a shift in ICE and plasmid composition versus untreated animals, though the resistome was unaffected during the monitoring period. Contrarily, the human resistome and mobilome composition both shifted shortly after antimicrobial administration, though this rebounded to pre-treatment status. Bayesian networks revealed statistical AMR-MGE co-occurrence in 19% and 2% of edges from the cattle and human networks, respectively, suggesting a putatively greater mobility potential of AMR in cattle feces. Conversely, using Mobility Index (MI) and overlap analysis, abundance of de novo-assembled contigs supporting resistomes flanked by MGE increased shortly post-exposure within human metagenomes, though > 40 days after peak dose such contigs were rare (~2%). MI was not substantially altered by antimicrobial exposure across all cattle metagenomes, ranging from 0.5% to 4.0%. We highlight that current alignment- and assembly-based methods estimating resistome mobility yield contradictory and incomplete results, likely constrained by approach-specific data inputs and bioinformatic limitations. We discuss recent laboratory and computational advancements that may enhance resistome risk analysis in clinical, regulatory, and commercial applications.

7.

Counting motifs in dynamic networks.

Mukherjee, Kingshuk; Hasan, Md Mahmudul; Boucher, Christina; Kahveci, Tamer.

BMC Syst Biol ; 12(Suppl 1): 6, 2018 04 11.

Article in English | MEDLINE | ID: mdl-29671392

ABSTRACT

BACKGROUND: A network motif is a sub-network that occurs frequently in a given network. Detection of such motifs is important since they uncover functions and local properties of the given biological network. Finding motifs is, however, a computationally challenging task as it requires solving the costly subgraph isomorphism problem. Moreover, the topology of biological networks changes over time. These changing networks are called dynamic biological networks. As the network evolves, the frequency of each motif in the network also changes. Computing the frequency of a given motif from scratch in a dynamic network as the network topology evolves is infeasible, particularly for large and fast evolving networks. RESULTS: In this article, we design and develop a scalable method for counting the number of motifs in a dynamic biological network. Our method incrementally updates the frequency of each motif as the underlying network's topology evolves. Our experiments demonstrate that our method can update the frequency of each motif orders of magnitude faster than counting the motif embeddings every time the network changes. If the network evolves more frequently, the margin by which our method outperforms the existing static methods increases. CONCLUSIONS: We evaluated our method extensively using synthetic and real datasets, and show that our method is highly accurate (≥ 96%) and that it can be scaled to large dense networks. The results on real data demonstrate the utility of our method in revealing interesting insights on the evolution of biological processes.

Subjects

Computational Biology/methods, Algorithms, Cell Line, Computer Graphics, Humans, Automated Pattern Recognition, Time Factors
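
The incremental-update idea in the abstract above can be sketched for the simplest motif, the triangle. This is an illustration under stated assumptions, not the authors' method, which addresses general motifs: the class name and its interface are hypothetical. When an edge (u, v) is inserted or deleted, only triangles through that edge change, and their number equals the count of neighbours that u and v share, so no recount from scratch is needed.

```python
from collections import defaultdict


class DynamicTriangleCounter:
    """Maintain the triangle count of an undirected graph under edge updates."""

    def __init__(self):
        self.adj = defaultdict(set)  # node -> set of neighbours
        self.triangles = 0

    def add_edge(self, u, v):
        # Ignore self-loops and edges that are already present.
        if u == v or v in self.adj[u]:
            return
        # Every common neighbour closes exactly one new triangle.
        self.triangles += len(self.adj[u] & self.adj[v])
        self.adj[u].add(v)
        self.adj[v].add(u)

    def remove_edge(self, u, v):
        if v not in self.adj[u]:
            return
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        # Every common neighbour loses the triangle that used this edge.
        self.triangles -= len(self.adj[u] & self.adj[v])
```

Each update costs one set intersection over the two endpoints' neighbourhoods, whereas a full recount would revisit the whole graph after every change.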

8.

Error correcting optical mapping data.

Mukherjee, Kingshuk; Washimkar, Darshan; Muggli, Martin D; Salmela, Leena; Boucher, Christina.

Gigascience ; 7(6), 2018 06 01.

Article in English | MEDLINE | ID: mdl-29846578

ABSTRACT

Optical mapping is a unique system that is capable of producing high-resolution, high-throughput genomic map data that gives information about the structure of a genome. Recently it has been used for scaffolding contigs and for assembly validation for large-scale sequencing projects, including the maize, goat, and Amborella genomes. However, a major impediment in the use of this data is the variety and quantity of errors in the raw optical mapping data, which are called Rmaps. The challenges associated with using Rmap data are analogous to dealing with insertions and deletions in the alignment of long reads. Moreover, they are arguably harder to tackle since the data are numerical and susceptible to inaccuracy. We develop cOMet to error correct Rmap data, which to the best of our knowledge is the only optical mapping error correction method. Our experimental results demonstrate that cOMet has high precision and corrects 82.49% of insertion errors and 77.38% of deletion errors in Rmap data generated from the Escherichia coli K-12 reference genome. Out of the deletion errors corrected, 98.26% are true errors. Similarly, out of the insertion errors corrected, 82.19% are true errors. It also successfully scales to large genomes, improving the quality of 78% and 99% of the Rmaps in the plum and goat genomes, respectively. Last, we show the utility of error correction by demonstrating how it improves the assembly of Rmap data. Error corrected Rmap data results in an assembly that is more contiguous and covers a larger fraction of the genome.

Subjects

Chromosome Mapping/methods, High-Throughput Nucleotide Sequencing/methods, Animals, Computer Simulation, Genetic Databases, Escherichia coli/genetics, Genome, Goats/genetics, Prunus domestica/genetics, Sequence Alignment
