Search | BVS Bolivia

1.

NPGREAT: assembly of human subtelomere regions with the use of ultralong nanopore reads and linked-reads.

Adam, Eleni; Ranjan, Desh; Riethman, Harold.

BMC Bioinformatics ; 23(1): 545, 2022 Dec 16.

Article in English | MEDLINE | ID: mdl-36526983

ABSTRACT

BACKGROUND: Human subtelomeric DNA regulates the length and stability of adjacent telomeres that are critical for cellular function, and contains many gene/pseudogene families. Large evolutionarily recent segmental duplications and associated structural variation in human subtelomeres have made complete sequencing and assembly of these regions difficult to impossible for many loci, complicating or precluding a wide range of genetic analyses to investigate their function.
RESULTS: We present a hybrid assembly method, NanoPore Guided REgional Assembly Tool (NPGREAT), which combines Linked-Read data with mapped ultralong nanopore reads spanning subtelomeric segmental duplications to potentially overcome these difficulties. Linked-Read sets of DNA sequences identified by matches with 1-copy subtelomere sequence adjacent to segmental duplications are assembled and extended into the segmental duplication regions using Regional Extension of Assemblies using Linked-Reads (REXTAL). Mapped telomere-containing ultralong nanopore reads are then used to provide contiguity and correct orientation for matching REXTAL sequence contigs, as well as identification and correction of any misassemblies. Our method was tested on a subset of representative subtelomeres with ultralong nanopore read coverage in the haploid human cell line CHM13. A 10X Linked-Read dataset from CHM13 was combined with ultralong nanopore reads from the same genome to provide improved subtelomere assemblies. Comparison of Nanopore-only assemblies using SHASTA with our NPGREAT assemblies in the distal-most subtelomere regions showed that NPGREAT produced higher-quality and more complete assemblies than SHASTA alone when these regions had low ultralong nanopore coverage (such as cases where large segmental duplications were immediately adjacent to (TTAGGG)n tracts).
CONCLUSION: In genomic regions with large segmental duplications adjacent to telomeres, NPGREAT offers an alternative economical approach to improving assembly accuracy and coverage using linked-read datasets when more expensive HiFi datasets of 10-20 kb reads are unavailable.

Subject(s)

Nanopores , Humans , Genomics , Telomere/genetics , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods
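Entry 1 relies on "telomere-containing" ultralong reads. As a rough illustration of what flagging such a read can look like, the sketch below checks for a run of the human telomere repeat TTAGGG (or its reverse complement CCCTAA). The function name and the `min_repeats` threshold are hypothetical, and exact matching is a simplifying assumption: real nanopore reads contain errors, and this is not presented as the procedure NPGREAT uses.

```python
import re


def has_telomere_tract(read, min_repeats=4):
    """Return True if the read holds at least `min_repeats` consecutive
    exact copies of TTAGGG or of its reverse complement CCCTAA."""
    # Illustrative assumption: exact tandem copies only, no error tolerance.
    pattern = r"(?:TTAGGG){%d,}|(?:CCCTAA){%d,}" % (min_repeats, min_repeats)
    return re.search(pattern, read.upper()) is not None


if __name__ == "__main__":
    example = "ACGT" + "TTAGGG" * 5 + "GATTACA"
    print(has_telomere_tract(example))
```

An error-tolerant variant would score approximate repeats instead of requiring exact copies.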

2.

Analysis of Subtelomeric REXTAL Assemblies Using QUAST.

Islam, Tunazzina; Ranjan, Desh; Zubair, Mohammad; Young, Eleanor; Xiao, Ming; Riethman, Harold.

IEEE/ACM Trans Comput Biol Bioinform ; 18(1): 365-372, 2021.

Article in English | MEDLINE | ID: mdl-31056507

ABSTRACT

Genomic regions of high segmental duplication content and/or structural variation have led to gaps and misassemblies in the human reference sequence, and are refractory to assembly from whole-genome short-read datasets. Human subtelomere regions are highly enriched in both segmental duplication content and structural variations, and as a consequence are both impossible to assemble accurately and highly variable from individual to individual. Recently, we developed a pipeline for improved region-specific assembly called Regional Extension of Assemblies Using Linked-Reads (REXTAL). In this study, we evaluate REXTAL and genome-wide assembly (Supernova) approaches on 10X Genomics linked-reads data sets partitioned and barcoded using the Gel Bead in Emulsion (GEM) microfluidic method. Our results describe the accuracy and relative performance of these two approaches using the reference-based assessment module of QUAST. We show that REXTAL dramatically outperforms the Supernova whole genome assembler in subtelomeric segmental duplication regions, and results in highly accurate assemblies. Nearly all of the REXTAL "misassemblies" identified using default QUAST parameters simply pinpoint locations of tandem repeat arrays in the reference sequence where the repeat array length differs from that in the cognate REXTAL assembly by 1000 bp.

Subject(s)

Chromosome Structures/genetics , Genomics/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Genome, Human/genetics , Humans

3.

Nanopore Guided Assembly of Segmental Duplications near Telomeres.

Adam, Eleni; Islam, Tunazzina; Ranjan, Desh; Riethman, Harold.

Proc IEEE Int Symp Bioinformatics Bioeng ; 2019, 2019 Oct.

Article in English | MEDLINE | ID: mdl-33868775

ABSTRACT

Human subtelomere regions are highly enriched in large segmental duplications and structural variants, leading to many gaps and misassemblies in these regions. We develop a novel method, NPGREAT (NanoPore Guided REgional Assembly Tool), which combines Nanopore ultralong read datasets and short-read assemblies derived from 10x linked-reads to efficiently assemble these subtelomere regions into a single continuous sequence. We show that with the use of ultralong Nanopore reads as a guide, the highly accurate shorter linked-read sequence contigs are correctly oriented, ordered, spaced and extended. In the rare cases where a linked-read sequence contig contains inaccurately assembled segments, the use of Nanopore reads allows for detection and correction of this error. We tested NPGREAT on four representative subtelomeres of the NA12878 human genome (10p, 16p, 19q and 20p). The results demonstrate that the final computed assembly of each subtelomere is accurate and complete.

4.

REXTAL: Regional Extension of Assemblies Using Linked-Reads.

Islam, Tunazzina; Ranjan, Desh; Young, Eleanor; Xiao, Ming; Zubair, Mohammad; Riethman, Harold.

Bioinform Res Appl (2018) ; 10847: 63-78, 2018 Jun.

Article in English | MEDLINE | ID: mdl-32016171

ABSTRACT

It is currently impossible to get complete de-novo assembly of segmentally duplicated genome regions using genome-wide short-read datasets. Here, we devise a new computational method called Regional Extension of Assemblies Using Linked-Reads (REXTAL) for improved region-specific assembly of segmental duplication-containing DNA, leveraging genomic short-read datasets generated from large DNA molecules partitioned and barcoded using the "Gel Bead in Emulsion" (GEM) microfluidic method (Zheng et al., 2016). We show that using REXTAL, it is possible to extend assembly of single-copy diploid DNA into adjacent, otherwise inaccessible subtelomere segmental duplication regions and other subtelomeric gap regions. Moreover, REXTAL is computationally more efficient for the directed assembly of such regions from multiple genomes (e.g., for the comparison of structural variation) than genome-wide assembly approaches.

5.

An Effective Computational Method Incorporating Multiple Secondary Structure Predictions in Topology Determination for Cryo-EM Images.

Biswas, Abhishek; Ranjan, Desh; Zubair, Mohammad; Zeil, Stephanie; Nasr, Kamal Al; He, Jing.

IEEE/ACM Trans Comput Biol Bioinform ; 14(3): 578-586, 2017.

Article in English | MEDLINE | ID: mdl-27008671

ABSTRACT

A key idea in de novo modeling of a medium-resolution density image obtained from cryo-electron microscopy is to compute the optimal mapping between the secondary structure traces observed in the density image and those predicted on the protein sequence. When secondary structures are not determined precisely, either from the image or from the amino acid sequence of the protein, the computational problem becomes more complex. We present an efficient method that addresses the secondary structure placement problem in the presence of multiple secondary structure predictions and computes the optimal mapping. We tested the method using 12 simulated images from α-proteins and two Cryo-EM images of α-β proteins. We observed that the rank of the true topologies is consistently improved by using multiple secondary structure predictions instead of a single prediction. The results show that the algorithm is robust and works well even when errors/misses in the predicted secondary structures are present in the image or the sequence. The results also show that the algorithm is efficient and is able to handle proteins with as many as 33 helices.

Subject(s)

Computational Biology/methods , Cryoelectron Microscopy/methods , Image Processing, Computer-Assisted/methods , Proteins/chemistry , Algorithms , Models, Molecular , Protein Structure, Secondary , Proteins/metabolism

6.

CHALLENGES IN MATCHING SECONDARY STRUCTURES IN CRYO-EM: AN EXPLORATION.

Haslam, Devin; Zubair, Mohammad; Ranjan, Desh; Biswas, Abhishek; He, Jing.

Proceedings (IEEE Int Conf Bioinformatics Biomed) ; 2016: 1714-1719, 2016 Dec.

Article in English | MEDLINE | ID: mdl-29770261

ABSTRACT

Cryo-electron microscopy is a fast-emerging biophysical technique for structural determination of large protein complexes. While more atomic structures are being determined using this technique, it is still challenging to derive atomic structures from density maps produced at medium resolution when no suitable templates are available. A critical step in structure determination is determining how a protein chain threads through the 3-dimensional density map. A dynamic programming method was previously developed to generate K best matches of secondary structures between the density map and its protein sequence using shortest paths in a related weighted graph. We discuss challenges associated with the creation of the weighted graph and explore heuristic methods to solve the problem of matching secondary structures.

7.

A Dynamic Programming Algorithm for Finding the Optimal Placement of a Secondary Structure Topology in Cryo-EM Data.

Biswas, Abhishek; Ranjan, Desh; Zubair, Mohammad; He, Jing.

J Comput Biol ; 22(9): 837-43, 2015 Sep.

Article in English | MEDLINE | ID: mdl-26244416

ABSTRACT

The determination of secondary structure topology is a critical step in deriving the atomic structures from the protein density maps obtained by electron cryomicroscopy. This step often relies on matching the secondary structure traces detected from the protein density map to the secondary structure sequence segments predicted from the amino acid sequence. Due to inaccuracies in both sources of information, a pool of possible secondary structure positions needs to be sampled. One way to approach the problem is to first derive a small number of possible topologies using existing matching algorithms, and then find the optimal placement for each possible topology. We present a Θ(N q^2 h) dynamic programming method to find the optimal placement for a secondary structure topology. We show that our algorithm requires significantly less computational time than the brute force method, which is on the order of Θ(q^N h).

Subject(s)

Computational Biology/methods , Protein Structure, Secondary , Proteins/chemistry , Algorithms , Amino Acid Sequence , Cryoelectron Microscopy/methods , Molecular Sequence Data
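The complexity gap reported in entry 7 (a polynomial dynamic program versus an exponential brute-force search) can be illustrated with a generic chain-structured placement problem. The sketch below is an assumption-laden toy, not the authors' algorithm: it supposes N segments with q candidate positions each and a cost that depends only on the choices for consecutive segments, and it does not model the factor h from the abstract. The function names and the `cost` callback are hypothetical.

```python
from itertools import product


def best_placement_dp(cost, n, q):
    """Minimum total cost over one candidate per segment, in Theta(n * q^2).

    cost(i, a, b) is the (assumed) cost of placing segment i-1 at candidate a
    and segment i at candidate b.
    """
    # best[b]: cheapest way to place segments 0..i with segment i at candidate b
    best = [0] * q
    for i in range(1, n):
        best = [min(best[a] + cost(i, a, b) for a in range(q)) for b in range(q)]
    return min(best)


def best_placement_brute(cost, n, q):
    """Same optimum by enumerating all q^n combinations of candidates."""
    return min(
        sum(cost(i, choice[i - 1], choice[i]) for i in range(1, n))
        for choice in product(range(q), repeat=n)
    )


def toy_cost(i, a, b):
    # Arbitrary deterministic cost used only to exercise the two functions.
    return abs((7 * a + 3 * i) % 5 - (11 * b + i) % 4)


if __name__ == "__main__":
    n, q = 6, 3
    print(best_placement_dp(toy_cost, n, q), best_placement_brute(toy_cost, n, q))
```

Both functions return the same optimum; the dynamic program keeps only the best cost per candidate of the latest segment, which is what replaces the q^N enumeration with N rounds of q^2 comparisons.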

8.

ISQuest: finding insertion sequences in prokaryotic sequence fragment data.

Biswas, Abhishek; Gauthier, David T; Ranjan, Desh; Zubair, Mohammad.

Bioinformatics ; 31(21): 3406-12, 2015 Nov 01.

Article in English | MEDLINE | ID: mdl-26116929

ABSTRACT

MOTIVATION: Insertion sequences (ISs) are transposable elements present in most bacterial and archaeal genomes that play an important role in genomic evolution. The increasing availability of sequenced prokaryotic genomes offers the opportunity to study ISs comprehensively, but development of efficient and accurate tools is required for discovery and annotation. Additionally, prokaryotic genomes are frequently deposited as incomplete or draft-stage assemblies because of the substantial cost and effort required to finish genome assembly projects. Development of methods to identify ISs directly from raw sequence reads or draft genomes is therefore desirable. Software tools such as Optimized Annotation System for Insertion Sequences and IScan currently identify IS elements in completely assembled and annotated genomes; however, to our knowledge no methods have been developed to identify ISs from raw fragment data or partially assembled genomes. We have developed novel methods to solve this computationally challenging problem, and implemented these methods in the software package ISQuest. This software identifies bacterial ISs and their sequence elements (inverted and direct repeats) in raw read data or contigs using flexible search parameters. ISQuest is capable of finding ISs in hundreds of partially assembled genomes within hours, making it a valuable high-throughput tool for a global search of IS elements. We tested ISQuest on simulated read libraries of 3810 complete bacterial genomes and plasmids in GenBank and were capable of detecting 82% of the ISs and transposases annotated in GenBank with 80% sequence identity. CONTACT: abiswas@cs.odu.edu.

Subject(s)

DNA Transposable Elements/genetics , Databases, Nucleic Acid , Genome, Archaeal , Genome, Bacterial , Genomics/methods , Sequence Analysis, DNA/methods , Software , Chromosome Mapping
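Entry 8 mentions inverted repeats as one of the sequence elements of an insertion sequence. A terminal inverted repeat means that the right end of an element is the reverse complement of its left end. The sketch below measures the length of a perfect terminal inverted repeat in a candidate element; it is only an illustration of the concept with hypothetical function names, it assumes exact matching, and it is not a description of how ISQuest searches.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(seq):
    """Length of the longest perfect terminal inverted repeat of `seq`.

    The reverse complement of `seq` starts with the reverse complement of
    its right end, so a shared prefix between `seq` and that string is a
    left end mirrored at the right end.
    """
    rc = reverse_complement(seq)
    limit = len(seq) // 2  # the two repeat copies must not overlap
    k = 0
    while k < limit and seq[k] == rc[k]:
        k += 1
    return k


if __name__ == "__main__":
    element = "GGCATT" + "AAAAAAAA" + "AATGCC"
    print(terminal_inverted_repeat_length(element))
```

Real inverted repeats are often imperfect, so a practical search would allow mismatches rather than stopping at the first one.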

9.

Solving the Secondary Structure Matching Problem in Cryo-EM De Novo Modeling Using a Constrained K-Shortest Path Graph Algorithm.

Al Nasr, Kamal; Ranjan, Desh; Zubair, Mohammad; Chen, Lin; He, Jing.

IEEE/ACM Trans Comput Biol Bioinform ; 11(2): 419-30, 2014.

Article in English | MEDLINE | ID: mdl-26355788

ABSTRACT

Electron cryomicroscopy is becoming a major experimental technique in solving the structures of large molecular assemblies. More and more three-dimensional images have been obtained at medium resolutions between 5 and 10 Å. In this resolution range, major α-helices can be detected as cylindrical sticks and β-sheets can be detected as plane-like regions. A critical question in de novo modeling from cryo-EM images is to determine the match between the detected secondary structures from the image and those on the protein sequence. We formulate this matching problem as a constrained graph problem and present an O(Δ^2 N^2 2^N) algorithm for this NP-hard problem. The algorithm incorporates the dynamic programming approach into a constrained K-shortest path algorithm. Our method, DP-TOSS, has been tested using α-proteins with a maximum of 33 helices and α-β proteins with up to five helices and 12 β-strands. The correct match was ranked within the top 35 for 19 of the 20 α-proteins and all nine α-β proteins tested. The results demonstrate that DP-TOSS improves accuracy, time and memory space in deriving the topologies of the secondary structure elements for proteins with a large number of secondary structures and a complex skeleton.

Subject(s)

Algorithms , Computational Biology/methods , Cryoelectron Microscopy/methods , Models, Molecular , Protein Structure, Secondary , Proteins/chemistry , Imaging, Three-Dimensional

10.

Improved efficiency in cryo-EM secondary structure topology determination from inaccurate data.

Biswas, Abhishek; Si, Dong; Al Nasr, Kamal; Ranjan, Desh; Zubair, Mohammad; He, Jing.

J Bioinform Comput Biol ; 10(3): 1242006, 2012 Jun.

Article in English | MEDLINE | ID: mdl-22809382

ABSTRACT

The determination of the secondary structure topology is a critical step in deriving the atomic structure from the protein density map obtained by electron cryo-microscopy. This step often relies on the matching of two sources of information. One source comes from the secondary structures detected from the protein density map at medium resolution, such as 5-10 Å. The other source comes from the predicted secondary structures from the amino acid sequence. Due to the inaccuracy in either source of information, a pool of possible secondary structure positions needs to be sampled. This paper studies how to reduce the computation of the mapping when the inaccuracy of the secondary structure predictions is considered. We present a method that combines the concept of a dynamic graph with our previous work using constrained shortest paths to identify the topology of the secondary structures. We show a 34.55% reduction in run-time in comparison to the naïve way of handling the inaccuracies. We also show improved accuracy when the potential secondary structure errors are explicitly sampled versus the use of one consensus prediction. Our framework demonstrates the potential of developing computationally effective exact algorithms to identify the optimal topology of the secondary structures when the inaccuracy of the predicted data is considered.

Subject(s)

Cryoelectron Microscopy , Protein Structure, Secondary , Proteins/chemistry , Algorithms , Databases, Protein , Protein Folding

11.

Ranking valid topologies of the secondary structure elements using a constraint graph.

Al Nasr, Kamal; Ranjan, Desh; Zubair, Mohammad; He, Jing.

J Bioinform Comput Biol ; 9(3): 415-30, 2011 Jun.

Article in English | MEDLINE | ID: mdl-21714133

ABSTRACT

Electron cryo-microscopy is a fast-advancing biophysical technique to derive three-dimensional structures of large protein complexes. Using this technique, many density maps have been generated at intermediate resolution, such as 6-10 Å. Although it is challenging to derive the backbone of the protein directly from such density maps, secondary structure elements such as helices and β-sheets can be computationally detected. Our work in this paper provides an approach to enumerate the top-ranked possible topologies instead of enumerating the entire population of the topologies. This approach is particularly practical for large proteins. We developed a directed weighted graph, the topology graph, to represent the secondary structure assignment problem. We prove that the problem of finding the valid topology with the minimum cost is NP-hard. We developed an O(N^2 2^N) dynamic programming algorithm to identify the topology with the minimum cost. The test of 15 proteins suggests that our dynamic programming approach is feasible for proteins of much larger size than we could handle before. The largest protein in the test contains 18 helical sticks detected from the density map out of 33 helices in the protein.

Subject(s)

Models, Molecular , Multiprotein Complexes/chemistry , Multiprotein Complexes/ultrastructure , Protein Structure, Secondary , Biophysical Phenomena , Computational Biology , Computer Graphics , Computer Simulation , Cryoelectron Microscopy , Software
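The O(N^2 2^N) bound quoted in entry 11 is characteristic of dynamic programming over subsets. The sketch below shows that general pattern on a simpler stand-in problem: finding the cheapest order in which to visit every node of a small weighted directed graph. This is the classic Held-Karp style recurrence, offered as an analogy under stated assumptions and not as the authors' topology-graph algorithm. The function names and the weight matrix are hypothetical.

```python
from itertools import permutations


def min_cost_order(w):
    """Cheapest path visiting every node exactly once, in O(n^2 * 2^n).

    w[j][k] is the cost of going from node j directly to node k.
    Assumes at least one node.
    """
    n = len(w)
    inf = float("inf")
    # dp[mask][j]: cheapest path that visits exactly the nodes in `mask`
    # (a bitmask) and ends at node j.
    dp = [[inf] * n for _ in range(1 << n)]
    for j in range(n):
        dp[1 << j][j] = 0
    for mask in range(1 << n):
        for j in range(n):
            current = dp[mask][j]
            if current == inf:
                continue
            for k in range(n):
                if mask & (1 << k):
                    continue  # node k is already on the path
                candidate = current + w[j][k]
                if candidate < dp[mask | (1 << k)][k]:
                    dp[mask | (1 << k)][k] = candidate
    return min(dp[(1 << n) - 1])


def min_cost_order_brute(w):
    """Same optimum by trying all n! orders; used only as a cross-check."""
    n = len(w)
    return min(
        sum(w[p[i]][p[i + 1]] for i in range(n - 1))
        for p in permutations(range(n))
    )


if __name__ == "__main__":
    weights = [[0, 2, 9, 10], [1, 0, 6, 4], [15, 7, 0, 8], [6, 3, 12, 0]]
    print(min_cost_order(weights) == min_cost_order_brute(weights))
```

The table has 2^N subsets times N end nodes, and each entry is extended in N ways, which gives the N^2 2^N count. That is still exponential, but far smaller than enumerating all N! orders.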

12.

Design and implementation of a domain specific language for phylogenetic inference.

Pontelli, Enrico; Ranjan, Desh; Gupta, Gopal; Milligan, Brook.

J Bioinform Comput Biol ; 1(2): 201-30, 2003 Jul.

Article in English | MEDLINE | ID: mdl-15290770

ABSTRACT

Domain experts think and reason at a high level of abstraction when they solve problems in their domain of expertise. We present the design and motivation behind a domain specific language, called ΦLOG, to enable biologists to program solutions to phylogenetic inference problems at a very high level of abstraction. The implementation infrastructure (interpreter, compiler, debugger) for the DSL is automatically obtained through a software engineering framework based on Denotational Semantics and Logic Programming.

Subject(s)

Algorithms , Computational Biology/methods , Gene Expression Profiling/methods , Phylogeny , Programming Languages , Sequence Analysis/methods , Software , Software Design
