1.
BMC Bioinformatics ; 23(1): 545, 2022 Dec 16.
Article in English | MEDLINE | ID: mdl-36526983

ABSTRACT

BACKGROUND: Human subtelomeric DNA regulates the length and stability of adjacent telomeres that are critical for cellular function, and contains many gene/pseudogene families. Large evolutionarily recent segmental duplications and associated structural variation in human subtelomeres have made complete sequencing and assembly of these regions difficult to impossible for many loci, complicating or precluding a wide range of genetic analyses to investigate their function. RESULTS: We present a hybrid assembly method, NanoPore Guided REgional Assembly Tool (NPGREAT), which combines Linked-Read data with mapped ultralong nanopore reads spanning subtelomeric segmental duplications to potentially overcome these difficulties. Linked-Read sets of DNA sequences identified by matches with 1-copy subtelomere sequence adjacent to segmental duplications are assembled and extended into the segmental duplication regions using Regional Extension of Assemblies using Linked-Reads (REXTAL). Mapped telomere-containing ultralong nanopore reads are then used to provide contiguity and correct orientation for matching REXTAL sequence contigs as well as identification/correction of any misassemblies. Our method was tested for a subset of representative subtelomeres with ultralong nanopore read coverage in the haploid human cell line CHM13. A 10X Linked-Read dataset from CHM13 was combined with ultralong nanopore reads from the same genome to provide improved subtelomere assemblies. Comparison of Nanopore-only assemblies using SHASTA with our NPGREAT assemblies in the distal-most subtelomere regions showed that NPGREAT produced higher-quality and more complete assemblies than SHASTA alone when these regions had low ultralong nanopore coverage (such as cases where large segmental duplications were immediately adjacent to (TTAGGG)n tracts).
CONCLUSION: In genomic regions with large segmental duplications adjacent to telomeres, NPGREAT offers an alternative economical approach to improving assembly accuracy and coverage using linked-read datasets when more expensive HiFi datasets of 10-20 kb reads are unavailable.
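The abstract above relies on telomere-containing reads, that is, reads that run into a tract of TTAGGG repeats. As an illustration only, the following minimal Python sketch flags such reads by exact pattern matching. It is not NPGREAT's code: the function name, the `min_repeats` and `window` parameters, and the use of exact matching (real nanopore reads contain errors and would need a tolerant matcher) are all assumptions made for this example.

```python
import re

TELOMERE_UNIT = "TTAGGG"
REVCOMP_UNIT = "CCCTAA"  # reverse complement of TTAGGG


def telomere_tract_end(read, min_repeats=4, window=200):
    """Hypothetical helper: report which end of a read carries a telomere tract.

    Returns "end" if the last `window` bases contain at least `min_repeats`
    consecutive TTAGGG units, "start" if the first `window` bases contain the
    reverse-complement units (the read comes from the opposite strand), and
    None otherwise. Exact matching is a simplification.
    """
    fwd = re.compile("(?:%s){%d,}" % (TELOMERE_UNIT, min_repeats))
    rev = re.compile("(?:%s){%d,}" % (REVCOMP_UNIT, min_repeats))
    if rev.search(read[:window]):
        return "start"
    if fwd.search(read[-window:]):
        return "end"
    return None
```

A read flagged "end" runs toward the chromosome terminus, while a read flagged "start" would be reverse-complemented before being used as a guide.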


Subjects
Nanopores; Humans; Genomics; Telomere/genetics; Sequence Analysis, DNA/methods; High-Throughput Nucleotide Sequencing/methods
2.
Article in English | MEDLINE | ID: mdl-31056507

ABSTRACT

Genomic regions of high segmental duplication content and/or structural variation have led to gaps and misassemblies in the human reference sequence, and are refractory to assembly from whole-genome short-read datasets. Human subtelomere regions are highly enriched in both segmental duplication content and structural variations, and as a consequence are both impossible to assemble accurately and highly variable from individual to individual. Recently, we developed a pipeline for improved region-specific assembly called Regional Extension of Assemblies Using Linked-Reads (REXTAL). In this study, we evaluate REXTAL and genome-wide assembly (Supernova) approaches on 10X Genomics linked-reads data sets partitioned and barcoded using the Gel Bead in Emulsion (GEM) microfluidic method. Our results describe the accuracy and relative performance of these two approaches using the reference-based assessment module of QUAST. We show that REXTAL dramatically outperforms the Supernova whole genome assembler in subtelomeric segmental duplication regions, and results in highly accurate assemblies. Nearly all of the REXTAL "misassemblies" identified using default QUAST parameters simply pinpoint locations of tandem repeat arrays in the reference sequence where the repeat array length differs from that in the cognate REXTAL assembly by 1000 bp.


Subjects
Chromosome Structures/genetics; Genomics/methods; Sequence Alignment/methods; Sequence Analysis, DNA/methods; Genome, Human/genetics; Humans
3.
Article in English | MEDLINE | ID: mdl-33868775

ABSTRACT

Human subtelomere regions are highly enriched in large segmental duplications and structural variants, leading to many gaps and misassemblies in these regions. We develop a novel method, NPGREAT (NanoPore Guided REgional Assembly Tool), which combines Nanopore ultralong read datasets and short-read assemblies derived from 10x linked-reads to efficiently assemble these subtelomere regions into a single continuous sequence. We show that with the use of ultralong Nanopore reads as a guide, the highly accurate shorter linked-read sequence contigs are correctly oriented, ordered, spaced and extended. In the rare cases where a linked-read sequence contig contains inaccurately assembled segments, the use of Nanopore reads allows for detection and correction of this error. We tested NPGREAT on four representative subtelomeres of the NA12878 human genome (10p, 16p, 19q and 20p). The results demonstrate that the final computed assembly of each subtelomere is accurate and complete.
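The ordering, orientation and spacing of contigs along a guide read described above can be pictured with a small Python sketch. This is a simplified illustration under stated assumptions, not the NPGREAT implementation: the `Hit` record and the `layout` function are hypothetical, and the alignments of contigs to the ultralong read are assumed to come from an external aligner.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    """Hypothetical alignment of one contig to the guide read."""
    contig: str
    read_start: int  # 0-based start on the guide read
    read_end: int    # exclusive end on the guide read
    strand: str      # "+" or "-" relative to the guide read


def layout(hits: List[Hit]) -> List[Tuple[str, str, int]]:
    """Order contigs by position on the guide read.

    Returns (contig, strand, gap_before) tuples. A positive gap is sequence
    that only the guide read covers; a negative gap means neighbouring
    contigs overlap on the read.
    """
    result = []
    prev_end = None
    for h in sorted(hits, key=lambda h: h.read_start):
        gap = 0 if prev_end is None else h.read_start - prev_end
        result.append((h.contig, h.strand, gap))
        prev_end = h.read_end
    return result
```

In this picture, a contig reported on the "-" strand would be reverse-complemented before joining, and positive gaps would be filled from the guide read itself.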

4.
Bioinform Res Appl (2018) ; 10847: 63-78, 2018 Jun.
Article in English | MEDLINE | ID: mdl-32016171

ABSTRACT

It is currently impossible to obtain a complete de novo assembly of segmentally duplicated genome regions using genome-wide short-read datasets. Here, we devise a new computational method called Regional Extension of Assemblies Using Linked-Reads (REXTAL) for improved region-specific assembly of segmental duplication-containing DNA, leveraging genomic short-read datasets generated from large DNA molecules partitioned and barcoded using the "Gel Bead in Emulsion" (GEM) microfluidic method (Zheng et al., 2016). We show that using REXTAL, it is possible to extend assembly of single-copy diploid DNA into adjacent, otherwise inaccessible subtelomere segmental duplication regions and other subtelomeric gap regions. Moreover, REXTAL is computationally more efficient for the directed assembly of such regions from multiple genomes (e.g., for the comparison of structural variation) than genome-wide assembly approaches.
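Because reads from the same large DNA molecule share a GEM barcode, reads that match single-copy sequence can be used to pull in their barcode-mates from the adjacent duplicated region. The Python sketch below illustrates that general idea only; it is not REXTAL's code. The function name, the `(read_id, barcode)` input shape, and the `min_bait_reads` threshold are assumptions introduced for this example.

```python
from collections import Counter


def recruit_reads(reads, bait_read_ids, min_bait_reads=2):
    """Hypothetical barcode-based read recruitment.

    reads: iterable of (read_id, barcode) pairs.
    bait_read_ids: set of read ids that matched the single-copy bait sequence.
    Returns ids of all reads whose barcode is carried by at least
    `min_bait_reads` bait-matching reads, in input order.
    """
    reads = list(reads)
    bait_counts = Counter(bc for rid, bc in reads if rid in bait_read_ids)
    selected = {bc for bc, n in bait_counts.items() if n >= min_bait_reads}
    return [rid for rid, bc in reads if bc in selected]
```

The recruited reads, rather than the whole genome-wide dataset, would then be handed to a local assembler, which is where the efficiency gain mentioned in the abstract would come from.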

5.
Article in English | MEDLINE | ID: mdl-27008671

ABSTRACT

A key idea in de novo modeling of a medium-resolution density image obtained from cryo-electron microscopy is to compute the optimal mapping between the secondary structure traces observed in the density image and those predicted on the protein sequence. When secondary structures are not determined precisely, either from the image or from the amino acid sequence of the protein, the computational problem becomes more complex. We present an efficient method that addresses the secondary structure placement problem in the presence of multiple secondary structure predictions and computes the optimal mapping. We tested the method using 12 simulated images from α-proteins and two cryo-EM images of α-β proteins. We observed that the rank of the true topologies is consistently improved by using multiple secondary structure predictions instead of a single prediction. The results show that the algorithm is robust and works well even when errors/misses in the predicted secondary structures are present in the image or the sequence. The results also show that the algorithm is efficient and is able to handle proteins with as many as 33 helices.


Subjects
Computational Biology/methods; Cryoelectron Microscopy/methods; Image Processing, Computer-Assisted/methods; Proteins/chemistry; Algorithms; Models, Molecular; Protein Structure, Secondary; Proteins/metabolism
6.
Article in English | MEDLINE | ID: mdl-29770261

ABSTRACT

Cryo-electron microscopy is a fast-emerging biophysical technique for structural determination of large protein complexes. While more atomic structures are being determined using this technique, it is still challenging to derive atomic structures from density maps produced at medium resolution when no suitable templates are available. A critical step in structure determination is establishing how a protein chain threads through the 3-dimensional density map. A dynamic programming method was previously developed to generate K best matches of secondary structures between the density map and its protein sequence using shortest paths in a related weighted graph. We discuss challenges associated with the creation of the weighted graph and explore heuristic methods to solve the problem of matching secondary structures.

7.
J Comput Biol ; 22(9): 837-43, 2015 Sep.
Article in English | MEDLINE | ID: mdl-26244416

ABSTRACT

The determination of secondary structure topology is a critical step in deriving the atomic structures from the protein density maps obtained from electron cryomicroscopy technique. This step often relies on matching the secondary structure traces detected from the protein density map to the secondary structure sequence segments predicted from the amino acid sequence. Due to inaccuracies in both sources of information, a pool of possible secondary structure positions needs to be sampled. One way to approach the problem is to first derive a small number of possible topologies using existing matching algorithms, and then find the optimal placement for each possible topology. We present a dynamic programming method of $\Theta(Nq^{2}h)$ to find the optimal placement for a secondary structure topology. We show that our algorithm requires significantly less computational time than the brute force method that is in the order of $\Theta(q^{N}h)$.
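The gap between a polynomial and an exponential bound can be seen in a generic stage-wise dynamic program. The Python sketch below is an assumption-laden illustration, not the published algorithm: it supposes N elements with q candidate positions each, a hypothetical per-candidate cost table `node_cost`, and a hypothetical `pair_cost` function that scores consecutive choices. It covers only the N·q² structure of the search; the factor h in the stated bound is not modeled.

```python
def best_placement(node_cost, pair_cost):
    """Generic chain DP (hypothetical cost model).

    node_cost[i][a]: cost of placing element i at its candidate a.
    pair_cost(i, a, b): cost of candidate a for element i followed by
        candidate b for element i + 1.
    Returns (total_cost, list of chosen candidate indices).
    Examines N * q * q transitions instead of all q**N combinations.
    """
    n = len(node_cost)
    best = list(node_cost[0])  # best[a]: cheapest prefix ending at candidate a
    back = []                  # back-pointers, one list per stage after the first
    for i in range(1, n):
        cur, ptr = [], []
        for b in range(len(node_cost[i])):
            a_best = min(range(len(best)),
                         key=lambda a: best[a] + pair_cost(i - 1, a, b))
            cur.append(best[a_best] + pair_cost(i - 1, a_best, b) + node_cost[i][b])
            ptr.append(a_best)
        best = cur
        back.append(ptr)
    last = min(range(len(best)), key=best.__getitem__)
    total = best[last]
    path = [last]
    for ptr in reversed(back):  # walk the back-pointers to recover the choices
        path.append(ptr[path[-1]])
    path.reverse()
    return total, path
```

Each stage keeps only the cheapest way to reach every candidate, which is what removes the need to enumerate all combinations.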


Subjects
Computational Biology/methods; Protein Structure, Secondary; Proteins/chemistry; Algorithms; Amino Acid Sequence; Cryoelectron Microscopy/methods; Molecular Sequence Data
8.
Bioinformatics ; 31(21): 3406-12, 2015 Nov 01.
Article in English | MEDLINE | ID: mdl-26116929

ABSTRACT

MOTIVATION: Insertion sequences (ISs) are transposable elements present in most bacterial and archaeal genomes that play an important role in genomic evolution. The increasing availability of sequenced prokaryotic genomes offers the opportunity to study ISs comprehensively, but development of efficient and accurate tools is required for discovery and annotation. Additionally, prokaryotic genomes are frequently deposited at an incomplete or draft stage because of the substantial cost and effort required to finish genome assembly projects. Methods that identify ISs directly from raw sequence reads or draft genomes are therefore desirable. Software tools such as Optimized Annotation System for Insertion Sequences and IScan currently identify IS elements in completely assembled and annotated genomes; however, to our knowledge no methods have been developed to identify ISs from raw fragment data or partially assembled genomes. We have developed novel methods to solve this computationally challenging problem, and implemented these methods in the software package ISQuest. This software identifies bacterial ISs and their sequence elements (inverted and direct repeats) in raw read data or contigs using flexible search parameters. ISQuest is capable of finding ISs in hundreds of partially assembled genomes within hours, making it a valuable high-throughput tool for a global search of IS elements. We tested ISQuest on simulated read libraries of 3810 complete bacterial genomes and plasmids in GenBank, and it detected 82% of the ISs and transposases annotated in GenBank with 80% sequence identity. CONTACT: abiswas@cs.odu.edu.
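Inverted repeats, one of the IS sequence elements named above, are stretches at the two ends of an element that are reverse complements of each other. The abstract does not describe how ISQuest detects them, so the following Python sketch is only a generic illustration; the function names and the `min_len`, `max_len` and `max_mismatch` parameters are assumptions for this example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def terminal_inverted_repeat(seq, min_len=8, max_len=50, max_mismatch=0):
    """Hypothetical check for a terminal inverted repeat.

    Returns the length of the longest prefix of `seq` (between `min_len`
    and `max_len` bases) that equals the reverse complement of the suffix
    of the same length, allowing up to `max_mismatch` mismatches.
    Returns 0 if no such prefix exists.
    """
    for k in range(min(max_len, len(seq) // 2), min_len - 1, -1):
        left = seq[:k]
        right_rc = revcomp(seq[-k:])
        mismatches = sum(1 for x, y in zip(left, right_rc) if x != y)
        if mismatches <= max_mismatch:
            return k
    return 0
```

Real inverted repeats are often imperfect, which is why the sketch exposes a mismatch allowance rather than requiring exact reverse-complement ends.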


Subjects
DNA Transposable Elements/genetics; Databases, Nucleic Acid; Genome, Archaeal; Genome, Bacterial; Genomics/methods; Sequence Analysis, DNA/methods; Software; Chromosome Mapping
9.
Article in English | MEDLINE | ID: mdl-26355788

ABSTRACT

Electron cryomicroscopy is becoming a major experimental technique in solving the structures of large molecular assemblies. More and more three-dimensional images have been obtained at medium resolutions between 5 and 10 Å. At this resolution range, major α-helices can be detected as cylindrical sticks and β-sheets can be detected as plane-like regions. A critical question in de novo modeling from cryo-EM images is to determine the match between the detected secondary structures from the image and those on the protein sequence. We formulate this matching problem as a constrained graph problem and present an $O(\Delta^{2}N^{2}2^{N})$ algorithm for this NP-hard problem. The algorithm incorporates the dynamic programming approach into a constrained K-shortest path algorithm. Our method, DP-TOSS, has been tested using α-proteins with up to 33 helices and α-β proteins with up to five helices and 12 β-strands. The correct match was ranked within the top 35 for 19 of the 20 α-proteins and all nine α-β proteins tested. The results demonstrate that DP-TOSS improves accuracy, time and memory space in deriving the topologies of the secondary structure elements for proteins with a large number of secondary structures and a complex skeleton.


Subjects
Algorithms; Computational Biology/methods; Cryoelectron Microscopy/methods; Models, Molecular; Protein Structure, Secondary; Proteins/chemistry; Imaging, Three-Dimensional
10.
J Bioinform Comput Biol ; 10(3): 1242006, 2012 Jun.
Article in English | MEDLINE | ID: mdl-22809382

ABSTRACT

The determination of the secondary structure topology is a critical step in deriving the atomic structure from the protein density map obtained from electron cryo-microscopy technique. This step often relies on the matching of two sources of information. One source comes from the secondary structures detected from the protein density map at medium resolution, such as 5-10 Å. The other source comes from the predicted secondary structures from the amino acid sequence. Due to the inaccuracy in either source of information, a pool of possible secondary structure positions needs to be sampled. This paper studies how to reduce the computation of the mapping when the inaccuracy of the secondary structure predictions is considered. We present a method that combines the concept of dynamic graph with our previous work of using constrained shortest path to identify the topology of the secondary structures. We show a reduction of 34.55% in run-time compared with the naïve way of handling the inaccuracies. We also show an improved accuracy when the potential secondary structure errors are explicitly sampled versus the use of one consensus prediction. Our framework demonstrated the potential of developing computationally effective exact algorithms to identify the optimal topology of the secondary structures when the inaccuracy of the predicted data is considered.


Subjects
Cryoelectron Microscopy; Protein Structure, Secondary; Proteins/chemistry; Algorithms; Databases, Protein; Protein Folding
11.
J Bioinform Comput Biol ; 9(3): 415-30, 2011 Jun.
Article in English | MEDLINE | ID: mdl-21714133

ABSTRACT

Electron cryo-microscopy is a fast-advancing biophysical technique to derive three-dimensional structures of large protein complexes. Using this technique, many density maps have been generated at intermediate resolution such as 6-10 Å resolution. Although it is challenging to derive the backbone of the protein directly from such density maps, secondary structure elements such as helices and β-sheets can be computationally detected. Our work in this paper provides an approach to enumerate the top-ranked possible topologies instead of enumerating the entire population of the topologies. This approach is particularly practical for large proteins. We developed a directed weighted graph, the topology graph, to represent the secondary structure assignment problem. We prove that the problem of finding the valid topology with the minimum cost is NP-hard. We developed an $O(N^{2}2^{N})$ dynamic programming algorithm to identify the topology with the minimum cost. The test of 15 proteins suggests that our dynamic programming approach is feasible for proteins of much larger size than was previously possible. The largest protein in the test contains 18 helical sticks detected from the density map out of 33 helices in the protein.
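A bound of the form N²·2^N is typical of dynamic programming over subsets. The Python sketch below shows that general pattern on a simplified version of the assignment problem and should not be read as the paper's algorithm. It assumes, for illustration only, equal numbers of sticks and sequence segments, ignores stick direction, and uses hypothetical cost tables `node_cost` and `edge_cost`.

```python
def min_cost_topology(node_cost, edge_cost):
    """Subset DP over sticks (simplified, hypothetical cost model).

    node_cost[i][j]: cost of assigning stick j to the i-th sequence segment.
    edge_cost[a][b]: cost of stick a being followed by stick b.
    Returns (cost, order), where order[i] is the stick given to segment i.
    There are N * 2**N states with N transitions each.
    """
    n = len(node_cost)
    inf = float("inf")
    # dp[mask][j]: cheapest way to fill the first popcount(mask) segments
    # using exactly the sticks in mask, with stick j placed last.
    dp = [[inf] * n for _ in range(1 << n)]
    parent = {}
    for j in range(n):
        dp[1 << j][j] = node_cost[0][j]
    for mask in range(1, 1 << n):
        i = bin(mask).count("1")  # index of the next segment to fill
        if i == n:
            continue
        for last in range(n):
            cur = dp[mask][last]
            if cur == inf:
                continue
            for nxt in range(n):
                if mask & (1 << nxt):
                    continue  # stick already used
                cand = cur + edge_cost[last][nxt] + node_cost[i][nxt]
                new_mask = mask | (1 << nxt)
                if cand < dp[new_mask][nxt]:
                    dp[new_mask][nxt] = cand
                    parent[(new_mask, nxt)] = last
    full = (1 << n) - 1
    end = min(range(n), key=lambda j: dp[full][j])
    cost = dp[full][end]
    order, mask, j = [], full, end
    while True:  # walk parents back to the first segment
        order.append(j)
        prev = parent.get((mask, j))
        if prev is None:
            break
        mask ^= 1 << j
        j = prev
    order.reverse()
    return cost, order
```

Compared with trying all N! orderings, the subset formulation reuses the best cost for each (set of used sticks, last stick) pair, which is what makes larger inputs tractable even though the problem stays exponential.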


Subjects
Models, Molecular; Multiprotein Complexes/chemistry; Multiprotein Complexes/ultrastructure; Protein Structure, Secondary; Biophysical Phenomena; Computational Biology; Computer Graphics; Computer Simulation; Cryoelectron Microscopy; Software
12.
J Bioinform Comput Biol ; 1(2): 201-30, 2003 Jul.
Article in English | MEDLINE | ID: mdl-15290770

ABSTRACT

Domain experts think and reason at a high level of abstraction when they solve problems in their domain of expertise. We present the design and motivation behind a domain specific language, called phi LOG, to enable biologists to program solutions to phylogenetic inference problems at a very high level of abstraction. The implementation infrastructure (interpreter, compiler, debugger) for the DSL is automatically obtained through a software engineering framework based on Denotational Semantics and Logic Programming.


Subjects
Algorithms; Computational Biology/methods; Gene Expression Profiling/methods; Phylogeny; Programming Languages; Sequence Analysis/methods; Software; Software Design