Results 1 - 10 of 10
1.
Nat Methods ; 21(4): 574-583, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38459383

ABSTRACT

Draft genomes generated from Oxford Nanopore Technologies (ONT) long reads are known to have a relatively high error rate. Although existing genome polishers can enhance their quality, the remaining error rate (including mismatches, indels and switching errors between paternal and maternal haplotypes) can still be significant. Here, we develop two polishers, hypo-short and hypo-hybrid, to address this issue. Hypo-short utilizes Illumina short reads to polish an ONT-based draft assembly, resulting in a high-quality assembly with low error rates and few switching errors. Expanding on this, hypo-hybrid incorporates ONT long reads to further refine the assembly into a diploid representation. Leveraging hypo-hybrid, we have created a diploid genome assembly pipeline called hypo-assembler. Hypo-assembler automates the generation of highly accurate, contiguous and nearly complete diploid assemblies using ONT long reads, Illumina short reads and optionally Hi-C reads. Notably, our solution even allows for the production of telomere-to-telomere diploid genomes with additional manual steps. As a proof of concept, we successfully assembled a fully phased telomere-to-telomere diploid genome of HG00733, achieving a quality value exceeding 50.
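
For context on the reported quality value: assuming it is Phred-scaled, as is conventional for assembly quality values, QV = -10 * log10(per-base error rate), so a QV of 50 corresponds to about one error per 100,000 bases. The sketch below only illustrates this conversion; the function names are hypothetical and are not part of hypo-assembler.

```python
import math


def phred_qv(errors: int, total_bases: int) -> float:
    """Phred-scaled quality value, assuming QV = -10 * log10(error rate).

    Hypothetical helper for illustration; not taken from hypo-assembler.
    """
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    if errors <= 0:
        # No observed errors: the Phred scale is unbounded.
        return math.inf
    return -10.0 * math.log10(errors / total_bases)


def error_rate_from_qv(qv: float) -> float:
    """Invert the Phred scaling: per-base error rate implied by a QV."""
    return 10.0 ** (-qv / 10.0)
```

Under this assumption, `error_rate_from_qv(50)` gives a per-base error rate of about 1e-5.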


Subjects
Nanopores , Diploidy , Haploidy , High-Throughput Nucleotide Sequencing/methods , Telomere/genetics , DNA Sequence Analysis/methods
2.
Nucleic Acids Res ; 49(6): e33, 2021 04 06.
Article in English | MEDLINE | ID: mdl-33444454

ABSTRACT

A significant portion of human cancers are due to viruses integrating into human genomes. Therefore, accurately predicting virus integrations can help uncover the mechanisms that lead to many devastating diseases. Virus integrations can be called by analysing second-generation high-throughput sequencing datasets. Unfortunately, existing methods fail to report a significant portion of integrations, while predicting a large number of false positives. We observe that the inaccuracy is caused by incorrect alignment of reads in repetitive regions. False alignments create false positives, while missing alignments create false negatives. This paper proposes SurVirus, an improved virus integration caller that corrects the alignment of reads that are crucial for the discovery of integrations. We use publicly available datasets to show that existing methods predict hundreds of thousands of false positives; SurVirus, on the other hand, is significantly more precise and also detects many novel integrations previously missed by other tools, most of which are in repetitive regions. We validate a subset of these novel integrations and find that the majority are correct. Using SurVirus, we find that HPV and HBV integrations are enriched in LINE and Satellite regions, which had been overlooked, and we discover recurrent HBV and HPV breakpoints in human genome-virus fusion transcripts.


Subjects
Algorithms , Virus Integration , Alphapapillomavirus/genetics , Datasets as Topic , Human Genome , Hepatitis B virus/genetics , Humans , Nucleic Acid Repetitive Sequences , RNA Sequence Analysis , Software
3.
Nucleic Acids Res ; 49(19): 10879-10894, 2021 11 08.
Article in English | MEDLINE | ID: mdl-34643730

ABSTRACT

Large indels greatly impact the observable phenotypes in different organisms, including plants and humans. Hence, extracting large indels with high precision and sensitivity is important. Here, we developed IndelEnsembler to detect large indels in 1047 Arabidopsis whole-genome sequencing datasets. IndelEnsembler identified 34 093 deletions, 12 913 tandem duplications and 9773 insertions. Our large indel dataset was more comprehensive and accurate than the previous dataset of AthCNV (1). We captured nearly twice as many ground truth deletions and on average 27% more ground truth duplications than AthCNV, even though our dataset contains fewer large indels than AthCNV. Our large indels were positively correlated with transposable elements across the Arabidopsis genome. Non-homologous recombination events were the major formation mechanism of deletions in the Arabidopsis genome. The neighbor-joining (NJ) tree constructed from IndelEnsembler's deletions clearly separated the geographic subgroups of the 1047 Arabidopsis accessions. More importantly, our large indels represent a previously unassessed source of genetic variation. Approximately 49% of the deletions have low linkage disequilibrium (LD) with surrounding single nucleotide polymorphisms. Some of them could affect trait performance. For instance, using a deletion-based genome-wide association study (DEL-GWAS), we found that accessions containing a 182-bp deletion in AT1G11520 had delayed flowering time, and all accessions in north Sweden had the 182-bp deletion. We also found that accessions with a 65-bp deletion in the first exon of AT4G00650 (FRI) flowered earlier than those without it. These two deletions cannot be detected in AthCNV and, interestingly, they do not co-occur in any Arabidopsis thaliana accession. By SNP-GWAS, the SNPs surrounding these two deletions do not correlate with flowering time. This example demonstrates that existing large indel datasets miss phenotypic variations and that our large indel dataset fills the gap.


Subjects
Arabidopsis/genetics , Flowers/genetics , Plant Gene Expression Regulation , Plant Genome , INDEL Mutation , Software , Arabidopsis/classification , Arabidopsis/growth & development , Arabidopsis/metabolism , Arabidopsis Proteins/genetics , Arabidopsis Proteins/metabolism , DNA Transposable Elements , Datasets as Topic , Flowers/growth & development , Flowers/metabolism , Gene Duplication , Developmental Gene Expression Regulation , Genome-Wide Association Study , Linkage Disequilibrium , Phenotype , Single Nucleotide Polymorphism , Heritable Quantitative Trait , Genetic Recombination
4.
Bioinformatics ; 37(11): 1497-1505, 2021 Jul 12.
Article in English | MEDLINE | ID: mdl-30989231

ABSTRACT

MOTIVATION: Structural variations (SVs) are large-scale mutations in a genome; although less frequent than point mutations, due to their large size they are responsible for more heritable differences between individuals. Two prominent classes of SVs are deletions and tandem duplications. They play important roles in many devastating genetic diseases, such as Smith-Magenis syndrome, Potocki-Lupski syndrome and Williams-Beuren syndrome. Since paired-end whole genome sequencing data have become widespread and affordable, reliably calling deletions and tandem duplications has been a major target in bioinformatics; unfortunately, the problem is far from being solved, since existing solutions often give poor results when applied to real data. RESULTS: We developed a novel caller, SurVIndel, which focuses on detecting deletions and tandem duplications from paired next-generation sequencing data. SurVIndel uses discordant paired reads, clipped reads and statistical methods. We show that SurVIndel outperforms existing methods on both simulated and real biological datasets. AVAILABILITY AND IMPLEMENTATION: SurVIndel is available at https://github.com/Mesh89/SurVIndel. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

5.
Nucleic Acids Res ; 46(20): e122, 2018 11 16.
Article in English | MEDLINE | ID: mdl-30137425

ABSTRACT

Transpositions transfer DNA segments between different loci within a genome; in particular, when a transposition is found in a sample but not in a reference genome, it is called a non-reference transposition. They are important structural variations that have clinical impact. Transpositions can be called by analyzing second-generation high-throughput sequencing datasets. Current methods follow either a database-based or a database-free approach. Database-based methods require a database of transposable elements. Some of them have good specificity; however, this approach cannot detect novel transpositions, and it requires a good database of transposable elements, which is not yet available for many species. Database-free methods perform de novo calling of transpositions, but their accuracy is low. We observe that this is due to the misalignment of the reads; since reads are short and the human genome has many repeats, false alignments create false positive predictions while missing alignments reduce the true positive rate. This paper proposes new techniques to improve database-free non-reference transposition calling: first, we propose a realignment strategy called one-end remapping that corrects the alignments of reads in interspersed repeats; second, we propose an SNV-aware filter that removes some incorrectly aligned reads. By combining these two techniques with other techniques such as clustering and a positive-to-negative ratio filter, our proposed transposition caller TranSurVeyor shows at least a 3.1-fold improvement in F1-score over existing database-free methods. More importantly, even though TranSurVeyor does not use databases of prior information, its performance is at least as good as that of existing database-based methods such as MELT, Mobster and Retroseq. We also illustrate that TranSurVeyor can discover transpositions that are not known in the current database.
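
As a reminder of how the F1-score mentioned above relates false positives and missed calls, here is a minimal sketch using the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = their harmonic mean). The function name and the convention of returning 0.0 for empty denominators are illustrative assumptions, not taken from TranSurVeyor.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall and F1 from call counts.

    tp: true calls reported, fp: false calls reported, fn: true events missed.
    Hypothetical helper; empty denominators are mapped to 0.0 by assumption.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because F1 is a harmonic mean, a caller with many false positives (low precision) or many missed events (low recall) scores poorly even if the other quantity is high.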


Subjects
Algorithms , Computational Biology/methods , DNA Transposable Elements/genetics , Factual Databases , High-Throughput Nucleotide Sequencing/methods , Human Genome/genetics , Genomics/methods , Humans , Insertional Mutagenesis , Reproducibility of Results
6.
Nat Commun ; 14(1): 3243, 2023 06 05.
Article in English | MEDLINE | ID: mdl-37277343

ABSTRACT

Insertions are one of the major types of structural variations and are defined as the addition of 50 nucleotides or more into a DNA sequence. Several methods exist to detect insertions from next-generation sequencing short read data, but they generally have low sensitivity. Our contribution is two-fold. First, we introduce INSurVeyor, a fast, sensitive and precise method that detects insertions from next-generation sequencing paired-end data. Using publicly available benchmark datasets (both human and non-human), we show that INSurVeyor is not only more sensitive than any individual caller we tested, but also more sensitive than all of them combined. Furthermore, for most types of insertions, INSurVeyor is almost as sensitive as long-read callers. Second, we provide state-of-the-art catalogues of insertions for 1047 Arabidopsis thaliana genomes from the 1001 Genomes Project and 3202 human genomes from the 1000 Genomes Project, both generated with INSurVeyor. We show that they are more complete and precise than existing resources, and that important insertions are missed by existing methods.


Subjects
High-Throughput Nucleotide Sequencing , High-Throughput Nucleotide Sequencing/methods , DNA Sequence Analysis/methods
7.
J Comput Biol ; 26(9): 893-907, 2019 09.
Article in English | MEDLINE | ID: mdl-30990336

ABSTRACT

The previous fastest algorithm for computing the rooted triplet distance between two input galled trees (i.e., phylogenetic networks whose cycles are vertex-disjoint) runs in [Formula: see text] time, where n is the cardinality of the leaf label set. In this article, we present an [Formula: see text]-time solution. Our strategy is to transform the input so that the answer can be obtained by applying an existing [Formula: see text]-time algorithm for the simpler case of two phylogenetic trees a constant number of times. The new algorithm has been implemented, and applying it to pairs of randomly generated galled trees with up to [Formula: see text] leaves confirms that it is fast in practice.


Subjects
Algorithms , Computational Biology/methods , Phylogeny , Computational Biology/standards
8.
J Comput Biol ; 25(7): 740-754, 2018 07.
Article in English | MEDLINE | ID: mdl-29451395

ABSTRACT

The [Formula: see text] Consistency problem takes as input two sets [Formula: see text] and [Formula: see text] of resolved triplets and two sets [Formula: see text] and [Formula: see text] of fan triplets, and asks for a distinctly leaf-labeled tree that contains all elements in [Formula: see text] and no elements in [Formula: see text] as embedded subtrees, if such a tree exists. This article presents a detailed characterization of how the computational complexity of the problem changes under various restrictions. Our main result is an efficient algorithm for dense inputs satisfying [Formula: see text] whose running time is linear in the size of the input and therefore optimal.


Subjects
Computational Biology/statistics & numerical data , Phylogeny , Algorithms , Humans , Genetic Models
9.
Article in English | MEDLINE | ID: mdl-27662679

ABSTRACT

This article presents two new deterministic algorithms for constructing consensus trees. Given an input of [Formula: see text] phylogenetic trees with identical leaf label sets and [Formula: see text] leaves each, the first algorithm constructs the majority rule (+) consensus tree in [Formula: see text] time, which is optimal since the input size is [Formula: see text], and the second one constructs the frequency difference consensus tree in [Formula: see text] time.


Subjects
Algorithms , Computational Biology/methods , Genetic Models , Phylogeny , Cluster Analysis
10.
J Comput Biol ; 24(2): 106-126, 2017 Feb.
Article in English | MEDLINE | ID: mdl-27983874

ABSTRACT

The rooted triplet distance is a measure of the dissimilarity of two phylogenetic trees with identical leaf label sets. An algorithm by Brodal et al. that computes it in [Formula: see text] time and [Formula: see text] space, where n is the number of leaf labels, has recently been implemented in the software package tqDist. In this article, we show that replacing the hierarchical decomposition tree used in Brodal et al.'s algorithm by a centroid paths-based data structure yields an [Formula: see text]-time and [Formula: see text]-space algorithm that, although slower in theory, is faster in practice as well as less memory consuming. Simulations for values of n up to 4,000,000 support our claims experimentally.
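
To make the definition concrete: the rooted triplet distance counts the three-leaf subsets whose induced subtree (one of the three resolved shapes, or a fan) differs between the two trees. The sketch below is a naive brute-force enumeration over all leaf triples, intended only to illustrate the definition; it is not the algorithm of Brodal et al. nor the centroid-path method described in the abstract, and the nested-tuple tree encoding and all function names are assumptions made for this example.

```python
from itertools import combinations


def index_tree(tree):
    """Assign ids to nodes of a nested-tuple tree (leaves are labels).

    Returns (parent, depth, leaf_node) dictionaries. Hypothetical encoding.
    """
    parent, depth, leaf_node = {}, {}, {}
    stack = [(tree, None, 0)]
    next_id = 0
    while stack:
        node, par, d = stack.pop()
        node_id = next_id
        next_id += 1
        parent[node_id], depth[node_id] = par, d
        if isinstance(node, tuple):
            for child in node:
                stack.append((child, node_id, d + 1))
        else:
            leaf_node[node] = node_id
    return parent, depth, leaf_node


def lca(parent, depth, u, v):
    """Lowest common ancestor by climbing from the deeper node."""
    while u != v:
        if depth[u] >= depth[v]:
            u = parent[u]
        else:
            v = parent[v]
    return u


def triplet_topology(index, a, b, c):
    """Return the pair that groups together below the triple's root, or 'fan'."""
    parent, depth, leaf = index
    ab = lca(parent, depth, leaf[a], leaf[b])
    ac = lca(parent, depth, leaf[a], leaf[c])
    bc = lca(parent, depth, leaf[b], leaf[c])
    if ab == ac == bc:
        return "fan"
    if ac == bc:  # lca(a, b) is the deeper one: ab|c
        return (a, b)
    if ab == bc:  # lca(a, c) is the deeper one: ac|b
        return (a, c)
    return (b, c)  # lca(b, c) is the deeper one: bc|a


def rooted_triplet_distance(t1, t2):
    """Count leaf triples whose induced topology differs (brute force)."""
    i1, i2 = index_tree(t1), index_tree(t2)
    leaves = sorted(i1[2])
    if leaves != sorted(i2[2]):
        raise ValueError("trees must have identical leaf label sets")
    return sum(
        1
        for a, b, c in combinations(leaves, 3)
        if triplet_topology(i1, a, b, c) != triplet_topology(i2, a, b, c)
    )
```

For example, the trees `((("a", "b"), "c"), "d")` and `((("a", "c"), "b"), "d")` disagree only on the triple {a, b, c}. This enumeration examines every triple, so it is only practical for small trees, which is exactly the cost the faster algorithms discussed in these abstracts avoid.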


Subjects
Algorithms , Genetic Models , Phylogeny , Computer Simulation , Software