Search | Virtual Health Library (Biblioteca Virtual em Saúde)

HINGE: long-read assembly achieves optimal repeat resolution.

Kamath, Govinda M; Shomorony, Ilan; Xia, Fei; Courtade, Thomas A; Tse, David N.

Genome Res ; 27(5): 747-756, 2017 05.

Article in English | MEDLINE | ID: mdl-28320918

ABSTRACT

Long-read sequencing technologies have the potential to produce gold-standard de novo genome assemblies, but fully exploiting error-prone reads to resolve repeats remains a challenge. Aggressive approaches to repeat resolution often produce misassemblies, and conservative approaches lead to unnecessary fragmentation. We present HINGE, an assembler that seeks to achieve optimal repeat resolution by distinguishing repeats that can be resolved given the data from those that cannot. This is accomplished by adding "hinges" to reads for constructing an overlap graph where only unresolvable repeats are merged. As a result, HINGE combines the error resilience of overlap-based assemblers with repeat-resolution capabilities of de Bruijn graph assemblers. HINGE was evaluated on the long-read bacterial data sets from the NCTC project. HINGE produces more finished assemblies than Miniasm and the manual pipeline of NCTC based on the HGAP assembler and Circlator. HINGE also allows us to identify 40 data sets where unresolvable repeats prevent the reliable construction of a unique finished assembly. In these cases, HINGE outputs a visually interpretable assembly graph that encodes all possible finished assemblies consistent with the reads, while other approaches such as the NCTC pipeline and FALCON either fragment the assembly or resolve the ambiguity arbitrarily.

Subjects

Contig Mapping/methods; Genomics/methods; Repetitive Sequences, Nucleic Acid; Sequence Analysis, DNA/methods; Software; Animals; Contig Mapping/standards; Genomics/standards; Humans; Sequence Analysis, DNA/standards

An interpretable framework for clustering single-cell RNA-Seq datasets.

Zhang, Jesse M; Fan, Jue; Fan, H Christina; Rosenfeld, David; Tse, David N.

BMC Bioinformatics ; 19(1): 93, 2018 03 09.

Article in English | MEDLINE | ID: mdl-29523077

ABSTRACT

BACKGROUND: With the recent proliferation of single-cell RNA-Seq experiments, several methods have been developed for unsupervised analysis of the resulting datasets. These methods often rely on unintuitive hyperparameters and do not explicitly address the subjectivity associated with clustering. RESULTS: In this work, we present DendroSplit, an interpretable framework for analyzing single-cell RNA-Seq datasets that addresses both the clustering interpretability and clustering subjectivity issues. DendroSplit offers a novel perspective on the single-cell RNA-Seq clustering problem motivated by the definition of "cell type", allowing us to cluster using feature selection to uncover multiple levels of biologically meaningful populations in the data. We analyze several landmark single-cell datasets, demonstrating both the method's efficacy and computational efficiency. CONCLUSION: DendroSplit offers a clustering framework that is comparable to existing methods in terms of accuracy and speed but is novel in its emphasis on interpretability. We provide the full DendroSplit software package at https://github.com/jessemzhang/dendrosplit .

Subjects

Databases, Genetic; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods; Algorithms; Cluster Analysis; Humans; Leukocytes, Mononuclear/metabolism; Reference Standards; Software

Information-optimal genome assembly via sparse read-overlap graphs.

Shomorony, Ilan; Kim, Samuel H; Courtade, Thomas A; Tse, David N C.

Bioinformatics ; 32(17): i494-i502, 2016 09 01.

Article in English | MEDLINE | ID: mdl-27587667

ABSTRACT

MOTIVATION: In the context of third-generation long-read sequencing technologies, read-overlap-based approaches are expected to play a central role in the assembly step. A fundamental challenge in assembling from a read-overlap graph is that the true sequence corresponds to a Hamiltonian path on the graph, and, under most formulations, the assembly problem becomes NP-hard, restricting practical approaches to heuristics. In this work, we avoid this seemingly fundamental barrier by first setting the computational complexity issue aside and seeking an algorithm that targets information limits. In particular, we consider a basic feasibility question: when does the set of reads contain enough information to allow unambiguous reconstruction of the true sequence? RESULTS: Based on insights from this information feasibility question, we present an algorithm, the Not-So-Greedy algorithm, to construct a sparse read-overlap graph. Unlike most other assembly algorithms, Not-So-Greedy comes with a performance guarantee: whenever information feasibility conditions are satisfied, the algorithm reduces the assembly problem to an Eulerian path problem on the resulting graph, which can thus be solved in linear time. In practice, this theoretical guarantee translates into assemblies of higher quality. Evaluations on both simulated reads from real genomes and a PacBio Escherichia coli K12 dataset demonstrate that Not-So-Greedy compares favorably with standard string graph approaches in terms of accuracy of the resulting read-overlap graph and contig N50. AVAILABILITY: Available at github.com/samhykim/nsg. CONTACT: courtade@eecs.berkeley.edu or dntse@stanford.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms; Sequence Analysis, DNA; Computational Biology/methods; Genome; Genome, Bacterial; Metagenomics; Models, Theoretical
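
The abstract above notes that, once the assembly problem is reduced to an Eulerian path problem, it can be solved in linear time. The following is a minimal sketch of that final step only, using Hierholzer's algorithm on a directed multigraph. It is not the Not-So-Greedy algorithm itself, and the function name `eulerian_path` and the toy node labels are hypothetical.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Return the node sequence of an Eulerian path, or None if none exists.

    `edges` is a list of (source, target) pairs of a directed multigraph.
    Runs in time linear in the number of edges (Hierholzer's algorithm).
    """
    if not edges:
        return []
    out_edges = defaultdict(list)
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        out_edges[u].append(v)
        out_deg[u] += 1
        in_deg[v] += 1
        nodes.update((u, v))

    # Degree conditions: at most one start (out = in + 1), at most one
    # end (in = out + 1), and every other node balanced.
    starts = [n for n in nodes if out_deg[n] - in_deg[n] == 1]
    ends = [n for n in nodes if in_deg[n] - out_deg[n] == 1]
    balanced = [n for n in nodes if in_deg[n] == out_deg[n]]
    if len(starts) > 1 or len(starts) != len(ends):
        return None
    if len(balanced) + len(starts) + len(ends) != len(nodes):
        return None

    start = starts[0] if starts else edges[0][0]
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out_edges[u]:
            # Follow an unused edge out of u.
            stack.append(out_edges[u].pop())
        else:
            # Dead end: u is final in the path built so far.
            path.append(stack.pop())
    path.reverse()

    # If some edges were never used, the graph was not connected.
    if len(path) != len(edges) + 1:
        return None
    return path
```

For example, `eulerian_path([("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")])` traverses every edge exactly once, starting at `A` and ending at `D`.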

Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments.

Baharav, Tavor Z; Kamath, Govinda M; Tse, David N; Shomorony, Ilan.

Patterns (N Y) ; 1(6): 100081, 2020 Sep 11.

Article in English | MEDLINE | ID: mdl-33205128

ABSTRACT

Pairwise sequence alignment is often a computational bottleneck in genomic analysis pipelines, particularly in the context of third-generation sequencing technologies. To speed up this process, the pairwise k-mer Jaccard similarity is sometimes used as a proxy for alignment size in order to filter pairs of reads, and min-hashes are employed to efficiently estimate these similarities. However, when the k-mer distribution of a dataset is significantly non-uniform (e.g., due to GC biases and repeats), Jaccard similarity is no longer a good proxy for alignment size. In this work, we introduce a min-hash-based approach for estimating alignment sizes called Spectral Jaccard Similarity, which naturally accounts for uneven k-mer distributions. The Spectral Jaccard Similarity is computed by performing a singular value decomposition on a min-hash collision matrix. We empirically show that this new metric provides significantly better estimates for alignment sizes, and we provide a computationally efficient estimator for these spectral similarity scores.
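
As background for the abstract above, here is a minimal sketch of the baseline it starts from: the k-mer Jaccard similarity between two reads and its min-hash estimate. It does not implement Spectral Jaccard Similarity or the singular value decomposition step. The helper names and the choice of BLAKE2b with a seed prefix as the hash family are assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(kmer, seed):
    """Seeded 64-bit hash of a k-mer (illustrative hash family)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(kmer_set, num_hashes):
    """One minimum hash value per seed; kmer_set must be non-empty."""
    return [min(_hash(x, seed) for x in kmer_set) for seed in range(num_hashes)]


def minhash_jaccard(sig_a, sig_b):
    """Fraction of hash functions whose minima collide.

    Each collision happens with probability equal to the true Jaccard
    similarity, so this fraction is an unbiased estimate of it.
    """
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

The estimate becomes more precise as `num_hashes` grows. It inherits the limitation the abstract describes: when some k-mers are far more common than others, the quantity being estimated is itself a poor proxy for alignment size.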

Valid Post-clustering Differential Analysis for Single-Cell RNA-Seq.

Zhang, Jesse M; Kamath, Govinda M; Tse, David N.

Cell Syst ; 9(4): 383-392.e6, 2019 10 23.

Article in English | MEDLINE | ID: mdl-31521605

ABSTRACT

Single-cell computational pipelines involve two critical steps: organizing cells (clustering) and identifying the markers driving this organization (differential expression analysis). State-of-the-art pipelines perform differential analysis after clustering on the same dataset. We observe that because clustering "forces" separation, reusing the same dataset generates artificially low p values and hence false discoveries. We introduce a valid post-clustering differential analysis framework, which corrects for this problem. We provide software at https://github.com/jessemzhang/tn_test.

Subjects

Computational Biology/methods; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods; Cluster Analysis; Datasets as Topic; Gene Expression Profiling; Humans; Selection Bias; Software
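
The problem described in the abstract above, that clustering "forces" separation and so inflates significance when the same data are reused, can be illustrated with a toy simulation. This sketch does not implement the authors' correction. It draws values from a single Gaussian population, "clusters" them by cutting at the median (a stand-in assumed here for a real clustering method), and compares the resulting Welch t statistic with that of a random split. All function names are hypothetical.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / (va / len(a) + vb / len(b)) ** 0.5


def split_by_median(values):
    """Toy 'clustering': cut one-dimensional data at its median."""
    cut = statistics.median(values)
    low = [v for v in values if v <= cut]
    high = [v for v in values if v > cut]
    return low, high


def simulate(n=200, seed=0):
    """Return (t after clustering, t after a random split) for one population."""
    rng = random.Random(seed)
    values = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Groups defined by the data themselves: the difference is built in.
    low, high = split_by_median(values)
    t_clustered = welch_t(high, low)

    # Groups defined independently of the values: no built-in difference.
    shuffled = values[:]
    rng.shuffle(shuffled)
    t_random = welch_t(shuffled[: n // 2], shuffled[n // 2:])
    return t_clustered, t_random
```

Although there is only one population, the statistic after the median split is very large, which a naive test would report as a highly significant difference; the random split gives a statistic near zero.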

Optimal compressed representation of high throughput sequence data via light assembly.

Ginart, Antonio A; Hui, Joseph; Zhu, Kaiyuan; Numanagic, Ibrahim; Courtade, Thomas A; Sahinalp, S Cenk; Tse, David N.

Nat Commun ; 9(1): 566, 2018 02 08.

Article in English | MEDLINE | ID: mdl-29422526

ABSTRACT

The most effective genomic data compression methods either assemble reads into contigs, or replace them with their alignment positions on a reference genome. Such methods require significant computational resources, but faster alternatives that avoid using explicit or de novo-constructed references fail to match their performance. Here, we introduce a new reference-free compressed representation for genomic data based on light de novo assembly of reads, where each read is represented as a node in a (compact) trie. We show how to efficiently build such tries to compactly represent reads and demonstrate that among all methods using this representation (including all de novo assembly based methods), our method achieves the shortest possible output. We also provide a lower bound on the compression rate achievable on uniformly sampled genomic read data, which our method approximates well. Our method significantly improves the compression performance of alternatives without compromising speed.

Subjects

Algorithms; Computational Biology/methods; Genomics/methods; High-Throughput Nucleotide Sequencing/methods; Genome/genetics; Reproducibility of Results; Software

Publisher Correction: Optimal compressed representation of high throughput sequence data via light assembly.

Ginart, Antonio A; Hui, Joseph; Zhu, Kaiyuan; Numanagic, Ibrahim; Courtade, Thomas A; Sahinalp, S Cenk; Tse, David N.

Nat Commun ; 9(1): 1755, 2018 04 26.

Article in English | MEDLINE | ID: mdl-29700301

ABSTRACT

The original version of this Article contained errors in the affiliations of the authors Ibrahim Numanagic and Thomas A. Courtade, which were incorrectly given as 'Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA' and 'Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA', respectively. Also, the hyperlink for the source code in the Data Availability section was incorrectly given as https://github.iu.edu/kzhu/assembltrie , which links to a page that is not publicly accessible. The source code is publicly accessible at https://github.com/kyzhu/assembltrie . Furthermore, in the PDF version of the Article, the right-hand side of Figure 3 was inadvertently cropped. These errors have now been corrected in both the PDF and HTML versions of the Article.

Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts.

Ntranos, Vasilis; Kamath, Govinda M; Zhang, Jesse M; Pachter, Lior; Tse, David N.

Genome Biol ; 17(1): 112, 2016 05 26.

Article in English | MEDLINE | ID: mdl-27230763

ABSTRACT

Current approaches to single-cell transcriptomic analysis are computationally intensive and require assay-specific modeling, which limits their scope and generality. We propose a novel method that compares and clusters cells based on their transcript-compatibility read counts rather than on the transcript or gene quantifications used in standard analysis pipelines. In the reanalysis of two landmark yet disparate single-cell RNA-seq datasets, we show that our method is up to two orders of magnitude faster than previous approaches, provides accurate and in some cases improved results, and is directly applicable to data from a wide variety of assays.

Subjects

Gene Expression Profiling/methods; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods; Animals; Cerebral Cortex/metabolism; Cluster Analysis; Hippocampus/metabolism; Humans; Mice; Myoblasts/metabolism
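
The idea in the abstract above, comparing cells by their transcript-compatibility read counts rather than by gene or transcript quantifications, can be sketched as follows. Each read is assumed to have already been assigned the set of transcripts it is compatible with; a cell is then summarized by the normalized counts of those sets. The abstract does not name a similarity measure, so the Jensen-Shannon divergence used here is an assumption, as are the function names and the toy transcript IDs.

```python
from collections import Counter
from math import log2


def tcc_vector(read_classes):
    """Normalized transcript-compatibility counts for one cell.

    `read_classes` is a non-empty list with one entry per read: the
    frozenset of transcript IDs that read is compatible with.
    """
    counts = Counter(read_classes)
    total = sum(counts.values())
    return {ec: c / total for ec, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with disjoint support.
    """
    div = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)
        if pk > 0:
            div += 0.5 * pk * log2(pk / mk)
        if qk > 0:
            div += 0.5 * qk * log2(qk / mk)
    return div


# Two toy cells: reads labeled by the transcripts they are compatible with.
cell_a = [frozenset({"t1"}), frozenset({"t1"}), frozenset({"t1", "t2"})]
cell_b = [frozenset({"t3"}), frozenset({"t3", "t4"})]
distance_ab = js_divergence(tcc_vector(cell_a), tcc_vector(cell_b))
```

A matrix of such pairwise distances could then be passed to any standard clustering method; no quantification step is needed to produce it.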
