Search | Virtual Health Library

Indexing and searching petabase-scale nucleotide resources.

Shiryev, Sergey A; Agarwala, Richa.

Nat Methods ; 21(6): 994-1002, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38755321

ABSTRACT

Searching vast and rapidly growing nucleotide content in resources, such as runs in the Sequence Read Archive and assemblies for whole-genome shotgun sequencing projects in GenBank, is currently impractical for most researchers. Here we present Pebblescout, a tool that navigates such content by providing indexing and search capabilities. Indexing uses dense sampling of the sequences in the resource. Search finds subjects (runs or assemblies) that have short sequence matches to a user query, with well-defined guarantees, and ranks them using the informativeness of the matches. We illustrate the functionality of Pebblescout by creating eight databases that index over 3.7 petabases. The web service of Pebblescout can be reached at https://pebblescout.ncbi.nlm.nih.gov. We show that for a wide range of query lengths, Pebblescout provides a data-driven way of finding relevant subsets of large nucleotide resources, substantially reducing the effort for downstream analysis. We also show that Pebblescout results compare favorably to MetaGraph and Sourmash.

Subjects

Software; Nucleotides/genetics; Humans; Databases, Genetic; Computational Biology/methods; Databases, Nucleic Acid; Algorithms
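
The abstract above describes indexing by sampling sequences and a search that ranks subjects by how informative their short matches are, but it does not give the sampling scheme or the scoring formula. The following is only a minimal sketch of that general idea, not Pebblescout's implementation. Everything in it is an assumption made for illustration: the minimizer-style sampler, the values of `K` and `W`, the inverse-frequency weight, and the toy subject names and sequences.

```python
from collections import defaultdict
import math

K = 5  # k-mer length (hypothetical; chosen small for the toy data)
W = 3  # number of consecutive k-mers per sampling window (hypothetical)


def sampled_kmers(seq, k=K, w=W):
    """Keep the lexicographically smallest k-mer from each window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) <= w:
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def build_index(subjects):
    """Map each sampled k-mer to the set of subjects (runs or assemblies) containing it."""
    index = defaultdict(set)
    for name, seq in subjects.items():
        for kmer in sampled_kmers(seq):
            index[kmer].add(name)
    return index


def search(index, query, n_subjects, k=K):
    """Rank subjects by summed weights of matching k-mers; rarer k-mers weigh more."""
    scores = defaultdict(float)
    # Every k-mer of the query is looked up, so any sampled k-mer it shares
    # with a subject is found.
    for kmer in {query[i:i + k] for i in range(len(query) - k + 1)}:
        hits = index.get(kmer, ())
        for name in hits:
            # Assumed informativeness weight: an IDF-like term, not the paper's formula.
            scores[name] += math.log(1 + n_subjects / len(hits))
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy data (hypothetical subject names and sequences).
subjects = {
    "run_A": "ACGTACGTTGCAAGCT",
    "run_B": "TTTTGGGGCCCCAAAA",
    "run_C": "ACGTACGTTGCATTTT",
}
index = build_index(subjects)
ranked = search(index, "ACGTACGTTGCA", len(subjects))
print(ranked)
```

With this kind of window sampling, any exact match spanning at least `K + W - 1` bases contains a sampled k-mer, which is one way a sampled index can still offer a match guarantee. Whether Pebblescout's guarantees take this form is not stated in the abstract.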

Single haplotype assembly of the human genome from a hydatidiform mole.

Steinberg, Karyn Meltz; Schneider, Valerie A; Graves-Lindsay, Tina A; Fulton, Robert S; Agarwala, Richa; Huddleston, John; Shiryev, Sergey A; Morgulis, Aleksandr; Surti, Urvashi; Warren, Wesley C; Church, Deanna M; Eichler, Evan E; Wilson, Richard K.

Genome Res ; 24(12): 2066-76, 2014 Dec.

Article in English | MEDLINE | ID: mdl-25373144

ABSTRACT

A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA, including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content shows this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicates misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.

Subjects

Genome, Human; Haplotypes; Hydatidiform Mole/genetics; Alleles; Chromosome Mapping; Chromosomes, Artificial, Bacterial; Computational Biology/methods; Female; Genomics/methods; Heterozygote; High-Throughput Nucleotide Sequencing; Humans; Polymorphism, Single Nucleotide; Pregnancy; Repetitive Sequences, Nucleic Acid; Segmental Duplications, Genomic; Sequence Analysis, DNA

X. couchianus and X. hellerii genome models provide genomic variation insight among Xiphophorus species.

Shen, Yingjia; Chalopin, Domitille; Garcia, Tzintzuni; Boswell, Mikki; Boswell, William; Shiryev, Sergey A; Agarwala, Richa; Volff, Jean-Nicolas; Postlethwait, John H; Schartl, Manfred; Minx, Patrick; Warren, Wesley C; Walter, Ronald B.

BMC Genomics ; 17: 37, 2016 Jan 07.

Article in English | MEDLINE | ID: mdl-26742787

ABSTRACT

BACKGROUND: Xiphophorus fishes are represented by 26 live-bearing species of tropical fish that express many attributes (e.g., viviparity, genetic and phenotypic variation, ecological adaptation, varied sexual developmental mechanisms, ability to produce fertile interspecies hybrids) that have made them attractive research models for over 85 years. Use of various interspecies hybrids to investigate the genetics underlying spontaneous and induced tumorigenesis has resulted in the development and maintenance of pedigreed Xiphophorus lines specifically bred for research. The recent availability of the X. maculatus reference genome assembly now provides unprecedented opportunities for novel and exciting comparative research studies among Xiphophorus species. RESULTS: We present sequencing, assembly and annotation of two new genomes representing Xiphophorus couchianus and Xiphophorus hellerii. The final X. couchianus and X. hellerii assemblies have total sizes of 708 Mb and 734 Mb and correspond to 98% and 102% of the X. maculatus Jp 163 A genome size, respectively. The rates of single nucleotide change range from 1 per 52 bp to 1 per 69 bp among the three genomes, and the impact of putatively damaging variants is presented. In addition, a survey of transposable elements (TEs) allowed us to deduce an ancestral TE landscape, uncover potentially active TEs, and document a recent burst of TEs during evolution of this genus. CONCLUSIONS: Two new Xiphophorus genomes and their corresponding transcriptomes were efficiently assembled, the former using a novel guided assembly approach. Three assembled genome sequences within this single vertebrate order of New World live-bearing fishes will accelerate our understanding of the relationship between environmental adaptation and genome evolution. In addition, these genome resources provide the capability to determine allele-specific gene regulation among interspecies hybrids produced by crossing any of the three species that are known to produce progeny predisposed to tumor development.

Subjects

Cyprinodontiformes/genetics; Genetic Variation; Genome; Transcriptome/genetics; Animals; Gene Expression Regulation; Genomics; Species Specificity

Finding Candida auris in public metagenomic repositories.

Mario-Vasquez, Jorge E; Bagal, Ujwal R; Lowe, Elijah; Morgulis, Aleksandr; Phan, John; Sexton, D Joseph; Shiryev, Sergey; Slatkevicius, Rytis; Welsh, Rory; Litvintseva, Anastasia P; Blumberg, Matthew; Agarwala, Richa; Chow, Nancy A.

PLoS One ; 19(1): e0291406, 2024.

Article in English | MEDLINE | ID: mdl-38241320

ABSTRACT

Candida auris is a newly emerged multidrug-resistant fungus capable of causing invasive infections with high mortality. Despite intense efforts to understand how this pathogen rapidly emerged and spread worldwide, its environmental reservoirs are poorly understood. Here, we present a collaborative effort between the U.S. Centers for Disease Control and Prevention, the National Center for Biotechnology Information, and GridRepublic (a volunteer computing platform) to identify C. auris sequences in publicly available metagenomic datasets. We developed the MetaNISH pipeline that uses SRPRISM to align sequences to a set of reference genomes and computes a score for each reference genome. We used MetaNISH to scan ~300,000 SRA metagenomic runs from 2010 onwards and identified five datasets containing C. auris reads. Finally, GridRepublic has implemented a prospective C. auris molecular monitoring system using MetaNISH and volunteer computing.

Subjects

Candida; Candidiasis; Humans; Candida/genetics; Candidiasis/microbiology; Candida auris; Prospective Studies; Metagenomics; Antifungal Agents/therapeutic use

Improved BLAST searches using longer words for protein seeding.

Shiryev, Sergey A; Papadopoulos, Jason S; Schäffer, Alejandro A; Agarwala, Richa.

Bioinformatics ; 23(21): 2949-51, 2007 Nov 01.

Article in English | MEDLINE | ID: mdl-17921491

ABSTRACT

MOTIVATION: The blastp and tblastn modules of BLAST are widely used methods for searching protein queries against protein and nucleotide databases, respectively. One heuristic used in BLAST is to consider only database sequences that contain a high-scoring match of length at most 5 to the query. We implemented the capability to use words of length 6 or 7. We demonstrate an improved trade-off between running time and retrieval accuracy, controlled by the score threshold used for short word matches. For example, the running time can be reduced by 20-30% while achieving ROC (receiver operator characteristic) scores similar to those obtained with current default parameters. AVAILABILITY: The option to use long words is in the NCBI C and C++ toolkit code for BLAST, starting with version 2.2.16 of blastall. A Linux executable used to produce the results herein is available at: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/protein_longwords

Subjects

Database Management Systems; Databases, Protein; Information Storage and Retrieval/methods; Proteins/chemistry; Proteins/genetics; Sequence Alignment/methods; User-Computer Interface; Algorithms; Computer Graphics
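
The word-seeding heuristic in the abstract above (keep only database sequences that contain a high-scoring short word match to the query, with the score threshold controlling the trade-off) can be illustrated with a toy example. This is a sketch under stated assumptions, not the NCBI toolkit code. The match/mismatch scores are a hypothetical stand-in for a real substitution matrix such as BLOSUM62, the brute-force scan replaces the lookup tables a real implementation would use, and the sequences, word lengths, and thresholds are invented for illustration.

```python
def word_score(a, b, match=5, mismatch=-2):
    """Score two equal-length words position by position (toy scoring, assumed)."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def find_seeds(query, subject, word_len, threshold):
    """Return (query_pos, subject_pos) pairs whose words score at least threshold."""
    seeds = []
    for i in range(len(query) - word_len + 1):
        q_word = query[i:i + word_len]
        for j in range(len(subject) - word_len + 1):
            if word_score(q_word, subject[j:j + word_len]) >= threshold:
                seeds.append((i, j))
    return seeds


# Hypothetical protein fragments that differ at one aligned position.
query = "MKVLAAGIVG"
subject = "TTMKVLSAGIVGPP"

# Short words with a threshold that only exact matches reach (3 * 5 = 15).
short_seeds = find_seeds(query, subject, word_len=3, threshold=15)

# Longer words with a threshold that tolerates one mismatch (5 * 5 - 2 = 23).
long_seeds = find_seeds(query, subject, word_len=6, threshold=23)

print(short_seeds)
print(long_seeds)
```

Raising the threshold for a given word length yields fewer seeds to extend, which shortens running time but can miss weaker similarities. That is the trade-off the abstract reports tuning with words of length 6 or 7.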
