Search | Portal Regional da BVS

Fast and compact matching statistics analytics.

Cunial, Fabio; Denas, Olgert; Belazzougui, Djamal.

Bioinformatics ; 38(7): 1838-1845, 2022 03 28.

Article in English | MEDLINE | ID: mdl-35134833

ABSTRACT

MOTIVATION: Fast, lightweight methods for comparing the sequence of ever larger assembled genomes from ever growing databases are increasingly needed in the era of accurate long reads and pan-genome initiatives. Matching statistics is a popular method for computing whole-genome phylogenies and for detecting structural rearrangements between two genomes, since it is amenable to fast implementations that require a minimal setup of data structures. However, current implementations use a single core, take too much memory to represent the result, and do not provide efficient ways to analyze the output in order to explore local similarities between the sequences. RESULTS: We develop practical tools for computing matching statistics between large-scale strings, and for analyzing its values, faster and using less memory than the state-of-the-art. Specifically, we design a parallel algorithm for shared-memory machines that computes matching statistics 30 times faster with 48 cores in the cases that are most difficult to parallelize. We design a lossy compression scheme that shrinks the matching statistics array to a bitvector that takes from 0.8 to 0.2 bits per character, depending on the dataset and on the value of a threshold, and that achieves 0.04 bits per character in some variants. And we provide efficient implementations of range-maximum and range-sum queries that take a few tens of milliseconds while operating on our compact representations, and that allow computing key local statistics about the similarity between two strings. Our toolkit makes construction, storage and analysis of matching statistics arrays practical for multiple pairs of the largest genomes available today, possibly enabling new applications in comparative genomics. AVAILABILITY AND IMPLEMENTATION: Our C/C++ code is available at https://github.com/odenas/indexed_ms under GPL-3.0. 
The data underlying this article are available in NCBI Genome at https://www.ncbi.nlm.nih.gov/genome and in the International Genome Sample Resource (IGSR) at https://www.internationalgenome.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
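As an illustration of the quantity discussed in the abstract above, the following minimal sketch computes a matching statistics array by brute force. It assumes the common definition MS[i] = length of the longest prefix of query[i:] that occurs somewhere in the text. The function name `matching_statistics` is hypothetical, and this quadratic-time loop is not the parallel, index-based algorithm of the paper or the API of indexed_ms.

```python
def matching_statistics(query: str, text: str) -> list[int]:
    """Return MS where MS[i] is the length of the longest prefix of
    query[i:] that occurs as a substring of text (naive sketch)."""
    ms = []
    length = 0
    for i in range(len(query)):
        # MS[i] >= MS[i-1] - 1: dropping the first character of a match
        # still leaves a match, so restart from the previous length minus one.
        length = max(length - 1, 0)
        # Extend the match one character at a time while it still occurs.
        while i + length < len(query) and query[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


ms = matching_statistics("CGTTAC", "ACGTACGT")
range_max = max(ms[1:4])  # range-maximum query over positions 1..3
range_sum = sum(ms[0:3])  # range-sum query over positions 0..2
```

The range-maximum and range-sum queries are answered here by scanning a plain Python list; the compact bitvector representation described in the abstract is not reproduced.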

Subjects

Algorithms; Software; Sequence Analysis, DNA/methods; Genomics/methods; Genome

Genome-wide comparative analysis reveals human-mouse regulatory landscape and evolution.

Denas, Olgert; Sandstrom, Richard; Cheng, Yong; Beal, Kathryn; Herrero, Javier; Hardison, Ross C; Taylor, James.

BMC Genomics ; 16: 87, 2015 Feb 14.

Article in English | MEDLINE | ID: mdl-25765714

ABSTRACT

BACKGROUND: Because species-specific gene expression is driven by species-specific regulation, understanding the relationship between sequence and function of the regulatory regions in different species will help elucidate how differences among species arise. Despite active experimental and computational research, relationships among sequence, conservation, and function are still poorly understood. RESULTS: We compared transcription factor occupied segments (TFos) for 116 human and 35 mouse TFs in 546 human and 125 mouse cell types and tissues from the Human and the Mouse ENCODE projects. We based the map between human and mouse TFos on a one-to-one nucleotide cross-species mapper, bnMapper, that utilizes whole genome alignments (WGA). Our analysis shows that TFos are under evolutionary constraint, but a substantial portion (25.1% of mouse and 25.85% of human on average) of the TFos does not have a homologous sequence on the other species; this portion varies among cell types and TFs. Furthermore, 47.67% and 57.01% of the homologous TFos sequence shows binding activity on the other species for human and mouse respectively. However, 79.87% and 69.22% is repurposed such that it binds the same TF in different cells or different TFs in the same cells. Remarkably, within the set of repurposed TFos, the corresponding genome regions in the other species are preferred locations of novel TFos. These events suggest exaptation of some functional regulatory sequences into new function. Despite TFos repurposing, we did not find substantial changes in their predicted target genes, suggesting that CRMs buffer evolutionary events allowing little or no change in the TFos - target gene associations. Thus, the small portion of TFos with strictly conserved occupancy underestimates the degree of conservation of regulatory interactions. CONCLUSION: We mapped regulatory sequences from an extensive number of TFs and cell types between human and mouse using WGA. 
A comparative analysis of this correspondence unveiled the extent of the shared regulatory sequence across TFs and cell types under study. Importantly, a large part of the shared regulatory sequence is repurposed on the other species. This sequence, fueled by turnover events, provides a strong case for exaptation in regulatory elements.

Subjects

Biological Evolution; Genome; Transcription Factors/genetics; Animals; Binding Sites; Comparative Genomic Hybridization; Humans; Mice; Regulatory Sequences, Nucleic Acid/genetics; Transcription Factors/metabolism

A comparative encyclopedia of DNA elements in the mouse genome.

Yue, Feng; Cheng, Yong; Breschi, Alessandra; Vierstra, Jeff; Wu, Weisheng; Ryba, Tyrone; Sandstrom, Richard; Ma, Zhihai; Davis, Carrie; Pope, Benjamin D; Shen, Yin; Pervouchine, Dmitri D; Djebali, Sarah; Thurman, Robert E; Kaul, Rajinder; Rynes, Eric; Kirilusha, Anthony; Marinov, Georgi K; Williams, Brian A; Trout, Diane; Amrhein, Henry; Fisher-Aylor, Katherine; Antoshechkin, Igor; DeSalvo, Gilberto; See, Lei-Hoon; Fastuca, Meagan; Drenkow, Jorg; Zaleski, Chris; Dobin, Alex; Prieto, Pablo; Lagarde, Julien; Bussotti, Giovanni; Tanzer, Andrea; Denas, Olgert; Li, Kanwei; Bender, M A; Zhang, Miaohua; Byron, Rachel; Groudine, Mark T; McCleary, David; Pham, Long; Ye, Zhen; Kuan, Samantha; Edsall, Lee; Wu, Yi-Chieh; Rasmussen, Matthew D; Bansal, Mukul S; Kellis, Manolis; Keller, Cheryl A; Morrissey, Christapher S.

Nature ; 515(7527): 355-64, 2014 Nov 20.

Article in English | MEDLINE | ID: mdl-25409824

ABSTRACT

The laboratory mouse shares the majority of its protein-coding genes with humans, making it the premier model organism in biomedical research, yet the two mammals differ in significant ways. To gain greater insights into both shared and species-specific transcriptional and cellular regulatory programs in the mouse, the Mouse ENCODE Consortium has mapped transcription, DNase I hypersensitivity, transcription factor binding, chromatin modifications and replication domains throughout the mouse genome in diverse cell and tissue types. By comparing with the human genome, we not only confirm substantial conservation in the newly annotated potential functional sequences, but also find a large degree of divergence of sequences involved in transcriptional regulation, chromatin state and higher order chromatin organization. Our results illuminate the wide range of evolutionary forces acting on genes and their regulatory regions, and provide a general resource for research into mammalian biology and mechanisms of human diseases.

Subjects

Genome/genetics; Genomics; Mice/genetics; Molecular Sequence Annotation; Animals; Cell Lineage/genetics; Chromatin/genetics; Chromatin/metabolism; Conserved Sequence/genetics; DNA Replication/genetics; Deoxyribonuclease I/metabolism; Gene Expression Regulation/genetics; Gene Regulatory Networks/genetics; Genome-Wide Association Study; Humans; RNA/genetics; Regulatory Sequences, Nucleic Acid/genetics; Species Specificity; Transcription Factors/metabolism; Transcriptome/genetics

Topologically associating domains are stable units of replication-timing regulation.

Pope, Benjamin D; Ryba, Tyrone; Dileep, Vishnu; Yue, Feng; Wu, Weisheng; Denas, Olgert; Vera, Daniel L; Wang, Yanli; Hansen, R Scott; Canfield, Theresa K; Thurman, Robert E; Cheng, Yong; Gülsoy, Günhan; Dennis, Jonathan H; Snyder, Michael P; Stamatoyannopoulos, John A; Taylor, James; Hardison, Ross C; Kahveci, Tamer; Ren, Bing; Gilbert, David M.

Nature ; 515(7527): 402-5, 2014 Nov 20.

Article in English | MEDLINE | ID: mdl-25409831

ABSTRACT

Eukaryotic chromosomes replicate in a temporal order known as the replication-timing program. In mammals, replication timing is cell-type-specific with at least half the genome switching replication timing during development, primarily in units of 400-800 kilobases ('replication domains'), whose positions are preserved in different cell types, conserved between species, and appear to confine long-range effects of chromosome rearrangements. Early and late replication correlate, respectively, with open and closed three-dimensional chromatin compartments identified by high-resolution chromosome conformation capture (Hi-C), and, to a lesser extent, late replication correlates with lamina-associated domains (LADs). Recent Hi-C mapping has unveiled substructure within chromatin compartments called topologically associating domains (TADs) that are largely conserved in their positions between cell types and are similar in size to replication domains. However, TADs can be further sub-stratified into smaller domains, challenging the significance of structures at any particular scale. Moreover, attempts to reconcile TADs and LADs to replication-timing data have not revealed a common, underlying domain structure. Here we localize boundaries of replication domains to the early-replicating border of replication-timing transitions and map their positions in 18 human and 13 mouse cell types. We demonstrate that, collectively, replication domain boundaries share a near one-to-one correlation with TAD boundaries, whereas within a cell type, adjacent TADs that replicate at similar times obscure replication domain boundaries, largely accounting for the previously reported lack of alignment. 
Moreover, cell-type-specific replication timing of TADs partitions the genome into two large-scale sub-nuclear compartments revealing that replication-timing transitions are indistinguishable from late-replicating regions in chromatin composition and lamina association and accounting for the reduced correlation of replication timing to LADs and heterochromatin. Our results reconcile cell-type-specific sub-nuclear compartmentalization and replication timing with developmentally stable structural domains and offer a unified model for large-scale chromosome structure and function.

Subjects

Chromatin/chemistry; Chromatin/genetics; DNA Replication Timing; DNA/biosynthesis; Animals; Cell Compartmentation; Chromatin/metabolism; Chromatin Assembly and Disassembly; DNA/genetics; Genome/genetics; Heterochromatin/chemistry; Heterochromatin/genetics; Heterochromatin/metabolism; Humans; Mice; Organ Specificity; Time Factors

An encyclopedia of mouse DNA elements (Mouse ENCODE).

Stamatoyannopoulos, John A; Snyder, Michael; Hardison, Ross; Ren, Bing; Gingeras, Thomas; Gilbert, David M; Groudine, Mark; Bender, Michael; Kaul, Rajinder; Canfield, Theresa; Giste, Erica; Johnson, Audra; Zhang, Mia; Balasundaram, Gayathri; Byron, Rachel; Roach, Vaughan; Sabo, Peter J; Sandstrom, Richard; Stehling, A Sandra; Thurman, Robert E; Weissman, Sherman M; Cayting, Philip; Hariharan, Manoj; Lian, Jin; Cheng, Yong; Landt, Stephen G; Ma, Zhihai; Wold, Barbara J; Dekker, Job; Crawford, Gregory E; Keller, Cheryl A; Wu, Weisheng; Morrissey, Christopher; Kumar, Swathi A; Mishra, Tejaswini; Jain, Deepti; Byrska-Bishop, Marta; Blankenberg, Daniel; Lajoie, Bryan R; Jain, Gaurav; Sanyal, Amartya; Chen, Kaun-Bei; Denas, Olgert; Taylor, James; Blobel, Gerd A; Weiss, Mitchell J; Pimkin, Max; Deng, Wulan; Marinov, Georgi K; Williams, Brian A.

Genome Biol ; 13(8): 418, 2012 Aug 13.

Article in English | MEDLINE | ID: mdl-22889292

ABSTRACT

To complement the human Encyclopedia of DNA Elements (ENCODE) project and to enable a broad range of mouse genomics efforts, the Mouse ENCODE Consortium is applying the same experimental pipelines developed for human ENCODE to annotate the mouse genome.

Subjects

Databases, Nucleic Acid; Genomics; Mice/genetics; Molecular Sequence Annotation; Animals; Genome; Genome, Human; Humans; Internet

The genome sequence of the leaf-cutter ant Atta cephalotes reveals insights into its obligate symbiotic lifestyle.

Suen, Garret; Teiling, Clotilde; Li, Lewyn; Holt, Carson; Abouheif, Ehab; Bornberg-Bauer, Erich; Bouffard, Pascal; Caldera, Eric J; Cash, Elizabeth; Cavanaugh, Amy; Denas, Olgert; Elhaik, Eran; Favé, Marie-Julie; Gadau, Jürgen; Gibson, Joshua D; Graur, Dan; Grubbs, Kirk J; Hagen, Darren E; Harkins, Timothy T; Helmkampf, Martin; Hu, Hao; Johnson, Brian R; Kim, Jay; Marsh, Sarah E; Moeller, Joseph A; Muñoz-Torres, Mónica C; Murphy, Marguerite C; Naughton, Meredith C; Nigam, Surabhi; Overson, Rick; Rajakumar, Rajendhran; Reese, Justin T; Scott, Jarrod J; Smith, Chris R; Tao, Shu; Tsutsui, Neil D; Viljakainen, Lumi; Wissler, Lothar; Yandell, Mark D; Zimmer, Fabian; Taylor, James; Slater, Steven C; Clifton, Sandra W; Warren, Wesley C; Elsik, Christine G; Smith, Christopher D; Weinstock, George M; Gerardo, Nicole M; Currie, Cameron R.

PLoS Genet ; 7(2): e1002007, 2011 Feb 10.

Article in English | MEDLINE | ID: mdl-21347285

ABSTRACT

Leaf-cutter ants are one of the most important herbivorous insects in the Neotropics, harvesting vast quantities of fresh leaf material. The ants use leaves to cultivate a fungus that serves as the colony's primary food source. This obligate ant-fungus mutualism is one of the few occurrences of farming by non-humans and likely facilitated the formation of their massive colonies. Mature leaf-cutter ant colonies contain millions of workers ranging in size from small garden tenders to large soldiers, resulting in one of the most complex polymorphic caste systems within ants. To begin uncovering the genomic underpinnings of this system, we sequenced the genome of Atta cephalotes using 454 pyrosequencing. One prediction from this ant's lifestyle is that it has undergone genetic modifications that reflect its obligate dependence on the fungus for nutrients. Analysis of this genome sequence is consistent with this hypothesis, as we find evidence for reductions in genes related to nutrient acquisition. These include extensive reductions in serine proteases (which are likely unnecessary because proteolysis is not a primary mechanism used to process nutrients obtained from the fungus), a loss of genes involved in arginine biosynthesis (suggesting that this amino acid is obtained from the fungus), and the absence of a hexamerin (which sequesters amino acids during larval development in other insects). Following recent reports of genome sequences from other insects that engage in symbioses with beneficial microbes, the A. cephalotes genome provides new insights into the symbiotic lifestyle of this ant and advances our understanding of host-microbe symbioses.

Subjects

Ants/physiology; Genome, Insect/genetics; Plant Leaves/physiology; Symbiosis; Animals; Ants/genetics; Arginine/genetics; Arginine/metabolism; Base Sequence; Fungi/genetics; Insect Proteins/genetics; Insect Proteins/metabolism; Sequence Analysis, DNA; Serine Proteases/genetics; Serine Proteases/metabolism

Efficient tools for comparative substring analysis.

Apostolico, Alberto; Denas, Olgert; Dress, Andreas.

J Biotechnol ; 149(3): 120-6, 2010 Sep 01.

Article in English | MEDLINE | ID: mdl-20682467

ABSTRACT

This paper introduces an efficient implementation of approaches to alignment-free comparative genome analysis and genome-based phylogeny relying on substring composition. Distances derived from substring statistics have been proposed recently as a meaningful alternative to distances derived from sequence alignment. In particular, procaryote phylogenies based on comparative 5- and 6-mer analysis of whole proteomes have successfully been worked out. The present implementation extends the computation of composition-based distances so as to involve all k-mers for any k up to any preset maximum length K (including K = ∞). Remarkably, although there may be Θ(L²) distinct strings that occur in a given sequence of length L (and Θ(KL) of length k ≤ K), it is shown that composition-based distances as well as many other details of interest in comparative genome analysis can be computed in O(L) time and space (with a constant that is independent of the size of K, that is, the same constant works for all K). A typical run with 2 sequences of altogether 1.5 million characters computes their composition-based distance in about 2 s, a performance to be contrasted with the several hours needed, even when restricting attention to substrings of length at most 6, by the direct method in use. This paper.
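To make the notion of a composition-based distance concrete, the sketch below implements the direct method that the abstract contrasts itself with: it enumerates every k-mer for k = 1..K and compares relative frequencies. The Euclidean distance over frequency profiles is an assumption chosen for illustration, since the abstract does not state the paper's exact formula, and the names `kmer_profile` and `composition_distance` are hypothetical. This version takes time proportional to K times the sequence length, not the O(L) bound of the paper.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq: str, k: int) -> dict[str, float]:
    """Relative frequency of each k-mer occurring in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: c / total for kmer, c in counts.items()}


def composition_distance(a: str, b: str, max_k: int) -> float:
    """Euclidean distance between the k-mer frequency profiles of a and b,
    accumulated over all k from 1 to max_k (illustrative choice of metric)."""
    squared = 0.0
    for k in range(1, max_k + 1):
        pa, pb = kmer_profile(a, k), kmer_profile(b, k)
        # Compare over the union of k-mers seen in either sequence.
        for kmer in pa.keys() | pb.keys():
            squared += (pa.get(kmer, 0.0) - pb.get(kmer, 0.0)) ** 2
    return sqrt(squared)


d_same = composition_distance("ACGT", "ACGT", 3)
d_diff = composition_distance("AAAA", "CCCC", 1)
```

Identical sequences give a distance of zero, and two sequences with no shared k-mers give the largest values; any adjustment for a statistical background model is left out of this sketch.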

Subjects

Genomics; Algorithms; Models, Theoretical

Fast algorithms for computing sequence distances by exhaustive substring composition.

Apostolico, Alberto; Denas, Olgert.

Algorithms Mol Biol ; 3: 13, 2008 Oct 28.

Article in English | MEDLINE | ID: mdl-18957094

ABSTRACT

The increasing throughput of sequencing raises growing needs for methods of sequence analysis and comparison on a genomic scale, notably, in connection with phylogenetic tree reconstruction. Such needs are hardly fulfilled by the more traditional measures of sequence similarity and distance, like string edit and gene rearrangement, due to a mixture of epistemological and computational problems. Alternative measures, based on the subword composition of sequences, have emerged in recent years and proved to be both fast and effective in a variety of tested cases. The common denominator of such measures is an underlying information theoretic notion of relative compressibility. Their viability depends critically on computational cost. The present paper describes as a paradigm the extension and efficient implementation of one of the methods in this class. The method is based on the comparison of the frequencies of all subwords in the two input sequences, where frequencies are suitably adjusted to take into account the statistical background.
