Search | Biblioteca Virtual en Salud Odontología. Uruguay

1.

The complete sequence and comparative analysis of ape sex chromosomes.

Makova, Kateryna D; Pickett, Brandon D; Harris, Robert S; Hartley, Gabrielle A; Cechova, Monika; Pal, Karol; Nurk, Sergey; Yoo, DongAhn; Li, Qiuhui; Hebbar, Prajna; McGrath, Barbara C; Antonacci, Francesca; Aubel, Margaux; Biddanda, Arjun; Borchers, Matthew; Bornberg-Bauer, Erich; Bouffard, Gerard G; Brooks, Shelise Y; Carbone, Lucia; Carrel, Laura; Carroll, Andrew; Chang, Pi-Chuan; Chin, Chen-Shan; Cook, Daniel E; Craig, Sarah J C; de Gennaro, Luciana; Diekhans, Mark; Dutra, Amalia; Garcia, Gage H; Grady, Patrick G S; Green, Richard E; Haddad, Diana; Hallast, Pille; Harvey, William T; Hickey, Glenn; Hillis, David A; Hoyt, Savannah J; Jeong, Hyeonsoo; Kamali, Kaivan; Pond, Sergei L Kosakovsky; LaPolice, Troy M; Lee, Charles; Lewis, Alexandra P; Loh, Yong-Hwee E; Masterson, Patrick; McGarvey, Kelly M; McCoy, Rajiv C; Medvedev, Paul; Miga, Karen H; Munson, Katherine M.

Nature ; 630(8016): 401-411, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38811727

ABSTRACT

Apes possess two sex chromosomes: the male-specific Y chromosome and the X chromosome, which is present in both males and females. The Y chromosome is crucial for male reproduction, with deletions being linked to infertility [1]. The X chromosome is vital for reproduction and cognition [2]. Variation in mating patterns and brain function among apes suggests corresponding differences in their sex chromosomes. However, owing to their repetitive nature and incomplete reference assemblies, ape sex chromosomes have been challenging to study. Here, using the methodology developed for the telomere-to-telomere (T2T) human genome, we produced gapless assemblies of the X and Y chromosomes for five great apes (bonobo (Pan paniscus), chimpanzee (Pan troglodytes), western lowland gorilla (Gorilla gorilla gorilla), Bornean orangutan (Pongo pygmaeus) and Sumatran orangutan (Pongo abelii)) and a lesser ape (the siamang gibbon (Symphalangus syndactylus)), and untangled the intricacies of their evolution. Compared with the X chromosomes, the ape Y chromosomes vary greatly in size and have low alignability and high levels of structural rearrangements, owing to the accumulation of lineage-specific ampliconic regions, palindromes, transposable elements and satellites. Many Y chromosome genes expand in multi-copy families and some evolve under purifying selection. Thus, the Y chromosome exhibits dynamic evolution, whereas the X chromosome is more stable. Mapping short-read sequencing data to these assemblies revealed diversity and selection patterns on sex chromosomes of more than 100 individual great apes. These reference assemblies are expected to inform human evolution and conservation genetics of non-human apes, all of which are endangered species.

Subject(s)

Hominidae , X Chromosome , Y Chromosome , Animals , Female , Male , Gorilla gorilla/genetics , Hominidae/genetics , Hominidae/classification , Hylobatidae/genetics , Pan paniscus/genetics , Pan troglodytes/genetics , Phylogeny , Pongo abelii/genetics , Pongo pygmaeus/genetics , Telomere/genetics , X Chromosome/genetics , Y Chromosome/genetics , Evolution, Molecular , DNA Copy Number Variations/genetics , Humans , Endangered Species , Reference Standards

2.

Efficient mapping of accurate long reads in minimizer space with mapquik.

Ekim, Baris; Sahlin, Kristoffer; Medvedev, Paul; Berger, Bonnie; Chikhi, Rayan.

Genome Res ; 33(7): 1188-1197, 2023 07.

Article in English | MEDLINE | ID: mdl-37399256

ABSTRACT

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps, which are fundamental bottlenecks to read mapping, for both the human and maize genomes with [Formula: see text] sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a [Formula: see text] speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a [Formula: see text] speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled not only by minimizer-space seeding but also by a novel heuristic [Formula: see text] pseudochaining algorithm, which improves upon the long-standing [Formula: see text] bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.

Subject(s)

High-Throughput Nucleotide Sequencing , Software , Humans , Algorithms , Sequence Analysis, DNA , Genome, Human
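
The seeding idea described in this abstract (anchors built from k consecutive sampled minimizers, indexed only when they occur once in the reference) can be sketched roughly as follows. This is a minimal illustration and not the mapquik implementation: the hash-based sampling rule, the function names, and the parameters `l` and `mod` are assumptions made for the example, and reverse complements and chaining are ignored.

```python
import zlib
from collections import Counter

def sampled_minimizers(seq, l=5, mod=3):
    """Return [(position, l-mer)] for l-mers whose CRC32 hash is divisible by `mod`.

    The sampling rule is an assumption for illustration; it is context-free,
    so the same l-mer is selected wherever it occurs.
    """
    out = []
    for i in range(len(seq) - l + 1):
        lmer = seq[i:i + l]
        if zlib.crc32(lmer.encode()) % mod == 0:
            out.append((i, lmer))
    return out

def k_min_mers(seq, k=3, l=5, mod=3):
    """Return [(start position, tuple of k consecutive sampled minimizers)]."""
    mins = sampled_minimizers(seq, l, mod)
    return [(mins[i][0], tuple(m for _, m in mins[i:i + k]))
            for i in range(len(mins) - k + 1)]

def build_unique_index(reference, k=3, l=5, mod=3):
    """Index only the k-min-mers that occur exactly once in the reference."""
    kmms = k_min_mers(reference, k, l, mod)
    counts = Counter(key for _, key in kmms)
    return {key: pos for pos, key in kmms if counts[key] == 1}

def seed_hits(read, index, k=3, l=5, mod=3):
    """Return (read position, reference position) anchors for exact k-min-mer matches."""
    return [(rpos, index[key])
            for rpos, key in k_min_mers(read, k, l, mod)
            if key in index]
```

For an error-free read, every anchor returned by `seed_hits` lies on the same diagonal (reference position minus read position), which is what makes a cheap chaining step plausible.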

3.

Assembler artifacts include misassembly because of unsafe unitigs and underassembly because of bidirected graphs.

Rahman, Amatur; Medvedev, Paul.

Genome Res ; 2022 Jul 27.

Article in English | MEDLINE | ID: mdl-35896425

ABSTRACT

Recent assemblies by the T2T and VGP consortia have achieved significant accuracy but required a tremendous amount of effort and resources. More typical assembly efforts, on the other hand, still suffer both from misassemblies (joining sequences that should not be adjacent) and from underassemblies (not joining sequences that should be adjacent). To better understand the common algorithm-driven causes of these limitations, we investigated the unitig algorithm, which is a core algorithm at the heart of most assemblers. We prove that, contrary to popular belief, even when there are no sequencing errors, unitigs are not always safe (i.e., they are not guaranteed to be substrings of the sequenced genome). We also prove that the unitigs of a bidirected de Bruijn graph are different from those of a doubled de Bruijn graph and, contrary to our expectations, result in underassembly. Using experimental simulations, we then confirm that these two artifacts exist not only in theory but also in the output of widely used assemblers. In particular, when coverage is low, then even error-free data result in unsafe unitigs; also, unitigs may unnecessarily split palindromes in half if special care is not taken. To the best of our knowledge, this paper is the first to theoretically predict the existence of these assembler artifacts and confirm and measure the extent of their occurrence in practice.

4.

Data structures based on k-mers for querying large collections of sequencing data sets.

Marchet, Camille; Boucher, Christina; Puglisi, Simon J; Medvedev, Paul; Salson, Mikaël; Chikhi, Rayan.

Genome Res ; 31(1): 1-12, 2021 01.

Article in English | MEDLINE | ID: mdl-33328168

ABSTRACT

High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.

Subject(s)

Algorithms , Software , High-Throughput Nucleotide Sequencing , Reproducibility of Results
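
The common intuition behind the surveyed approaches, representing each data set as a set of k-mers and reporting the data sets that contain a large enough fraction of the query's k-mers, can be sketched as below. This is a naive exact-set baseline written for illustration only; the function names and the threshold `theta` are assumptions, and the structures reviewed in the survey replace the plain Python sets with compressed or probabilistic representations.

```python
def kmers(seq, k):
    """Return the set of k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(datasets, k):
    """Map each data set name to the union of the k-mers of its sequences."""
    return {name: set().union(*(kmers(s, k) for s in seqs))
            for name, seqs in datasets.items()}

def query(index, sequence, k, theta=0.8):
    """Return {name: fraction} for data sets holding at least `theta` of the query k-mers."""
    qk = kmers(sequence, k)
    hits = {}
    for name, ks in index.items():
        frac = len(qk & ks) / len(qk) if qk else 0.0
        if frac >= theta:
            hits[name] = frac
    return hits
```

Memory here grows with the total number of distinct k-mers across all data sets, which is the cost that the indexed structures in the survey aim to reduce.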

5.

Markov chains improve the significance computation of overlapping genome annotations.

Gafurov, Askar; Brejová, Brona; Medvedev, Paul.

Bioinformatics ; 38(Suppl 1): i203-i211, 2022 06 24.

Article in English | MEDLINE | ID: mdl-35758770

ABSTRACT

MOTIVATION: Genome annotations are a common way to represent genomic features such as genes, regulatory elements or epigenetic modifications. The amount of overlap between two annotations is often used to ascertain if there is an underlying biological connection between them. In order to distinguish between true biological association and overlap by pure chance, a robust measure of significance is required. One common way to do this is to determine if the number of intervals in the reference annotation that intersect the query annotation is statistically significant. However, currently employed statistical frameworks are often either inefficient or inaccurate when computing P-values on the scale of the whole human genome. RESULTS: We show that finding the P-values under the typically used 'gold' null hypothesis is NP-hard. This motivates us to reformulate the null hypothesis using Markov chains. To be able to measure the fidelity of our Markovian null hypothesis, we develop a fast direct sampling algorithm to estimate the P-value under the gold null hypothesis. We then present an open-source software tool MCDP that computes the P-values under the Markovian null hypothesis in O(m^2 + n) time and O(m) memory, where m and n are the numbers of intervals in the reference and query annotations, respectively. Notably, MCDP runtime and memory usage are independent from the genome length, allowing it to outperform previous approaches in runtime and memory usage by orders of magnitude on human genome annotations, while maintaining the same level of accuracy. AVAILABILITY AND IMPLEMENTATION: The software is available at https://github.com/fmfi-compbio/mc-overlaps. All data for reproducibility are available at https://github.com/fmfi-compbio/mc-overlaps-reproducibility. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Genome, Human , Software , Gold , Humans , Markov Chains , Reproducibility of Results
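
The test statistic named in this abstract, the number of reference intervals that intersect the query annotation, can be computed with a sorted lookup as sketched below. This covers only the statistic, not the Markovian null model or the MCDP algorithm. The half-open interval convention and the requirement that query intervals do not overlap one another are assumptions of this example.

```python
import bisect

def count_overlapping(reference, query):
    """Count reference intervals that intersect at least one query interval.

    Intervals are half-open (start, end) pairs on one chromosome.
    Assumes the query intervals are pairwise disjoint, so that sorting
    by start also sorts by end.
    """
    query = sorted(query)
    starts = [s for s, _ in query]
    ends = [e for _, e in query]
    count = 0
    for s, e in reference:
        # First query interval that ends after the reference interval starts.
        i = bisect.bisect_right(ends, s)
        if i < len(query) and starts[i] < e:
            count += 1
    return count
```

With m reference and n query intervals, this takes O(n log n + m log n) time, independent of genome length.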

6.

The minimizer Jaccard estimator is biased and inconsistent.

Belbasi, Mahdi; Blanca, Antonio; Harris, Robert S; Koslicki, David; Medvedev, Paul.

Bioinformatics ; 38(Suppl 1): i169-i176, 2022 06 24.

Article in English | MEDLINE | ID: mdl-35758786

ABSTRACT

MOTIVATION: Sketching is now widely used in bioinformatics to reduce data size and increase data processing speed. Sketching approaches entice with improved scalability but also carry the danger of decreased accuracy and added bias. In this article, we investigate the minimizer sketch and its use to estimate the Jaccard similarity between two sequences. RESULTS: We show that the minimizer Jaccard estimator is biased and inconsistent, which means that the expected difference (i.e. the bias) between the estimator and the true value is not zero, even in the limit as the lengths of the sequences grow. We derive an analytical formula for the bias as a function of how the shared k-mers are laid out along the sequences. We show both theoretically and empirically that there are families of sequences where the bias can be substantial (e.g. the true Jaccard can be more than double the estimate). Finally, we demonstrate that this bias affects the accuracy of the widely used mashmap read mapping tool. AVAILABILITY AND IMPLEMENTATION: Scripts to reproduce our experiments are available at https://github.com/medvedevgroup/minimizer-jaccard-estimator/tree/main/reproduce. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Software
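
To make the two quantities compared in this abstract concrete, the sketch below computes the true Jaccard similarity of two k-mer sets and the Jaccard similarity of their minimizer sketches. It is an illustration under assumptions: the CRC32 ordering, the window parameter `w`, and the function names are choices made here, not the paper's experimental setup.

```python
import zlib

def kmer_list(seq, k):
    """Return the k-mers of a sequence in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def minimizer_sketch(seq, k, w):
    """Return the set of minimizers: the smallest-hash k-mer in each window of w k-mers."""
    ks = kmer_list(seq, k)
    order = lambda x: (zlib.crc32(x.encode()), x)  # hash order, ties broken lexicographically
    return {min(ks[i:i + w], key=order) for i in range(len(ks) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def true_jaccard(s, t, k):
    return jaccard(set(kmer_list(s, k)), set(kmer_list(t, k)))

def minimizer_jaccard(s, t, k, w):
    return jaccard(minimizer_sketch(s, k, w), minimizer_sketch(t, k, w))
```

With `w = 1` every k-mer is its own minimizer and the two values coincide; for larger windows they can differ, and comparing them over sequence pairs with different layouts of shared k-mers is one way to observe the bias the abstract describes.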

7.

The K-mer File Format: a standardized and compact disk representation of sets of k-mers.

Dufresne, Yoann; Lemane, Teo; Marijon, Pierre; Peterlongo, Pierre; Rahman, Amatur; Kokot, Marek; Medvedev, Paul; Deorowicz, Sebastian; Chikhi, Rayan.

Bioinformatics ; 38(18): 4423-4425, 2022 09 15.

Article in English | MEDLINE | ID: mdl-35904548

ABSTRACT

SUMMARY: Bioinformatics applications increasingly rely on ad hoc disk storage of k-mer sets, e.g. for de Bruijn graphs or alignment indexes. Here, we introduce the K-mer File Format as a general lossless framework for storing and manipulating k-mer sets, realizing space savings of 3-5× compared to other formats, and bringing interoperability across tools. AVAILABILITY AND IMPLEMENTATION: Format specification, C++/Rust API, tools: https://github.com/Kmer-File-Format/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms , Software , Sequence Analysis, DNA , Compact Disks

8.

Whole-genome sequence and assembly of the Javan gibbon (Hylobates moloch).

Escalona, Merly; VanCampen, Jake; Maurer, Nicholas W; Haukness, Marina; Okhovat, Mariam; Harris, Robert S; Watwood, Allison; Hartley, Gabrielle A; O'Neill, Rachel J; Medvedev, Paul; Makova, Kateryna D; Vollmers, Christopher; Carbone, Lucia; Green, Richard E.

J Hered ; 114(1): 35-43, 2023 03 16.

Article in English | MEDLINE | ID: mdl-36146896

ABSTRACT

The Javan gibbon, Hylobates moloch, is an endangered gibbon species restricted to the forest remnants of western and central Java, Indonesia, and one of the rarest of the Hylobatidae family. Hylobatids consist of 4 genera (Hoolock, Hylobates, Symphalangus, and Nomascus) that are characterized by different numbers of chromosomes, ranging from 38 to 52. The underlying cause of this karyotype plasticity is not entirely understood, at least in part, due to the limited availability of genomic data. Here we present the first scaffold-level assembly for H. moloch using a combination of whole-genome Illumina short reads, 10X Chromium linked reads, PacBio, and Oxford Nanopore long reads and proximity-ligation data. This Hylobates genome represents a valuable new resource for comparative genomics studies in primates.

Subject(s)

Genome , Hylobates , Animals , Hylobates/genetics , Forests , Endangered Species , Indonesia

9.

Dynamic evolution of great ape Y chromosomes.

Cechova, Monika; Vegesna, Rahulsimham; Tomaszkiewicz, Marta; Harris, Robert S; Chen, Di; Rangavittal, Samarth; Medvedev, Paul; Makova, Kateryna D.

Proc Natl Acad Sci U S A ; 117(42): 26273-26280, 2020 10 20.

Article in English | MEDLINE | ID: mdl-33020265

ABSTRACT

The mammalian male-specific Y chromosome plays a critical role in sex determination and male fertility. However, because of its repetitive and haploid nature, it is frequently absent from genome assemblies and remains enigmatic. The Y chromosomes of great apes represent a particular puzzle: their gene content is more similar between human and gorilla than between human and chimpanzee, even though human and chimpanzee share a more recent common ancestor. To solve this puzzle, here we constructed a dataset including Ys from all extant great ape genera. We generated assemblies of bonobo and orangutan Ys from short and long sequencing reads and aligned them with the publicly available human, chimpanzee, and gorilla Y assemblies. Analyzing this dataset, we found that the genus Pan, which includes chimpanzee and bonobo, experienced accelerated substitution rates. Pan also exhibited elevated gene death rates. These observations are consistent with high levels of sperm competition in Pan. Furthermore, we inferred that the great ape common ancestor already possessed multicopy sequences homologous to most human and chimpanzee palindromes. Nonetheless, each species also acquired distinct ampliconic sequences. We also detected increased chromatin contacts between and within palindromes (from Hi-C data), likely facilitating gene conversion and structural rearrangements. Our results highlight the dynamic mode of Y chromosome evolution and open avenues for studies of male-specific dispersal in endangered great ape species.

Subject(s)

Hominidae/genetics , Y Chromosome/genetics , Animals , Biological Evolution , Evolution, Molecular , Gene Conversion , Gorilla gorilla/genetics , Humans , Pan paniscus/genetics , Pan troglodytes/genetics , Pongo/genetics , Sequence Analysis, DNA

10.

Recombination Marks the Evolutionary Dynamics of a Recently Endogenized Retrovirus.

Yang, Lei; Malhotra, Raunaq; Chikhi, Rayan; Elleder, Daniel; Kaiser, Theodora; Rong, Jesse; Medvedev, Paul; Poss, Mary.

Mol Biol Evol ; 38(12): 5423-5436, 2021 12 09.

Article in English | MEDLINE | ID: mdl-34480565

ABSTRACT

All vertebrate genomes have been colonized by retroviruses along their evolutionary trajectory. Although endogenous retroviruses (ERVs) can contribute important physiological functions to contemporary hosts, such benefits are attributed to long-term coevolution of ERV and host because germline infections are rare and expansion is slow, and because the host effectively silences them. The genomes of several outbred species including mule deer (Odocoileus hemionus) are currently being colonized by ERVs, which provides an opportunity to study ERV dynamics at a time when few are fixed. We previously established the locus-specific distribution of cervid ERV (CrERV) in populations of mule deer. In this study, we determine the molecular evolutionary processes acting on CrERV at each locus in the context of phylogenetic origin, genome location, and population prevalence. A mule deer genome was de novo assembled from short- and long-insert mate pair reads and CrERV sequence generated at each locus. We report that CrERV composition and diversity have recently measurably increased by horizontal acquisition of a new retrovirus lineage. This new lineage has further expanded CrERV burden and CrERV genomic diversity by activating and recombining with existing CrERV. Resulting interlineage recombinants then endogenize and subsequently expand. CrERV loci are significantly closer to genes than expected if integration were random and gene proximity might explain the recent expansion of one recombinant CrERV lineage. Thus, in mule deer, retroviral colonization is a dynamic period in the molecular evolution of CrERV that also provides a burst of genomic diversity to the host population.

Subject(s)

Deer , Endogenous Retroviruses , Animals , Biological Evolution , Deer/genetics , Endogenous Retroviruses/genetics , Evolution, Molecular , Phylogeny , Recombination, Genetic

11.

What do Eulerian and Hamiltonian cycles have to do with genome assembly?

Medvedev, Paul; Pop, Mihai.

PLoS Comput Biol ; 17(5): e1008928, 2021 05.

Article in English | MEDLINE | ID: mdl-34014915

ABSTRACT

Many students are taught about genome assembly using the dichotomy between the complexity of finding Eulerian and Hamiltonian cycles (easy versus hard, respectively). This dichotomy is sometimes used to motivate the use of de Bruijn graphs in practice. In this paper, we explain that while de Bruijn graphs have indeed been very useful, the reason has nothing to do with the complexity of the Hamiltonian and Eulerian cycle problems. We give 2 arguments. The first is that a genome reconstruction is never unique and hence an algorithm for finding Eulerian or Hamiltonian cycles is not part of any assembly algorithm used in practice. The second is that even if an arbitrary genome reconstruction was desired, one could do so in linear time in both the Eulerian and Hamiltonian paradigms.

Subject(s)

Genome , Algorithms , Proof of Concept Study
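
For readers unfamiliar with the Eulerian side of the dichotomy, the sketch below finds an Eulerian cycle in an edge-centric de Bruijn graph with Hierholzer's algorithm, which runs in time linear in the number of k-mers. It is a textbook illustration, not code from the paper; it assumes a single circular genome, error-free k-mers, and a single strand, and the function names are chosen for this example.

```python
from collections import defaultdict

def circular_kmers(genome, k):
    """Return the k-mers of a circular genome, one per start position."""
    ext = genome + genome[:k - 1]
    return [ext[i:i + k] for i in range(len(genome))]

def eulerian_cycle(kmers):
    """Hierholzer's algorithm; nodes are (k-1)-mers and every k-mer is one edge."""
    adj = defaultdict(list)
    for km in kmers:
        adj[km[:-1]].append(km[1:])
    stack, cycle = [kmers[0][:-1]], []
    while stack:
        v = stack[-1]
        if adj[v]:
            stack.append(adj[v].pop())   # follow an unused edge
        else:
            cycle.append(stack.pop())    # dead end: emit node
    return cycle[::-1]                   # node sequence, first node equals last

def spell(cycle):
    """Spell the circular string of a closed walk: one character per edge."""
    return "".join(node[-1] for node in cycle[1:])
```

The spelled string always has the same k-mer multiset as the input genome, but it need not equal the input genome when repeats allow more than one Eulerian cycle, which is in line with the abstract's first argument that a reconstruction is not unique.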

12.

Dosage regulation, and variation in gene expression and copy number of human Y chromosome ampliconic genes.

Vegesna, Rahulsimham; Tomaszkiewicz, Marta; Medvedev, Paul; Makova, Kateryna D.

PLoS Genet ; 15(9): e1008369, 2019 09.

Article in English | MEDLINE | ID: mdl-31525193

ABSTRACT

The Y chromosome harbors nine multi-copy ampliconic gene families expressed exclusively in testis. The gene copies within each family are >99% identical to each other, which poses a major challenge in evaluating their copy number. Recent studies demonstrated high variation in Y ampliconic gene copy number among humans. However, how this variation affects expression levels in human testis remains understudied. Here we developed a novel computational tool Ampliconic Copy Number Estimator (AmpliCoNE) that utilizes read sequencing depth information to estimate Y ampliconic gene copy number per family. We applied this tool to whole-genome sequencing data of 149 men with matched testis expression data whose samples are part of the Genotype-Tissue Expression (GTEx) project. We found that the Y ampliconic gene families with low copy number in humans were deleted or pseudogenized in non-human great apes, suggesting relaxation of functional constraints. Among the Y ampliconic gene families, higher copy number leads to higher expression. Within the Y ampliconic gene families, copy number does not influence gene expression, rather a high tolerance for variation in gene expression was observed in testis of presumably healthy men. No differences in gene expression levels were found among major Y haplogroups. Age positively correlated with expression levels of the HSFY and PRY gene families in the African subhaplogroup E1b, but not in the European subhaplogroups R1b and I1. We also found that expression of five Y ampliconic gene families is coordinated with that of their non-Y (i.e. X or autosomal) homologs. Indeed, five ampliconic gene families had consistently lower expression levels when compared to their non-Y homologs suggesting dosage regulation, while the HSFY family had higher expression levels than its X homolog and thus lacked dosage regulation.

Subject(s)

Chromosomes, Human, Y/genetics , Genes, Y-Linked/genetics , Sequence Analysis, DNA/methods , Animals , Chromosomes, Human, Y/physiology , DNA Copy Number Variations/genetics , Databases, Genetic , Dosage Compensation, Genetic/genetics , Dosage Compensation, Genetic/physiology , Epigenesis, Genetic/genetics , Gene Dosage/genetics , Gene Expression/genetics , Gene Expression Regulation/genetics , Genes, Y-Linked/physiology , Heat Shock Transcription Factors/genetics , Heat Shock Transcription Factors/metabolism , Humans , Male , Multigene Family/genetics , Testis/metabolism
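
The general principle of estimating copy number from read depth can be sketched as the ratio between the mean depth over positions belonging to a gene family and the mean depth over single-copy control positions. This is a simplified illustration and not the AmpliCoNE method: the function name, the position lists, and the absence of any GC-content or mappability correction are assumptions of this example.

```python
from statistics import mean

def estimate_copy_number(depth, family_positions, control_positions):
    """Estimate the copy number of a gene family from per-position read depth.

    depth: sequence of read depths indexed by position.
    family_positions: positions assumed to be informative for the family.
    control_positions: positions assumed to lie in single-copy regions.
    """
    control = mean(depth[p] for p in control_positions)
    if control == 0:
        raise ValueError("control regions have zero depth")
    return mean(depth[p] for p in family_positions) / control
```

For example, a family whose positions show three times the depth of the single-copy controls would be estimated at about three copies.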

13.

Modeling biological problems in computer science: a case study in genome assembly.

Medvedev, Paul.

Brief Bioinform ; 20(4): 1376-1383, 2019 07 19.

Article in English | MEDLINE | ID: mdl-29394324

ABSTRACT

As computer scientists working in bioinformatics/computational biology, we often face the challenge of coming up with an algorithm to answer a biological question. This occurs in many areas, such as variant calling, alignment and assembly. In this tutorial, we use the example of the genome assembly problem to demonstrate how to go from a question in the biological realm to a solution in the computer science realm. We show the modeling process step-by-step, including all the intermediate failed attempts. Please note this is not an introduction to how genome assembly algorithms work and, if treated as such, would be incomplete and unnecessarily long-winded.

Subject(s)

Algorithms , Computational Biology/methods , Models, Biological , Computer Simulation , Genomics/statistics & numerical data , Whole Genome Sequencing/statistics & numerical data

14.

Improved representation of sequence bloom trees.

Harris, Robert S; Medvedev, Paul.

Bioinformatics ; 36(3): 721-727, 2020 02 01.

Article in English | MEDLINE | ID: mdl-31504157

ABSTRACT

MOTIVATION: Algorithmic solutions to index and search biological databases are a fundamental part of bioinformatics, providing underlying components to many end-user tools. Inexpensive next generation sequencing has filled publicly available databases such as the Sequence Read Archive beyond the capacity of traditional indexing methods. Recently, the Sequence Bloom Tree (SBT) and its derivatives were proposed as a way to efficiently index such data for queries about transcript presence. RESULTS: We build on the SBT framework to construct the HowDe-SBT data structure, which uses a novel partitioning of information to reduce the construction and query time as well as the size of the index. Compared to previous SBT methods, on real RNA-seq data, HowDe-SBT can construct the index in less than 36% of the time and with 39% less space and can answer small-batch queries at least five times faster. We also develop a theoretical framework in which we can analyze and bound the space and query performance of HowDe-SBT compared to other SBT methods. AVAILABILITY AND IMPLEMENTATION: HowDe-SBT is available as a free open source program on https://github.com/medvedevgroup/HowDeSBT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Software , Trees , Algorithms , High-Throughput Nucleotide Sequencing , Sequence Analysis, RNA
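
The basic Sequence Bloom Tree idea that this work builds on can be sketched as follows: each leaf holds a Bloom filter of one experiment's k-mers, each internal node holds the union of its children, and a query descends only into subtrees that contain a sufficient fraction of the query k-mers. This is a plain SBT sketch, not the HowDe-SBT partitioning; the filter size `M`, the two SHA-256-derived hash positions, and all names are assumptions for illustration.

```python
import hashlib

M = 1 << 12  # number of bits per Bloom filter (illustrative)

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def bit_positions(kmer, num_hashes=2):
    """Derive `num_hashes` bit positions from one SHA-256 digest."""
    d = hashlib.sha256(kmer.encode()).digest()
    return [int.from_bytes(d[4 * i:4 * i + 4], "big") % M for i in range(num_hashes)]

def bloom(kmer_set):
    """Build a Bloom filter stored as a Python integer bitset."""
    bits = 0
    for km in kmer_set:
        for p in bit_positions(km):
            bits |= 1 << p
    return bits

def contains(bits, kmer):
    return all((bits >> p) & 1 for p in bit_positions(kmer))

class Node:
    def __init__(self, name, bits, children=()):
        self.name, self.bits, self.children = name, bits, list(children)

def build_tree(experiments, k):
    """Pair nodes bottom-up; an internal node is the bitwise OR of its children."""
    nodes = [Node(name, bloom(kmers(seq, k))) for name, seq in experiments.items()]
    while len(nodes) > 1:
        merged = []
        for i in range(0, len(nodes), 2):
            pair = nodes[i:i + 2]
            merged.append(pair[0] if len(pair) == 1
                          else Node(None, pair[0].bits | pair[1].bits, pair))
        nodes = merged
    return nodes[0]

def query(node, query_kmers, theta=0.9):
    """Return names of experiments whose filter holds >= theta of the query k-mers."""
    present = sum(contains(node.bits, km) for km in query_kmers)
    if present < theta * len(query_kmers):
        return []          # prune the whole subtree
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in query(child, query_kmers, theta)]
```

Bloom filters give no false negatives, so an experiment that truly contains the query k-mers is always reported; false positives remain possible and depend on `M` and the number of hashes.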

15.

Ten Simple Rules for writing algorithmic bioinformatics conference papers.

Medvedev, Paul.

PLoS Comput Biol ; 16(4): e1007742, 2020 04.

Article in English | MEDLINE | ID: mdl-32240173

ABSTRACT

Conferences are great venues for disseminating algorithmic bioinformatics results, but they unfortunately do not offer an opportunity to make major revisions in the way that journals do. As a result, it is not possible for authors to fix mistakes that might be easily correctable but nevertheless can cause the paper to be rejected. As a reviewer, I wish that I had the opportunity to tell the authors, "Hey, you forgot to do this really important thing, without which it is hard to accept the paper, but if you could go back and fix it, you might have a great paper for the conference." This lack of a back and forth can be especially problematic for first-time submitters or those from outside the field, e.g., biologists. In this article, I outline Ten Simple Rules to follow when writing an algorithmic bioinformatics conference paper to avoid having it rejected.

Subject(s)

Information Dissemination/methods , Algorithms , Computational Biology/methods , Congresses as Topic , Humans , Publishing/standards , Writing/standards

16.

Y and W Chromosome Assemblies: Approaches and Discoveries.

Tomaszkiewicz, Marta; Medvedev, Paul; Makova, Kateryna D.

Trends Genet ; 33(4): 266-282, 2017 04.

Article in English | MEDLINE | ID: mdl-28236503

ABSTRACT

Hundreds of vertebrate genomes have been sequenced and assembled to date. However, most sequencing projects have ignored the sex chromosomes unique to the heterogametic sex (Y and W), which are known as sex-limited chromosomes (SLCs). Indeed, haploid and repetitive Y chromosomes in species with male heterogamety (XY), and W chromosomes in species with female heterogamety (ZW), are difficult to sequence and assemble. Nevertheless, obtaining their sequences is important for understanding the intricacies of vertebrate genome function and evolution. Recent progress has been made towards the adaptation of next-generation sequencing (NGS) techniques to deciphering SLC sequences. We review here currently available methodology and results with regard to SLC sequencing and assembly. We focus on vertebrates, but bring in some examples from other taxa.

Subject(s)

Evolution, Molecular , Sex Chromosomes/genetics , Sex Determination Processes , Y Chromosome/genetics , Animals , Female , Genome , High-Throughput Nucleotide Sequencing , Male

17.

Toward fast and accurate SNP genotyping from whole genome sequencing data for bedside diagnostics.

Sun, Chen; Medvedev, Paul.

Bioinformatics ; 35(3): 415-420, 2019 02 01.

Article in English | MEDLINE | ID: mdl-30032192

ABSTRACT

Motivation: Genotyping a set of variants from a database is an important step for identifying known genetic traits and disease-related variants within an individual. The growing size of variant databases as well as the high depth of sequencing data poses an efficiency challenge. In clinical applications, where time is crucial, alignment-based methods are often not fast enough. To fill the gap, Shajii et al. propose LAVA, an alignment-free genotyping method which is able to more quickly genotype single nucleotide polymorphisms (SNPs); however, there remains large room for improvements in running time and accuracy. Results: We present the VarGeno method for SNP genotyping from Illumina whole genome sequencing data. VarGeno builds upon LAVA by improving the speed of k-mer querying as well as the accuracy of the genotyping strategy. We evaluate VarGeno on several read datasets using different genotyping SNP lists. VarGeno performs 7-13 times faster than LAVA with similar memory usage, while improving accuracy. Availability and implementation: VarGeno is freely available at: https://github.com/medvedevgroup/vargeno. Supplementary information: Supplementary data are available at Bioinformatics online.

Subject(s)

Genotyping Techniques , Polymorphism, Single Nucleotide , Whole Genome Sequencing , Genotype

18.

DiscoverY: a classifier for identifying Y chromosome sequences in male assemblies.

Rangavittal, Samarth; Stopa, Natasha; Tomaszkiewicz, Marta; Sahlin, Kristoffer; Makova, Kateryna D; Medvedev, Paul.

BMC Genomics ; 20(1): 641, 2019 Aug 09.

Article in English | MEDLINE | ID: mdl-31399045

ABSTRACT

BACKGROUND: Although the Y chromosome plays an important role in male sex determination and fertility, it is currently understudied due to its haploid and repetitive nature. Methods to isolate Y-specific contigs from a whole-genome assembly broadly fall into two categories. The first involves retrieving Y-contigs using proportion sharing with a female, but such a strategy is prone to false positives in the absence of a high-quality, complete female reference. A second strategy uses the ratio of depth of coverage from male and female reads to select Y-contigs, but such a method requires high-depth sequencing of a female and cannot utilize existing female references. RESULTS: We develop a k-mer based method called DiscoverY, which combines proportion sharing with female with depth of coverage from male reads to classify contigs as Y-chromosomal. We evaluate the performance of DiscoverY on human and gorilla genomes, across different sequencing platforms including Illumina, 10X, and PacBio. In the cases where the male and female data are of high quality, DiscoverY has a high precision and recall and outperforms existing methods. For cases when a high quality female reference is not available, we quantify the effect of using draft reference or even just raw sequencing reads from a female. CONCLUSION: DiscoverY is an effective method to isolate Y-specific contigs from a whole-genome assembly. However, regions homologous to the X chromosome remain difficult to detect.

Subject(s)

Chromosomes, Human, Y/genetics , Sequence Analysis, DNA/methods , Female , Haploidy , Humans , Male , Sequence Analysis, DNA/economics , Time Factors
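
The two signals this abstract combines, the proportion of a contig's k-mers shared with a female and the depth of coverage from male reads, can be computed per contig as sketched below. This is a toy illustration and not the DiscoverY classifier: the thresholds `max_shared` and `min_depth`, the use of median k-mer abundance as a depth proxy, and all function names are assumptions, and contigs are assumed to be at least k bases long.

```python
from collections import Counter

def kmer_list(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def contig_features(contig, female_kmers, male_counts, k):
    """Return (fraction of k-mers shared with female, median male k-mer abundance)."""
    ks = kmer_list(contig, k)
    shared = sum(km in female_kmers for km in ks) / len(ks)
    depth = sorted(male_counts[km] for km in ks)[len(ks) // 2]
    return shared, depth

def is_y_candidate(contig, female_kmers, male_counts, k,
                   max_shared=0.2, min_depth=1):
    """Flag contigs with little female sharing and some support from male reads."""
    shared, depth = contig_features(contig, female_kmers, male_counts, k)
    return shared <= max_shared and depth >= min_depth

def count_kmers(reads, k):
    """Count k-mer abundances across a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmer_list(read, k))
    return counts
```

Regions homologous to the X chromosome would score a high shared proportion under this scheme, which matches the limitation noted in the abstract's conclusion.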

19.

A time- and cost-effective strategy to sequence mammalian Y Chromosomes: an application to the de novo assembly of gorilla Y.

Tomaszkiewicz, Marta; Rangavittal, Samarth; Cechova, Monika; Campos Sanchez, Rebeca; Fescemyer, Howard W; Harris, Robert; Ye, Danling; O'Brien, Patricia C M; Chikhi, Rayan; Ryder, Oliver A; Ferguson-Smith, Malcolm A; Medvedev, Paul; Makova, Kateryna D.

Genome Res ; 26(4): 530-40, 2016 Apr.

Article in English | MEDLINE | ID: mdl-26934921

ABSTRACT

The mammalian Y Chromosome sequence, critical for studying male fertility and dispersal, is enriched in repeats and palindromes, and thus, is the most difficult component of the genome to assemble. Previously, expensive and labor-intensive BAC-based techniques were used to sequence the Y for a handful of mammalian species. Here, we present a much faster and more affordable strategy for sequencing and assembling mammalian Y Chromosomes of sufficient quality for most comparative genomics analyses and for conservation genetics applications. The strategy combines flow sorting, short- and long-read genome and transcriptome sequencing, and droplet digital PCR with novel and existing computational methods. It can be used to reconstruct sex chromosomes in a heterogametic sex of any species. We applied our strategy to produce a draft of the gorilla Y sequence. The resulting assembly allowed us to refine gene content, evaluate copy number of ampliconic gene families, locate species-specific palindromes, examine the repetitive element content, and produce sequence alignments with human and chimpanzee Y Chromosomes. Our results inform the evolution of the hominine (human, chimpanzee, and gorilla) Y Chromosomes. Surprisingly, we found the gorilla Y Chromosome to be similar to the human Y Chromosome, but not to the chimpanzee Y Chromosome. Moreover, we have utilized the assembled gorilla Y Chromosome sequence to design genetic markers for studying the male-specific dispersal of this endangered species.

Subject(s)

Computational Biology; High-Throughput Nucleotide Sequencing; Mammals/genetics; Y Chromosome; Animals; Computational Biology/methods; Gene Rearrangement; Genome; Genomics; Gorilla gorilla/genetics; Humans; Inverted Repeat Sequences; Male; Microsatellite Repeats; Pan troglodytes/genetics; Repetitive Sequences, Nucleic Acid; Sequence Analysis, DNA

20.

RecoverY: k-mer-based read classification for Y-chromosome-specific sequencing and assembly.

Rangavittal, Samarth; Harris, Robert S; Cechova, Monika; Tomaszkiewicz, Marta; Chikhi, Rayan; Makova, Kateryna D; Medvedev, Paul.

Bioinformatics ; 34(7): 1125-1131, 2018 Apr 1.

Article in English | MEDLINE | ID: mdl-29194476

ABSTRACT

Motivation: The haploid mammalian Y chromosome is usually under-represented in genome assemblies due to high repeat content and low depth due to its haploid nature. One strategy to ameliorate the low coverage of Y sequences is to experimentally enrich Y-specific material before assembly. As the enrichment process is imperfect, algorithms are needed to identify putative Y-specific reads prior to downstream assembly. A strategy that uses k-mer abundances to identify such reads was used to assemble the gorilla Y. However, the strategy required the manual setting of key parameters, a time-consuming process leading to sub-optimal assemblies. Results: We develop a method, RecoverY, that selects Y-specific reads by automatically choosing the abundance level at which a k-mer is deemed to originate from the Y. This algorithm uses prior knowledge about the Y chromosome of a related species or known Y transcript sequences. We evaluate RecoverY on both simulated and real data, for human and gorilla, and investigate its robustness to important parameters. We show that RecoverY leads to a vastly superior assembly compared to alternate strategies of filtering the reads or contigs. Compared to the preliminary strategy used by Tomaszkiewicz et al., we achieve a 33% improvement in assembly size and a 20% improvement in the NG50, demonstrating the power of automatic parameter selection. Availability and implementation: Our tool RecoverY is freely available at https://github.com/makovalab-psu/RecoverY. Contact: kmakova@bx.psu.edu or pashadag@cse.psu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
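The core idea in the abstract, choosing a k-mer abundance threshold automatically from trusted Y sequence and then keeping reads whose k-mers pass it, can be sketched as follows. This is an illustrative toy, not the RecoverY code: the function names, the low-quantile rule for picking the threshold, and the `min_fraction` read filter are assumptions introduced here.

```python
from collections import Counter


def kmers(seq, k):
    """Return all k-mers of seq (forward strand only, for simplicity)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def choose_threshold(read_counts, trusted_y_seqs, k, quantile=0.05):
    """Pick an abundance cut-off from k-mers found in trusted Y sequence
    (e.g. a related species' Y or known Y transcripts).
    Assumption: a low quantile of their abundances in the enriched reads."""
    abund = sorted(
        read_counts[km]
        for s in trusted_y_seqs
        for km in set(kmers(s, k))
        if km in read_counts
    )
    if not abund:
        raise ValueError("no trusted Y k-mers found in the reads")
    return abund[int(quantile * (len(abund) - 1))]


def select_y_reads(reads, trusted_y_seqs, k=4, min_fraction=0.5, quantile=0.05):
    """Keep reads in which at least min_fraction of k-mers reach the
    automatically chosen abundance threshold."""
    counts = Counter(km for r in reads for km in kmers(r, k))
    threshold = choose_threshold(counts, trusted_y_seqs, k, quantile)
    kept = []
    for r in reads:
        ks = kmers(r, k)
        if not ks:
            continue
        passing = sum(counts[km] >= threshold for km in ks) / len(ks)
        if passing >= min_fraction:
            kept.append(r)
    return threshold, kept
```

Because Y-enriched material is over-represented after enrichment, k-mers from the Y tend to be more abundant than contaminating k-mers. Deriving the cut-off from known Y sequence is what replaces the manual parameter setting criticized in the abstract.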

Subject(s)

High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Software; Y Chromosome; Algorithms; Animals; Chromosomes, Mammalian; Genomics/methods; Gorilla gorilla/genetics; Humans; Male; Mammals
