Results 1 - 20 of 37
1.
Science ; 383(6690): eabn3263, 2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38422184

ABSTRACT

Vocal production learning ("vocal learning") is a convergently evolved trait in vertebrates. To identify brain genomic elements associated with mammalian vocal learning, we integrated genomic, anatomical, and neurophysiological data from the Egyptian fruit bat (Rousettus aegyptiacus) with analyses of the genomes of 215 placental mammals. First, we identified a set of proteins evolving more slowly in vocal learners. Then, we discovered a vocal motor cortical region in the Egyptian fruit bat, an emergent vocal learner, and leveraged that knowledge to identify active cis-regulatory elements in the motor cortex of vocal learners. Machine learning methods applied to motor cortex open chromatin revealed 50 enhancers robustly associated with vocal learning whose activity tended to be lower in vocal learners. Our research implicates convergent losses of motor cortex regulatory elements in mammalian vocal learning evolution.


Subject(s)
Enhancer Elements, Genetic , Eutheria , Evolution, Molecular , Gene Expression Regulation , Motor Cortex , Motor Neurons , Proteins , Vocalization, Animal , Animals , Chiroptera/genetics , Chiroptera/physiology , Vocalization, Animal/physiology , Motor Cortex/cytology , Motor Cortex/physiology , Chromatin/metabolism , Motor Neurons/physiology , Larynx/physiology , Epigenesis, Genetic , Genome , Proteins/genetics , Proteins/metabolism , Amino Acid Sequence , Eutheria/genetics , Eutheria/physiology , Machine Learning
2.
Genome Biol ; 24(1): 217, 2023 10 02.
Article in English | MEDLINE | ID: mdl-37784172

ABSTRACT

Interactive graphical genome browsers are essential tools in genomics, but they do not contain all the recent genome assemblies. We create the Genome Archive (GenArk) collection of UCSC Genome Browsers from NCBI assemblies. Built on our established track hub system, this enables fast visualization of annotations. Assemblies come with gene models, repeat masks, BLAT, and in silico PCR. Users can add annotations via track hubs and custom tracks. We can bulk-import third-party resources, demonstrated with TOGA and Ensembl gene models for hundreds of assemblies. Three thousand two hundred sixty-nine GenArk assemblies are listed at https://hgdownload.soe.ucsc.edu/hubs/ and can be searched for on the Genome Browser gateway page.


Subject(s)
Genome , Software , Genomics , Archives , Nucleic Acid Amplification Techniques , Databases, Genetic , Internet
3.
Res Sq ; 2023 Apr 03.
Article in English | MEDLINE | ID: mdl-37066427

ABSTRACT

Interactive graphical genome browsers are essential tools for biologists working with DNA sequences. Although tens of thousands of new genome assemblies have become available over the last decade, accessibility is limited by the work involved in manually creating browsers and curating annotations. The results can push the limits of data storage infrastructure. To facilitate managing this increasing number of genome assemblies, we created the Genome Archive (GenArk) collection of UCSC Genome Browsers from assemblies hosted at NCBI (1). Built on our established assembly hub system, this collection enables fast, on-demand visualization of chromosome regions without requiring a database server. Available annotations include gene models, some mapped through whole-genome alignments, repeat masks, GC content, and others. We also modified our popular BLAT (2) aligner and in-silico PCR to support a large number of genomes using limited RAM. Users can upload additional annotations themselves via track hubs (3) and custom tracks. We can import more annotations in bulk from third-party resources, demonstrated here with TOGA (4) gene models. 2,430 GenArk assemblies are listed at https://hgdownload.soe.ucsc.edu/hubs/ and can be found by searching on the main UCSC gateway page. We will continue to add high-quality human assemblies; for other organisms, we look forward to receiving requests from the research community for more browsers and whole-genome alignments via http://genome.ucsc.edu/assemblyRequest.html.

4.
Science ; 380(6643): eabn3943, 2023 04 28.
Article in English | MEDLINE | ID: mdl-37104599

ABSTRACT

Zoonomia is the largest comparative genomics resource for mammals produced to date. By aligning genomes for 240 species, we identify bases that, when mutated, are likely to affect fitness and alter disease risk. At least 332 million bases (~10.7%) in the human genome are unusually conserved across species (evolutionarily constrained) relative to neutrally evolving repeats, and 4552 ultraconserved elements are nearly perfectly conserved. Of 101 million significantly constrained single bases, 80% are outside protein-coding exons and half have no functional annotations in the Encyclopedia of DNA Elements (ENCODE) resource. Changes in genes and regulatory elements are associated with exceptional mammalian traits, such as hibernation, that could inform therapeutic development. Earth's vast and imperiled biodiversity offers distinctive power for identifying genetic variants that affect genome function and organismal phenotypes.


Subject(s)
Eutheria , Evolution, Molecular , Animals , Female , Humans , Conserved Sequence/genetics , Eutheria/genetics , Genome, Human
5.
Genome Res ; 31(11): 2035-2049, 2021 11.
Article in English | MEDLINE | ID: mdl-34667117

ABSTRACT

Vocal learning, the ability to imitate sounds from conspecifics and the environment, is a key component of human spoken language and learned song in three independently evolved avian groups: oscine songbirds, parrots, and hummingbirds. Humans and each of these three bird clades exhibit specialized behavioral, neuroanatomical, and brain gene expression convergence related to vocal learning, speech, and song. To understand the evolutionary basis of vocal learning gene specializations and convergence, we searched for and identified accelerated genomic regions (ARs), a marker of positive selection, specific to vocal learning birds. We found avian vocal learner-specific ARs, and they were enriched in noncoding regions near genes with known speech functions or brain gene expression specializations in humans and vocal learning birds, including FOXP2, NEUROD6, ZEB2, and MEF2C, and near genes with major neurodevelopmental functions, including NR2F1, NRP2, and BCL11B. We also found enrichment near the SFARI class S genes associated with syndromic vocal communication forms of autism spectrum disorders. These findings reveal strong candidate noncoding regions near genes for the evolutionary adaptations that distinguish vocal learning species from their close vocal nonlearning relatives and provide further evidence of molecular convergence between birdsong and human spoken language.


Subject(s)
Songbirds , Speech , Animals , Brain/metabolism , Genomics , Humans , Learning , Repressor Proteins/metabolism , Songbirds/genetics , Tumor Suppressor Proteins/metabolism , Vocalization, Animal
7.
Nucleic Acids Res ; 49(D1): D916-D923, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33270111

ABSTRACT

The GENCODE project annotates human and mouse genes and transcripts supported by experimental data with high accuracy, providing a foundational resource that supports genome biology and clinical genomics. GENCODE annotation processes make use of primary data and bioinformatic tools and analysis generated both within the consortium and externally to support the creation of transcript structures and the determination of their function. Here, we present improvements to our annotation infrastructure, bioinformatics tools, and analysis, and the advances they support in the annotation of the human and mouse genomes including: the completion of first pass manual annotation for the mouse reference genome; targeted improvements to the annotation of genes associated with SARS-CoV-2 infection; collaborative projects to achieve convergence across reference annotation databases for the annotation of human and mouse protein-coding genes; and the first GENCODE manually supervised automated annotation of lncRNAs. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.


Subject(s)
COVID-19/prevention & control , Computational Biology/methods , Databases, Genetic , Genomics/methods , Molecular Sequence Annotation/methods , SARS-CoV-2/genetics , Animals , COVID-19/epidemiology , COVID-19/virology , Epidemics , Humans , Internet , Mice , Pseudogenes/genetics , RNA, Long Noncoding/genetics , SARS-CoV-2/metabolism , SARS-CoV-2/physiology , Transcription, Genetic/genetics
8.
Science ; 370(6523), 2020 12 18.
Article in English | MEDLINE | ID: mdl-33335035

ABSTRACT

The rhesus macaque (Macaca mulatta) is the most widely studied nonhuman primate (NHP) in biomedical research. We present an updated reference genome assembly (Mmul_10, contig N50 = 46 Mbp) that increases the sequence contiguity 120-fold and annotate it using 6.5 million full-length transcripts, thus improving our understanding of gene content, isoform diversity, and repeat organization. With the improved assembly of segmental duplications, we discovered new lineage-specific genes and expanded gene families that are potentially informative in studies of evolution and disease susceptibility. Whole-genome sequencing (WGS) data from 853 rhesus macaques identified 85.7 million single-nucleotide variants (SNVs) and 10.5 million indel variants, including potentially damaging variants in genes associated with human autism and developmental delay, providing a framework for developing noninvasive NHP models of human disease.
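The contig N50 quoted above is a standard contiguity statistic: the length L such that contigs of length L or longer together cover at least half of the assembled bases. A minimal sketch of that calculation follows; the function name `n50` and the toy contig lengths are illustrative assumptions, not taken from the Mmul_10 assembly or from any tool used in the study.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig (or read) lengths.

    N50 is the largest length L such that sequences of length >= L
    together account for at least 50% of the total bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare doubled running sum to the total to avoid float division.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable: the full sum always reaches the total")


# Toy example (hypothetical lengths): total = 20, and 6 + 5 = 11 >= 10.
print(n50([2, 3, 4, 5, 6]))  # 5
```

Because N50 depends only on the length distribution, a 120-fold increase in contiguity says nothing by itself about base-level accuracy, which is why assemblies are usually reported with both kinds of metric.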


Subject(s)
Genetic Predisposition to Disease , Genome , Macaca mulatta/genetics , Polymorphism, Single Nucleotide , Animals , Genetic Variation , Humans , Molecular Sequence Annotation , Whole Genome Sequencing
9.
Nature ; 587(7833): 246-251, 2020 11.
Article in English | MEDLINE | ID: mdl-33177663

ABSTRACT

New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies (1-3). For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database (4) increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies (5) are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus (6), a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.


Subject(s)
Genome/genetics , Genomics/methods , Sequence Alignment/methods , Software , Vertebrates/genetics , Amnion , Animals , Computer Simulation , Genomics/standards , Haplotypes , Humans , Quality Control , Sequence Alignment/standards , Software/standards
10.
Nature ; 587(7833): 252-257, 2020 11.
Article in English | MEDLINE | ID: mdl-33177665

ABSTRACT

Whole-genome sequencing projects are increasingly populating the tree of life and characterizing biodiversity (1-4). Sparse taxon sampling has previously been proposed to confound phylogenetic inference (5), and captures only a fraction of the genomic diversity. Here we report a substantial step towards the dense representation of avian phylogenetic and molecular diversity, by analysing 363 genomes from 92.4% of bird families, including 267 newly sequenced genomes produced for phase II of the Bird 10,000 Genomes (B10K) Project. We use this comparative genome dataset in combination with a pipeline that leverages a reference-free whole-genome alignment to identify orthologous regions in greater numbers than has previously been possible and to recognize genomic novelties in particular bird lineages. The densely sampled alignment provides a single-base-pair map of selection, has more than doubled the fraction of bases that are confidently predicted to be under conservation and reveals extensive patterns of weak selection in predominantly non-coding DNA. Our results demonstrate that increasing the diversity of genomes used in comparative studies can reveal more shared and lineage-specific variation, and improve the investigation of genomic characteristics. We anticipate that this genomic resource will offer new perspectives on evolutionary processes in cross-species comparative analyses and assist in efforts to conserve species.


Subject(s)
Birds/classification , Birds/genetics , Genome/genetics , Genomics/methods , Genomics/standards , Phylogeny , Animals , Chickens/genetics , Conservation of Natural Resources , Datasets as Topic , Finches/genetics , Humans , Selection, Genetic/genetics , Synteny/genetics
11.
Nat Biotechnol ; 38(9): 1044-1053, 2020 09.
Article in English | MEDLINE | ID: mdl-32686750

ABSTRACT

De novo assembly of a human genome using nanopore long-read sequences has been reported, but it used more than 150,000 CPU hours and weeks of wall-clock time. To enable rapid human genome assembly, we present Shasta, a de novo long-read assembler, and polishing algorithms named MarginPolish and HELEN. Using a single PromethION nanopore sequencer and our toolkit, we assembled 11 highly contiguous human genomes de novo in 9 d. We achieved roughly 63× coverage, 42-kb read N50 values and 6.5× coverage in reads >100 kb using three flow cells per sample. Shasta produced a complete haploid human genome assembly in under 6 h on a single commercial compute node. MarginPolish and HELEN polished haploid assemblies to more than 99.9% identity (Phred quality score QV = 30) with nanopore reads alone. Addition of proximity-ligation sequencing enabled near chromosome-level scaffolds for all 11 genomes. We compare our assembly performance to existing methods for diploid, haploid and trio-binned human samples and report superior accuracy and speed.
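The pairing of "99.9% identity" with "QV = 30" above follows from the Phred scale, where QV = -10 · log10(error rate). A minimal sketch of the conversion in both directions is given below; the helper names `identity_to_qv` and `qv_to_identity` are hypothetical and are not part of Shasta, MarginPolish, or HELEN.

```python
import math


def identity_to_qv(identity: float) -> float:
    """Convert a fractional identity (e.g. 0.999) to a Phred-scaled QV."""
    error_rate = 1.0 - identity
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("identity must be in the range [0, 1)")
    return -10.0 * math.log10(error_rate)


def qv_to_identity(qv: float) -> float:
    """Convert a Phred-scaled QV back to a fractional identity."""
    return 1.0 - 10.0 ** (-qv / 10.0)


# 99.9% identity corresponds to an error rate of 1 in 1,000, i.e. QV 30.
print(round(identity_to_qv(0.999), 3))  # 30.0
print(round(qv_to_identity(30), 6))     # 0.999
```

Each additional 10 QV points corresponds to a tenfold lower error rate, so QV 40 would mean 99.99% identity.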


Subject(s)
Genome, Human/genetics , High-Throughput Nucleotide Sequencing/methods , Nanopore Sequencing , Sequence Analysis, DNA/methods , Algorithms , Benchmarking , Chromosomes, Human/genetics , Deep Learning , Genomics , HLA Antigens/genetics , Haploidy , High-Throughput Nucleotide Sequencing/standards , Humans , Sequence Analysis, DNA/standards
12.
Nature ; 585(7823): 79-84, 2020 09.
Article in English | MEDLINE | ID: mdl-32663838

ABSTRACT

After two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no single chromosome has been finished end to end, and hundreds of unresolved gaps persist (1,2). Here we present a human genome assembly that surpasses the continuity of GRCh38 (2), along with a gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome (3), we reconstructed the centromeric satellite DNA array (approximately 3.1 Mb) and closed the 29 remaining gaps in the current reference, including new sequences from the human pseudoautosomal regions and from cancer-testis ampliconic gene families (CT-X and GAGE). These sequences will be integrated into future human reference genome releases. In addition, the complete chromosome X, combined with the ultra-long nanopore data, allowed us to map methylation patterns across complex tandem repeats and satellite arrays. Our results demonstrate that finishing the entire human genome is now within reach, and the data presented here will facilitate ongoing efforts to complete the other human chromosomes.


Subject(s)
Chromosomes, Human, X/genetics , Genome, Human/genetics , Telomere/genetics , Centromere/genetics , CpG Islands/genetics , DNA Methylation , DNA, Satellite/genetics , Female , Humans , Hydatidiform Mole/genetics , Male , Pregnancy , Reproducibility of Results , Testis/metabolism
13.
Viruses ; 12(5), 2020 05 23.
Article in English | MEDLINE | ID: mdl-32456246

ABSTRACT

The global spread of the parasitic mite Varroa destructor has emphasized the significance of viruses as pathogens of honey bee (Apis mellifera) populations. In particular, the association of deformed wing virus (DWV) with V. destructor and its devastating effect on honey bee colonies has led to that virus now becoming one of the most well-studied insect viruses. However, there has been no opportunity to examine the effects of Varroa mites without the influence of DWV. In Papua New Guinea (PNG), the sister species, V. jacobsoni, has emerged through a host-shift to reproduce on the local A. mellifera population. After initial colony losses, beekeepers have maintained colonies without chemicals for more than a decade, suggesting that this bee population has an unknown mite tolerance mechanism. Using high throughput sequencing (HTS) and target PCR detection, we investigated whether the viral landscape of the PNG honey bee population is the underlying factor responsible for mite tolerance. We found A. mellifera and A. cerana from PNG and nearby Solomon Islands were predominantly infected by sacbrood virus (SBV), black queen cell virus (BQCV) and Lake Sinai viruses (LSV), with no evidence for any DWV strains. V. jacobsoni was infected by several viral homologs to recently discovered V. destructor viruses, but Varroa jacobsoni rhabdovirus-1 (ARV-1 homolog) was the only virus detected in both mites and honey bees. We conclude from these findings that A. mellifera in PNG may tolerate V. jacobsoni because the damage from parasitism is significantly reduced without DWV. This study also provides further evidence that DWV does not exist as a covert infection in all honey bee populations, and remaining free of this serious viral pathogen can have important implications for bee health outcomes in the face of Varroa.


Subject(s)
Bees/parasitology , Bees/virology , Insect Viruses/isolation & purification , RNA Viruses , Varroidae , Amino Acid Sequence , Animals , Female , High-Throughput Nucleotide Sequencing , Insect Viruses/classification , Insect Viruses/genetics , Papua New Guinea , RNA Viruses/classification , RNA Viruses/genetics , RNA Viruses/isolation & purification , Sequence Alignment , Virus Diseases/diagnosis , Virus Diseases/virology
14.
Gigascience ; 9(6), 2020 06 01.
Article in English | MEDLINE | ID: mdl-32463100

ABSTRACT

BACKGROUND: Large-scale sequencing projects provide high-quality full-genome data that can be used for reconstruction of chromosomal exchanges and rearrangements that disrupt conserved syntenic blocks. The highest resolution of cross-species homology can be obtained on the basis of whole-genome, reference-free alignments. Very large multiple alignments of full-genome sequence stored in a binary format demand an accurate and efficient computational approach for synteny block production. FINDINGS: halSynteny performs efficient processing of pairwise alignment blocks for any pair of genomes in the alignment. The tool is part of the HAL comparative genomics suite and is targeted to build synteny blocks for multi-hundred-way, reference-free vertebrate alignments built with the Cactus system. CONCLUSIONS: halSynteny enables an accurate and rapid identification of synteny in multiple full-genome alignments. The method is implemented in C++11 as a component of the halTools software and released under MIT license. The package is available at https://github.com/ComparativeGenomicsToolkit/hal/.


Subject(s)
Algorithms , Computational Biology/methods , Genomics/methods , Software , Reproducibility of Results , Sequence Alignment/methods , Synteny
15.
Genome Res ; 30(1): 85-94, 2020 01.
Article in English | MEDLINE | ID: mdl-31857444

ABSTRACT

Transfer RNA (tRNA) genes are among the most highly transcribed genes in the genome owing to their central role in protein synthesis. However, there is evidence for a broad range of gene expression across tRNA loci. This complexity, combined with difficulty in measuring transcript abundance and high sequence identity across transcripts, has severely limited our collective understanding of tRNA gene expression regulation and evolution. We establish sequence-based correlates to tRNA gene expression and develop a tRNA gene classification method that does not require, but benefits from, comparative genomic information and achieves accuracy comparable to molecular assays. We observe that guanine + cytosine (G + C) content and CpG density surrounding tRNA loci are exceptionally well correlated with tRNA gene activity, supporting a prominent regulatory role of the local genomic context in combination with internal sequence features. We use our tRNA gene activity predictions in conjunction with a comprehensive tRNA gene ortholog set spanning 29 placental mammals to estimate the evolutionary rate of functional changes among orthologs. Our method adds a new dimension to large-scale tRNA functional prediction and will help prioritize characterization of functional tRNA variants. Its simplicity and robustness should enable development of similar approaches for other clades, as well as exploration of functional diversification of members of large gene families.


Subject(s)
Genome , Genomics , RNA, Transfer , Animals , Computational Biology/methods , CpG Islands , DNA Methylation , Epigenesis, Genetic , Epigenomics/methods , Genomics/methods , Mammals , Mice , Phylogeny , RNA, Transfer/genetics
16.
G3 (Bethesda) ; 9(6): 1795-1805, 2019 06 05.
Article in English | MEDLINE | ID: mdl-30996023

ABSTRACT

Isogenic laboratory mouse strains enhance reproducibility because individual animals are genetically identical. For the most widely used isogenic strain, C57BL/6, there exists a wealth of genetic, phenotypic, and genomic data, including a high-quality reference genome (GRCm38.p6). Now 20 years after the first release of the mouse reference genome, C57BL/6J mice are at least 26 inbreeding generations removed from GRCm38 and the strain is now maintained with periodic reintroduction of cryorecovered mice derived from a single breeder pair, aptly named Adam and Eve. To provide an update to the mouse reference genome that more accurately represents the genome of today's C57BL/6J mice, we took advantage of long read, short read, and optical mapping technologies to generate a de novo assembly of the C57BL/6J Eve genome (B6Eve). Using these data, we have addressed recurring variants observed in previous mouse genomic studies. We have also identified structural variations, closed gaps in the mouse reference assembly, and revealed previously unannotated coding sequences. This B6Eve assembly explains discrepant observations that have been associated with GRCm38-based analyses, and will inform a reference genome that is more representative of the C57BL/6J mice that are in use today.


Subject(s)
Genome , Genomics , Animals , Computational Biology/methods , Female , Genomics/methods , Inbreeding , Male , Mice , Mice, Inbred C57BL , Pedigree , Phenotype , Polymorphism, Single Nucleotide
17.
Annu Rev Anim Biosci ; 7: 41-64, 2019 02 15.
Article in English | MEDLINE | ID: mdl-30379572

ABSTRACT

Rapidly improving sequencing technology coupled with computational developments in sequence assembly are making reference-quality genome assembly economical. Hundreds of vertebrate genome assemblies are now publicly available, and projects are being proposed to sequence thousands of additional species in the next few years. Such dense sampling of the tree of life should give an unprecedented new understanding of evolution and allow a detailed determination of the events that led to the wealth of biodiversity around us. To gain this knowledge, these new genomes must be compared through genome alignment (at the sequence level) and comparative annotation (at the gene level). However, different alignment and annotation methods have different characteristics; before starting a comparative genomics analysis, it is important to understand the nature of, and biases and limitations inherent in, the chosen methods. This review is intended to act as a technical but high-level overview of the field that should provide this understanding. We briefly survey the state of the genome alignment and comparative annotation fields and potential future directions for these fields in a new, large-scale era of comparative genomics.


Subject(s)
Genome/genetics , Genomics , Animals , Molecular Sequence Annotation
18.
Nucleic Acids Res ; 47(D1): D766-D773, 2019 01 08.
Article in English | MEDLINE | ID: mdl-30357393

ABSTRACT

The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org.


Subject(s)
Databases, Genetic , Genome, Human/genetics , Genomics , Pseudogenes/genetics , Animals , Computational Biology , Humans , Internet , Mice , Molecular Sequence Annotation , Software
19.
Nat Genet ; 50(11): 1574-1583, 2018 11.
Article in English | MEDLINE | ID: mdl-30275530

ABSTRACT

We report full-length draft de novo genome assemblies for 16 widely used inbred mouse strains and find extensive strain-specific haplotype variation. We identify and characterize 2,567 regions on the current mouse reference genome exhibiting the greatest sequence diversity. These regions are enriched for genes involved in pathogen defence and immunity and exhibit enrichment of transposable elements and signatures of recent retrotransposition events. Combinations of alleles and genes unique to an individual strain are commonly observed at these loci, reflecting distinct strain phenotypes. We used these genomes to improve the mouse reference genome, resulting in the completion of 10 new gene structures. Also, 62 new coding loci were added to the reference genome annotation. These genomes identified a large, previously unannotated, gene (Efcab3-like) encoding 5,874 amino acids. Mutant Efcab3-like mice display anomalies in multiple brain regions, suggesting a possible role for this gene in the regulation of brain development.


Subject(s)
Chromosome Mapping , Genetic Loci , Genome , Haplotypes , Mice, Inbred Strains/genetics , Animals , Animals, Laboratory , Chromosome Mapping/veterinary , Haplotypes/genetics , Mice , Mice, Inbred BALB C/genetics , Mice, Inbred C3H/genetics , Mice, Inbred C57BL/genetics , Mice, Inbred CBA/genetics , Mice, Inbred DBA/genetics , Mice, Inbred NOD/genetics , Mice, Inbred Strains/classification , Molecular Sequence Annotation , Phylogeny , Polymorphism, Single Nucleotide , Species Specificity
20.
Genome Res ; 28(11): 1720-1732, 2018 11.
Article in English | MEDLINE | ID: mdl-30341161

ABSTRACT

Despite the rapid development of sequencing technologies, the assembly of mammalian-scale genomes into complete chromosomes remains one of the most challenging problems in bioinformatics. To help address this difficulty, we developed Ragout 2, a reference-assisted assembly tool that works for large and complex genomes. By taking one or more target assemblies (generated from an NGS assembler) and one or multiple related reference genomes, Ragout 2 infers the evolutionary relationships between the genomes and builds the final assemblies using a genome rearrangement approach. By using Ragout 2, we transformed NGS assemblies of 16 laboratory mouse strains into sets of complete chromosomes, leaving <5% of sequence unlocalized per set. Various benchmarks, including PCR testing and realigning of long Pacific Biosciences (PacBio) reads, suggest only a small number of structural errors in the final assemblies, comparable with direct assembly approaches. We applied Ragout 2 to the Mus caroli and Mus pahari genomes, which exhibit karyotype-scale variations compared with other genomes from the Muridae family. Chromosome painting maps confirmed most large-scale rearrangements that Ragout 2 detected. We applied Ragout 2 to improve draft sequences of three ape genomes that have recently been published. Ragout 2 transformed three sets of contigs (generated using PacBio reads only) into chromosome-scale assemblies with accuracy comparable to chromosome assemblies generated in the original study using BioNano maps, Hi-C, BAC clones, and FISH.


Subject(s)
Contig Mapping/methods , Whole Genome Sequencing/methods , Animals , Contig Mapping/standards , Mice , Reference Standards , Whole Genome Sequencing/standards