Results 1 - 20 of 33
1.
Nature ; 592(7856): 737-746, 2021 04.
Article in English | MEDLINE | ID: mdl-33911273

ABSTRACT

High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species [1-4]. To address this issue, the international Genome 10K (G10K) consortium [5,6] has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.


Subjects
Genome, Genomics/methods, Vertebrates/genetics, Animals, Birds, Gene Library, Genome Size, Mitochondrial Genome, Haplotypes, High-Throughput Nucleotide Sequencing, Molecular Sequence Annotation, Sequence Alignment, DNA Sequence Analysis, Sex Chromosomes/genetics
2.
Mol Biol Evol ; 40(5)2023 05 02.
Article in English | MEDLINE | ID: mdl-37194566

ABSTRACT

We present genome sequences for the caecilians Geotrypetes seraphini (3.8 Gb) and Microcaecilia unicolor (4.7 Gb), representatives of a limbless, mostly soil-dwelling amphibian clade with reduced eyes and unique, putatively chemosensory tentacles. More than 69% of both genomes are composed of repeats, with retrotransposons being the most abundant. We identify 1,150 orthogroups that are unique to caecilians and enriched for functions in olfaction and detection of chemical signals. There are 379 orthogroups with signatures of positive selection on caecilian lineages, with roles in organ development and morphogenesis, sensory perception, and immunity, amongst others. We discover that caecilian genomes are missing the zone of polarizing activity regulatory sequence (ZRS) enhancer of Sonic Hedgehog, which is also mutated in snakes. In vivo deletions have shown that the ZRS is required for limb development in mice, thus revealing a shared molecular target implicated in the independent evolution of limblessness in snakes and caecilians.


Subjects
Amphibians, Hedgehog Proteins, Animals, Mice, Hedgehog Proteins/genetics, Amphibians/genetics, Genome, Snakes/genetics, Acclimatization, Molecular Evolution
3.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36525368

ABSTRACT

SUMMARY: We present YaHS, a user-friendly command-line tool for the construction of chromosome-scale scaffolds from Hi-C data. It can be run with a single-line command, requires minimal input from users (an assembly file and an alignment file) which is compatible with similar tools and provides assembly results in multiple formats, thereby enabling rapid, robust and scalable construction of high-quality genome assemblies with high accuracy and contiguity. AVAILABILITY AND IMPLEMENTATION: YaHS is implemented in C and licensed under the MIT License. The source code, documentation and tutorial are available at https://github.com/sanger-tol/yahs. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Documentation, Software
4.
BMC Bioinformatics ; 24(1): 288, 2023 Jul 18.
Article in English | MEDLINE | ID: mdl-37464285

ABSTRACT

BACKGROUND: PacBio high fidelity (HiFi) sequencing reads are both long (15-20 kb) and highly accurate (> Q20). Because of these properties, they have revolutionised genome assembly, leading to more accurate and contiguous genomes. In eukaryotes the mitochondrial genome is sequenced alongside the nuclear genome, often at very high coverage. A dedicated tool for mitochondrial genome assembly using HiFi reads is still missing. RESULTS: MitoHiFi was developed within the Darwin Tree of Life Project to assemble mitochondrial genomes from the HiFi reads generated for target species. The input for MitoHiFi is either the raw reads or the assembled contigs, and the tool outputs a mitochondrial genome sequence FASTA file along with annotation of protein and RNA genes. Variants arising from heteroplasmy are assembled independently, and nuclear insertions of mitochondrial sequences are identified and not used in organellar genome assembly. MitoHiFi has been used to assemble 374 mitochondrial genomes (368 Metazoa and 6 Fungi species) for the Darwin Tree of Life Project, the Vertebrate Genomes Project and the Aquatic Symbiosis Genome Project. Inspection of 60 mitochondrial genomes assembled with MitoHiFi for species that already have reference sequences in public databases showed the widespread presence of previously unreported repeats. CONCLUSIONS: MitoHiFi is able to assemble mitochondrial genomes from a wide phylogenetic range of taxa from PacBio HiFi data. MitoHiFi is written in Python and is freely available on GitHub (https://github.com/marcelauliano/MitoHiFi). MitoHiFi is available with its dependencies as a Docker container on GitHub (ghcr.io/marcelauliano/mitohifi:master).


Subjects
Mitochondrial Genome, Phylogeny, RNA, Eukaryotes, DNA Sequence Analysis, High-Throughput Nucleotide Sequencing
6.
Nature ; 546(7658): 370-375, 2017 06 15.
Article in English | MEDLINE | ID: mdl-28489815

ABSTRACT

Technology utilizing human induced pluripotent stem cells (iPS cells) has enormous potential to provide improved cellular models of human disease. However, variable genetic and phenotypic characterization of many existing iPS cell lines limits their potential use for research and therapy. Here we describe the systematic generation, genotyping and phenotyping of 711 iPS cell lines derived from 301 healthy individuals by the Human Induced Pluripotent Stem Cells Initiative. Our study outlines the major sources of genetic and phenotypic variation in iPS cells and establishes their suitability as models of complex human traits and cancer. Through genome-wide profiling we find that 5-46% of the variation in different iPS cell phenotypes, including differentiation capacity and cellular morphology, arises from differences between individuals. Additionally, we assess the phenotypic consequences of genomic copy-number alterations that are repeatedly observed in iPS cells. In addition, we present a comprehensive map of common regulatory variants affecting the transcriptome of human pluripotent cells.


Subjects
Genetic Variation/genetics, Induced Pluripotent Stem Cells/metabolism, Cultured Cells, Cellular Reprogramming/genetics, DNA Copy Number Variations/genetics, Gene Expression Regulation/genetics, Genotype, Humans, Organ Specificity, Phenotype, Quality Control, Quantitative Trait Loci/genetics, Transcriptome/genetics
7.
BMC Bioinformatics ; 22(1): 569, 2021 Nov 27.
Article in English | MEDLINE | ID: mdl-34837944

ABSTRACT

BACKGROUND: Efficient and effective genome scaffolding tools are still in high demand for generating reference-quality assemblies. While long read data itself is unlikely to create a chromosome-scale assembly for most eukaryotic species, the inexpensive Hi-C sequencing technology, capable of capturing the chromosomal profile of a genome, is now widely used to complete the task. However, the existing Hi-C based scaffolding tools either require the chromosome number a priori as input, or lack the ability to build highly continuous scaffolds. RESULTS: We design and develop a novel Hi-C based scaffolding tool, pin_hic, which takes advantage of contact information from Hi-C reads to construct a scaffolding graph iteratively based on N-best neighbors of contigs. Subsequent to scaffolding, it identifies potential misjoins and breaks them to maintain scaffolding accuracy. Through our tests on three long read based de novo assemblies from three different species, we demonstrate that pin_hic is more efficient than current standard state-of-the-art tools, and it can generate much more continuous scaffolds, while achieving a higher or comparable accuracy. CONCLUSIONS: Pin_hic is an efficient Hi-C based scaffolding tool, which can be useful for building chromosome-scale assemblies. As many sequencing projects have been launched in recent years, we believe pin_hic has the potential to be applied in these projects and make a meaningful contribution.
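
The abstract above mentions building a scaffolding graph from the N-best neighbors of each contig. The following Python sketch illustrates one possible reading of that idea and is not pin_hic's actual implementation: it assumes Hi-C contact counts are already summarised per contig pair, and it keeps an edge only when two contigs rank within each other's top N neighbors. The function name, the mutual-neighbor rule, and the toy counts are all assumptions made for illustration.

```python
from collections import defaultdict


def n_best_neighbor_edges(contacts, n_best=2):
    """Return contig pairs that rank within each other's top-n neighbors.

    contacts maps an unordered contig pair (a, b) to a Hi-C contact count.
    The mutual top-n rule is an illustrative assumption, not pin_hic's rule.
    """
    neighbors = defaultdict(list)
    for (a, b), count in contacts.items():
        neighbors[a].append((count, b))
        neighbors[b].append((count, a))

    # For every contig, keep the names of its n strongest neighbors.
    top = {
        contig: {name for _, name in sorted(links, reverse=True)[:n_best]}
        for contig, links in neighbors.items()
    }

    edges = set()
    for contig, best in top.items():
        for other in best:
            if contig in top[other]:
                edges.add(tuple(sorted((contig, other))))
    return sorted(edges)


# Toy contact counts between contig pairs (made-up numbers).
toy_contacts = {
    ("ctg1", "ctg2"): 90,
    ("ctg2", "ctg3"): 80,
    ("ctg1", "ctg3"): 5,
    ("ctg3", "ctg4"): 70,
    ("ctg1", "ctg4"): 2,
}

if __name__ == "__main__":
    print(n_best_neighbor_edges(toy_contacts, n_best=2))
```

With the toy data, weak long-range contacts such as ctg1 to ctg4 are dropped and the remaining edges form a simple chain, which is the kind of structure a scaffolder could then order and orient.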


Subjects
Genome, Genomics, Chromosomes/genetics, High-Throughput Nucleotide Sequencing, DNA Sequence Analysis
8.
Bioinformatics ; 36(9): 2896-2898, 2020 05 01.
Article in English | MEDLINE | ID: mdl-31971576

ABSTRACT

MOTIVATION: Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors. RESULTS: Here we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines. AVAILABILITY AND IMPLEMENTATION: The source code is written in C and is available at https://github.com/dfguan/purge_dups. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
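
The abstract above says that purge_dups combines sequence similarity with read depth. The Python sketch below shows how those two signals could be combined in principle; it is a simplified illustration under stated assumptions, not the purge_dups algorithm. The depth cutoffs, the function names, and the rule of flagging the shorter contig of a similar pair are all hypothetical.

```python
def depth_class(mean_depth, low=5.0, mid=45.0, high=180.0):
    """Bin a contig by mean read depth using hypothetical cutoffs."""
    if mean_depth < low:
        return "junk"
    if mean_depth < mid:
        return "haploid"
    if mean_depth <= high:
        return "diploid"
    return "repeat"


def candidate_duplicates(depth, length, similar_pairs):
    """Flag the shorter contig of each similar pair when both sit at haploid depth.

    depth and length map contig names to mean read depth and length in bp;
    similar_pairs lists contig pairs already found to share sequence.
    """
    flagged = set()
    for a, b in similar_pairs:
        if depth_class(depth[a]) == "haploid" and depth_class(depth[b]) == "haploid":
            flagged.add(a if length[a] < length[b] else b)
    return sorted(flagged)


# Toy data: c1/c2 look like two haplotypes of one region, c3/c4 do not.
toy_depth = {"c1": 30.0, "c2": 28.0, "c3": 60.0, "c4": 61.0}
toy_length = {"c1": 5_000_000, "c2": 400_000, "c3": 3_000_000, "c4": 200_000}
toy_pairs = [("c1", "c2"), ("c3", "c4")]

if __name__ == "__main__":
    print(candidate_duplicates(toy_depth, toy_length, toy_pairs))
```

The point of the sketch is that similarity alone is not enough: a similar pair at full diploid depth is left alone, while a similar pair at roughly half depth is a plausible haplotype duplication.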


Subjects
High-Throughput Nucleotide Sequencing, Software, Genome, Haplotypes, DNA Sequence Analysis
9.
Genome Res ; 27(2): 300-309, 2017 02.
Article in English | MEDLINE | ID: mdl-27986821

ABSTRACT

We are rapidly approaching the point where we have sequenced millions of human genomes. There is a pressing need for new data structures to store raw sequencing data and efficient algorithms for population scale analysis. Current reference-based data formats do not fully exploit the redundancy in population sequencing nor take advantage of shared genetic variation. In recent years, the Burrows-Wheeler transform (BWT) and FM-index have been widely employed as a full-text searchable index for read alignment and de novo assembly. We introduce the concept of a population BWT and use it to store and index the sequencing reads of 2705 samples from the 1000 Genomes Project. A key feature is that, as more genomes are added, identical read sequences are increasingly observed, and compression becomes more efficient. We assess the support in the 1000 Genomes read data for every base position of two human reference assembly versions, identifying that 3.2 Mbp with population support was lost in the transition from GRCh37 with 13.7 Mbp added to GRCh38. We show that the vast majority of variant alleles can be uniquely described by overlapping 31-mers and show how rapid and accurate SNP and indel genotyping can be carried out across the genomes in the population BWT. We use the population BWT to carry out nonreference queries to search for the presence of all known viral genomes and discover human T-lymphotropic virus 1 integrations in six samples in a recognized epidemiological distribution.
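
For readers unfamiliar with the Burrows-Wheeler transform and FM-index mentioned above, here is a minimal Python sketch of both on a single short string. It is a textbook-style illustration, not the population BWT described in the abstract: it assumes one text terminated by "$", builds the transform by sorting rotations (impractical at genome scale), and counts pattern occurrences by backward search. The function names are chosen for this example.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (fine for toy inputs only)."""
    text += "$"  # sentinel that sorts before every letter
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count occurrences of pattern in the original text by FM-index backward search."""
    # c_table[ch] = number of characters in the text that sort before ch.
    c_table = {}
    for index, ch in enumerate(sorted(bwt_string)):
        c_table.setdefault(ch, index)

    def occ(ch, end):
        # Occurrences of ch in bwt_string[:end]; a real index precomputes this.
        return bwt_string[:end].count(ch)

    lo, hi = 0, len(bwt_string)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ(ch, lo)
        hi = c_table[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, count_occurrences(transformed, "ana"))
```

Runs of identical characters in the transformed string are what make it compressible, which is consistent with the abstract's observation that compression improves as more similar reads are added.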


Subjects
Human Genome/genetics, Genomics, Sequence Alignment/methods, Whole Genome Sequencing/methods, Alleles, Data Compression, Genotype, Humans, INDEL Mutation/genetics, DNA Sequence Analysis, Software
10.
Bioinformatics ; 35(2): 337-339, 2019 01 15.
Article in English | MEDLINE | ID: mdl-29992288

ABSTRACT

Motivation: The bulk of space taken up by NGS CRAM files consists of per-base quality values. Most of these are unnecessary for variant calling, offering an opportunity for space saving. Results: On the Syndip test set, a 17-fold reduction in the quality storage portion of a CRAM file can be achieved while maintaining variant calling accuracy. The size reduction of an entire CRAM file varied from 2.2- to 7.4-fold, depending on the non-quality content of the original file (see Supplementary Material S6 for details). Availability and implementation: Crumble is open source and can be obtained from https://github.com/jkbonfield/crumble. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Data Compression, High-Throughput Nucleotide Sequencing
11.
Bioinformatics ; 33(13): 2037-2039, 2017 Jul 01.
Article in English | MEDLINE | ID: mdl-28205675

ABSTRACT

MOTIVATION: Prediction of functional variant consequences is an important part of sequencing pipelines, allowing the categorization and prioritization of genetic variants for follow-up analysis. However, current predictors analyze variants as isolated events, which can lead to incorrect predictions when adjacent variants alter the same codon, or when a frame-shifting indel is followed by a frame-restoring indel. Exploiting known haplotype information when making consequence predictions can resolve these issues. RESULTS: BCFtools/csq is a fast program for haplotype-aware consequence calling which can take into account known phase. Consequence predictions are changed for 501 of 5019 compound variants found in the 81.7M variants in the 1000 Genomes Project data, with an average of 139 compound variants per haplotype. Predictions match existing tools when run in localized mode, but the program is an order of magnitude faster and requires an order of magnitude less memory. AVAILABILITY AND IMPLEMENTATION: The program is freely available for commercial and non-commercial use in the BCFtools package, which is available for download from http://samtools.github.io/bcftools. CONTACT: pd3@sanger.ac.uk. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
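
The problem described above, where adjacent variants in one codon are mispredicted when analysed in isolation, can be shown with a small worked example. The Python sketch below is an illustration only and does not reproduce BCFtools/csq; it uses the standard genetic code and a made-up pair of variants on the codon CAG, with helper names chosen for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def apply_variants(codon, variants):
    """Apply (offset, alternate_base) substitutions to a three-base codon."""
    bases = list(codon)
    for offset, alt in variants:
        bases[offset] = alt
    return "".join(bases)


def translate(codon):
    """Translate one codon; '*' denotes a stop codon."""
    return CODON_TABLE[codon]


# Hypothetical example: two SNVs on the same haplotype hit the codon CAG (Gln).
reference = "CAG"
first = (0, "T")   # C>T at codon position 1
second = (1, "G")  # A>G at codon position 2

if __name__ == "__main__":
    print("reference:", translate(reference))
    print("first alone:", translate(apply_variants(reference, [first])))
    print("second alone:", translate(apply_variants(reference, [second])))
    print("both in phase:", translate(apply_variants(reference, [first, second])))
```

Taken alone, the first variant looks like a stop gain (TAG) and the second like a Gln to Arg change (CGG), but on the same haplotype the codon becomes TGG, a Gln to Trp missense change. This is why known phase matters for consequence prediction.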


Subjects
Genetic Variation, Human Genome, Haplotypes, DNA Sequence Analysis/methods, Software, Algorithms, Genomics/methods, Humans, INDEL Mutation
13.
Wellcome Open Res ; 8: 74, 2023.
Article in English | MEDLINE | ID: mdl-37424773

ABSTRACT

We present a genome assembly from an individual female Anopheles gambiae (the malaria mosquito; Arthropoda; Insecta; Diptera; Culicidae), Ifakara strain. The genome sequence is 264 megabases in span. Most of the assembly is scaffolded into three chromosomal pseudomolecules with the X sex chromosome assembled. The complete mitochondrial genome was also assembled and is 15.4 kilobases in length.

14.
Wellcome Open Res ; 8: 507, 2023.
Article in English | MEDLINE | ID: mdl-38046191

ABSTRACT

We present a genome assembly from an individual male Anopheles moucheti (the malaria mosquito; Arthropoda; Insecta; Diptera; Culicidae), from a wild population in Cameroon. The genome sequence is 271 megabases in span. The majority of the assembly is scaffolded into three chromosomal pseudomolecules with the X sex chromosome assembled. The complete mitochondrial genome was also assembled and is 15.5 kilobases in length.

15.
Nat Commun ; 14(1): 3412, 2023 06 09.
Article in English | MEDLINE | ID: mdl-37296119

ABSTRACT

Numerous novel adaptations characterise the radiation of notothenioids, the dominant fish group in the freezing seas of the Southern Ocean. To improve understanding of the evolution of this iconic fish group, here we generate and analyse new genome assemblies for 24 species covering all major subgroups of the radiation, including five long-read assemblies. We present a new estimate for the onset of the radiation at 10.7 million years ago, based on a time-calibrated phylogeny derived from genome-wide sequence data. We identify a two-fold variation in genome size, driven by expansion of multiple transposable element families, and use the long-read data to reconstruct two evolutionarily important, highly repetitive gene family loci. First, we present the most complete reconstruction to date of the antifreeze glycoprotein gene family, whose emergence enabled survival in sub-zero temperatures, showing the expansion of the antifreeze gene locus from the ancestral to the derived state. Second, we trace the loss of haemoglobin genes in icefishes, the only vertebrates lacking functional haemoglobins, through complete reconstruction of the two haemoglobin gene clusters across notothenioid families. Both the haemoglobin and antifreeze genomic loci are characterised by multiple transposon expansions that may have driven the evolutionary history of these genes.


Subjects
Fishes, Perciformes, Animals, Fishes/genetics, Genomics, Vertebrates, Phylogeny, Hemoglobins/genetics, Antarctic Regions
16.
Gigascience ; 10(2)2021 02 16.
Article in English | MEDLINE | ID: mdl-33590861

ABSTRACT

BACKGROUND: SAMtools and BCFtools are widely used programs for processing and analysing high-throughput sequencing data. They include tools for file format conversion and manipulation, sorting, querying, statistics, variant calling, and effect analysis amongst other methods. FINDINGS: The first version appeared online 12 years ago and has been maintained and further developed ever since, with many new features and improvements added over the years. The SAMtools and BCFtools packages represent a unique collection of tools that have been used in numerous other software projects and countless genomic pipelines. CONCLUSION: Both SAMtools and BCFtools are freely available on GitHub under the permissive MIT licence, free for both non-commercial and commercial use. Both packages have been installed >1 million times via Bioconda. The source code and documentation are available from https://www.htslib.org.


Subjects
High-Throughput Nucleotide Sequencing, Software, Genome, Genomics
17.
G3 (Bethesda) ; 11(5)2021 05 07.
Article in English | MEDLINE | ID: mdl-33734373

ABSTRACT

Hermetia illucens L. (Diptera: Stratiomyidae), the Black Soldier Fly (BSF), is an increasingly important species for bioconversion of organic material into animal feed. We generated a high-quality chromosome-scale genome assembly of the BSF using Pacific Biosciences, 10X Genomics linked read and high-throughput chromosome conformation capture sequencing technology. Scaffolding the final assembly with Hi-C data produced a highly contiguous 1.01 Gb genome with 99.75% of scaffolds assembled into pseudochromosomes representing seven chromosomes with 16.01 Mb contig and 180.46 Mb scaffold N50 values. The highly complete genome obtained a Benchmarking Universal Single-Copy Orthologs (BUSCO) completeness of 98.6%. We masked 67.32% of the genome as repetitive sequences and annotated a total of 16,478 protein-coding genes using the BRAKER2 pipeline. We analyzed an established lab population to investigate the genomic variation and architecture of the BSF, revealing six autosomes and an X chromosome. Additionally, we estimated the inbreeding coefficient (1.9%) of the lab population by assessing runs of homozygosity. This provided evidence for inbreeding events, including long runs of homozygosity on chromosome 5. The release of this novel chromosome-scale BSF genome assembly will provide an improved resource for further genomic studies, functional characterization of genes of interest and genetic modification of this economically important species.
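
The contig and scaffold N50 values quoted above follow a standard definition: the length L such that sequences of length L or longer together cover at least half of the total assembly. A short Python sketch of that calculation is given here for reference; the function name and the toy lengths are illustrative and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    at which the running total first reaches half of the assembly size.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no sequence lengths given")


# Toy contig lengths in arbitrary units (made-up numbers).
toy_lengths = [80, 70, 50, 40, 30, 20]

if __name__ == "__main__":
    print(n50(toy_lengths))
```

For the toy lengths the total is 290, and the two longest sequences (80 + 70 = 150) already cover half of it, so the N50 is 70.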


Subjects
Chromosomes, Diptera, Animals, Chromosomes/genetics, Diptera/genetics, Genome, Genomics, Nucleic Acid Repetitive Sequences
18.
Wellcome Open Res ; 6: 225, 2021.
Article in English | MEDLINE | ID: mdl-34703904

ABSTRACT

We present a genome assembly from a clonal population of Eimeria tenella Houghton parasites (Apicomplexa; Conoidasida; Eucoccidiorida; Eimeriidae). The genome sequence is 53.25 megabases in span. The entire assembly is scaffolded into 15 chromosomal pseudomolecules, with complete mitochondrion and apicoplast organellar genomes also present.

19.
Wellcome Open Res ; 6: 162, 2021.
Article in English | MEDLINE | ID: mdl-35600244

ABSTRACT

We present a genome assembly from an individual male Arvicola amphibius (the European water vole; Chordata; Mammalia; Rodentia; Cricetidae). The genome sequence is 2.30 gigabases in span. The majority of the assembly is scaffolded into 18 chromosomal pseudomolecules, including the X sex chromosome. Gene annotation of this assembly on Ensembl has identified 21,394 protein coding genes.

20.
Wellcome Open Res ; 6: 118, 2021.
Article in English | MEDLINE | ID: mdl-34660910

ABSTRACT

We present a genome assembly from an individual male Rattus norvegicus (the Norway rat; Chordata; Mammalia; Rodentia; Muridae). The genome sequence is 2.44 gigabases in span. The majority of the assembly is scaffolded into 20 chromosomal pseudomolecules, with both X and Y sex chromosomes assembled. This genome assembly, mRatBN7.2, represents the new reference genome for R. norvegicus and has been adopted by the Genome Reference Consortium.
