Results 1 - 20 of 130
1.
Cell; 187(6): 1547-1562.e13, 2024 Mar 14.
Article in English | MEDLINE | ID: mdl-38428424

ABSTRACT

Using multiple long-read sequencing technologies, we sequenced and assembled the genomes of chimpanzee, bonobo, gorilla, orangutan, gibbon, macaque, owl monkey, and marmoset. We identified 1,338,997 lineage-specific fixed structural variants (SVs) disrupting 1,561 protein-coding genes and 136,932 regulatory elements, including the most complete set of human-specific fixed differences. We estimate that 819.47 Mbp or ∼27% of the genome has been affected by SVs across primate evolution. We identify 1,607 structurally divergent regions wherein recurrent structural variation contributes to creating SV hotspots where genes are recurrently lost (e.g., CARD, C4, and OLAH gene families) and additional lineage-specific genes are generated (e.g., CKAP2, VPS36, ACBD7, and NEK5 paralogs), becoming targets of rapid chromosomal diversification and positive selection (e.g., RGPD gene family). High-fidelity long-read sequencing has made these dynamic regions of the genome accessible for sequence-level analyses within and between primate species.


Subject(s)
Genome, Primates, Animals, Humans, Base Sequence, Primates/classification, Primates/genetics, Biological Evolution, DNA Sequence Analysis, Genomic Structural Variation
2.
Nature; 617(7960): 325-334, 2023 05.
Article in English | MEDLINE | ID: mdl-37165237

ABSTRACT

Single-nucleotide variants (SNVs) in segmental duplications (SDs) have not been systematically assessed because of the limitations of mapping short-read sequencing data. Here we constructed 1:1 unambiguous alignments spanning high-identity SDs across 102 human haplotypes and compared the pattern of SNVs between unique and duplicated regions. We find that human SNVs are elevated 60% in SDs compared to unique regions and estimate that at least 23% of this increase is due to interlocus gene conversion (IGC) with up to 4.3 megabase pairs of SD sequence converted on average per human haplotype. We develop a genome-wide map of IGC donors and acceptors, including 498 acceptor and 454 donor hotspots affecting the exons of about 800 protein-coding genes. These include 171 genes that have 'relocated' on average 1.61 megabase pairs in a subset of human haplotypes. Using a coalescent framework, we show that SD regions are slightly evolutionarily older when compared to unique sequences, probably owing to IGC. SNVs in SDs, however, show a distinct mutational spectrum: a 27.1% increase in transversions that convert cytosine to guanine or the reverse across all triplet contexts and a 7.6% reduction in the frequency of CpG-associated mutations when compared to unique DNA. We reason that these distinct mutational properties help to maintain an overall higher GC content of SD DNA compared to that of unique DNA, probably driven by GC-biased conversion between paralogous sequences.


Subject(s)
Gene Conversion, Mutation, Genomic Segmental Duplications, Humans, Gene Conversion/genetics, Human Genome/genetics, Single Nucleotide Polymorphism/genetics, Haplotypes/genetics, Exons/genetics, Cytosine/chemistry, Guanine/chemistry, CpG Islands/genetics
3.
Nature; 604(7906): 437-446, 2022 04.
Article in English | MEDLINE | ID: mdl-35444317

ABSTRACT

The human reference genome is the most widely used resource in human genetics and is due for a major update. Its current structure is a linear composite of merged haplotypes from more than 20 people, with a single individual comprising most of the sequence. It contains biases and errors within a framework that does not represent global human genomic variation. A high-quality reference with global representation of common variants, including single-nucleotide variants, structural variants and functional elements, is needed. The Human Pangenome Reference Consortium aims to create a more sophisticated and complete human reference genome with a graph-based, telomere-to-telomere representation of global genomic diversity. Here we leverage innovations in technology, study design and global partnerships with the goal of constructing the highest-possible quality human pangenome reference. Our goal is to improve data representation and streamline analyses to enable routine assembly of complete diploid genomes. With attention to ethical frameworks, the human pangenome reference will contain a more accurate and diverse representation of global genomic variation, improve gene-disease association studies across populations, expand the scope of genomics research to the most repetitive and polymorphic regions of the genome, and serve as the ultimate genetic resource for future biomedical research and precision medicine.


Subject(s)
Human Genome, Genomics, Human Genome/genetics, Haplotypes/genetics, High-Throughput Nucleotide Sequencing, Humans, DNA Sequence Analysis
4.
Article in English | MEDLINE | ID: mdl-38663087

ABSTRACT

The Human Genome Project was an enormous accomplishment, providing a foundation for countless explorations into the genetics and genomics of the human species. Yet for many years, the human genome reference sequence remained incomplete and lacked representation of human genetic diversity. Recently, two major advances have emerged to address these shortcomings: complete gap-free human genome sequences, such as the one developed by the Telomere-to-Telomere Consortium, and high-quality pangenomes, such as the one developed by the Human Pangenome Reference Consortium. Facilitated by advances in long-read DNA sequencing and genome assembly algorithms, complete human genome sequences resolve regions that have been historically difficult to sequence, including centromeres, telomeres, and segmental duplications. In parallel, pangenomes capture the extensive genetic diversity across populations worldwide. Together, these advances usher in a new era of genomics research, enhancing the accuracy of genomic analysis, paving the path for precision medicine, and contributing to deeper insights into human biology.

5.
Genome Res; 34(3): 454-468, 2024 Apr 25.
Article in English | MEDLINE | ID: mdl-38627094

ABSTRACT

Reference-free genome phasing is vital for understanding allele inheritance and the impact of single-molecule DNA variation on phenotypes. To achieve thorough phasing across homozygous or repetitive regions of the genome, long-read sequencing technologies are often used to perform phased de novo assembly. As a step toward reducing the cost and complexity of this type of analysis, we describe new methods for accurately phasing Oxford Nanopore Technologies (ONT) sequence data with the Shasta genome assembler and a modular tool for extending phasing to the chromosome scale called GFAse. We test using new variants of ONT PromethION sequencing, including those using proximity ligation, and show that newer, higher accuracy ONT reads substantially improve assembly quality.


Subject(s)
Nanopores, Humans, DNA Sequence Analysis/methods, Nanopore Sequencing/methods, High-Throughput Nucleotide Sequencing/methods, Software, Genomics/methods
6.
Nature; 594(7861): 77-81, 2021 06.
Article in English | MEDLINE | ID: mdl-33953399

ABSTRACT

The divergence of chimpanzee and bonobo provides one of the few examples of recent hominid speciation. Here we describe a fully annotated, high-quality bonobo genome assembly, which was constructed without guidance from reference genomes by applying a multiplatform genomics approach. We generate a bonobo genome assembly in which more than 98% of genes are completely annotated and 99% of the gaps are closed, including the resolution of about half of the segmental duplications and almost all of the full-length mobile elements. We compare the bonobo genome to those of other great apes and identify more than 5,569 fixed structural variants that specifically distinguish the bonobo and chimpanzee lineages. We focus on genes that have been lost, changed in structure or expanded in the last few million years of bonobo evolution. We produce a high-resolution map of incomplete lineage sorting and estimate that around 5.1% of the human genome is genetically closer to chimpanzee or bonobo and that more than 36.5% of the genome shows incomplete lineage sorting if we consider a deeper phylogeny including gorilla and orangutan. We also show that 26% of the segments of incomplete lineage sorting between human and chimpanzee or human and bonobo are non-randomly distributed and that genes within these clustered segments show significant excess of amino acid replacement compared to the rest of the genome.


Subject(s)
Molecular Evolution, Genome/genetics, Genomics, Pan paniscus/genetics, Phylogeny, Animals, Eukaryotic Initiation Factor-4A/genetics, Female, Genes, Gorilla gorilla/genetics, Molecular Sequence Annotation/standards, Pan troglodytes/genetics, Pongo/genetics, Genomic Segmental Duplications, DNA Sequence Analysis
7.
Proc Natl Acad Sci U S A; 121(15): e2319506121, 2024 Apr 09.
Article in English | MEDLINE | ID: mdl-38557186

ABSTRACT

Genomes are typically mosaics of regions with different evolutionary histories. When speciation events are closely spaced in time, recombination makes the regions sharing the same history small, and the evolutionary history changes rapidly as we move along the genome. When examining rapid radiations such as the early diversification of Neoaves 66 Mya, typically no consistent history is observed across segments exceeding kilobases of the genome. Here, we report an exception. We found that a 21-Mb region in avian genomes, mapped to chicken chromosome 4, shows an extremely strong and discordance-free signal for a history different from that of the inferred species tree. Such a strong discordance-free signal, indicative of suppressed recombination across many millions of base pairs, is not observed elsewhere in the genome for any deep avian relationships. Although long regions with suppressed recombination have been documented in recently diverged species, our results pertain to relationships dating circa 65 Mya. We provide evidence that this strong signal may be due to an ancient rearrangement that blocked recombination and remained polymorphic for several million years prior to fixation. We show that the presence of this region has misled previous phylogenomic efforts with lower taxon sampling, showing the interplay between taxon and locus sampling. We predict that similar ancient rearrangements may confound phylogenetic analyses in other clades, pointing to a need for new analytical models that incorporate the possibility of such events.


Subject(s)
Biological Evolution, Genome, Animals, Phylogeny, Genome/genetics, Birds, Genetic Recombination
8.
Genome Res; 33(4): 496-510, 2023 04.
Article in English | MEDLINE | ID: mdl-37164484

ABSTRACT

There has been tremendous progress in phased genome assembly production by combining long-read data with parental information or linked-read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still generates more than 140 gaps. We perform a detailed analysis of gaps, assembly breaks, and misorientations from 182 haploid assemblies obtained from a diversity panel of 77 unique human samples. Although trio-based approaches using HiFi are the current gold standard, chromosome-wide phasing accuracy is comparable when using Strand-seq instead of parental data. Importantly, the majority of assembly gaps cluster near the largest and most identical repeats (including segmental duplications [35.4%], satellite DNA [22.3%], or regions enriched in GA/AT-rich DNA [27.4%]). Consequently, 1513 protein-coding genes overlap assembly gaps in at least one haplotype, and 231 are recurrently disrupted or missing from five or more haplotypes. Furthermore, we estimate that 6-7 Mbp of DNA are misorientated per haplotype irrespective of whether trio-free or trio-based approaches are used. Of these misorientations, 81% correspond to bona fide large inversion polymorphisms in the human species, most of which are flanked by large segmental duplications. We also identify large-scale alignment discontinuities consistent with 11.9 Mbp of deletions and 161.4 Mbp of insertions per haploid genome. Although 99% of this variation corresponds to satellite DNA, we identify 230 regions of euchromatic DNA with frequent expansions and contractions, nearly half of which overlap with 197 protein-coding genes. Such variable and incompletely assembled regions are important targets for future algorithmic development and pangenome representation.


Subject(s)
Satellite DNA, Genetic Polymorphism, Humans, Satellite DNA/genetics, Haplotypes, Genomic Segmental Duplications, DNA Sequence Analysis
9.
Nat Methods; 20(2): 239-247, 2023 02.
Article in English | MEDLINE | ID: mdl-36646895

ABSTRACT

Pangenomics is emerging as a powerful computational paradigm in bioinformatics. This field uses population-level genome reference structures, typically consisting of a sequence graph, to mitigate reference bias and facilitate analyses that were challenging with previous reference-based methods. In this work, we extend these methods into transcriptomics to analyze sequencing data using the pantranscriptome: a population-level transcriptomic reference. Our toolchain, which consists of additions to the VG toolkit and a standalone tool, RPVG, can construct spliced pangenome graphs, map RNA sequencing data to these graphs, and perform haplotype-aware expression quantification of transcripts in a pantranscriptome. We show that this workflow improves accuracy over state-of-the-art RNA sequencing mapping methods, and that it can efficiently quantify haplotype-specific transcript expression without needing to characterize the haplotypes of a sample beforehand.


Subject(s)
Computational Biology, Gene Expression Profiling, Haplotypes, Metagenomics, Transcriptome
10.
Nat Methods; 20(10): 1483-1492, 2023 10.
Article in English | MEDLINE | ID: mdl-37710018

ABSTRACT

Long-read sequencing technologies substantially overcome the limitations of short-reads but have not been considered as a feasible replacement for population-scale projects, being a combination of too expensive, not scalable enough or too error-prone. Here we develop an efficient and scalable wet lab and computational protocol, Napu, for Oxford Nanopore Technologies long-read sequencing that seeks to address those limitations. We applied our protocol to cell lines and brain tissue samples as part of a pilot project for the National Institutes of Health Center for Alzheimer's and Related Dementias. Using a single PromethION flow cell, we can detect single nucleotide polymorphisms with F1-score comparable to Illumina short-read sequencing. Small indel calling remains difficult within homopolymers and tandem repeats, but achieves good concordance to Illumina indel calls elsewhere. Further, we can discover structural variants with F1-score on par with state-of-the-art de novo assembly methods. Our protocol phases small and structural variants at megabase scales and produces highly accurate, haplotype-specific methylation calls.


Subject(s)
Human Genome, Nanopore Sequencing, Humans, DNA Sequence Analysis/methods, Haplotypes, Methylation, Pilot Projects, High-Throughput Nucleotide Sequencing/methods
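
The abstract of entry 10 compares variant callers by F1-score without defining it. As a minimal sketch, assuming the usual definition (the harmonic mean of precision and recall computed from true-positive, false-positive, and false-negative call counts), it can be written as follows. The function name and the example counts are hypothetical and are not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from variant-call counts."""
    if tp == 0:
        # No correct calls: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)  # fraction of reported calls that are correct
    recall = tp / (tp + fn)     # fraction of true variants that were reported
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only.
example = f1_score(tp=990, fp=10, fn=20)
print(f"F1 = {example:.4f}")
```

Because F1 penalizes both missed and spurious calls, a single number can be compared across sequencing platforms, which is how the abstract uses it.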
11.
Nature; 587(7833): 246-251, 2020 11.
Article in English | MEDLINE | ID: mdl-33177663

ABSTRACT

New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies. For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus, a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.


Subject(s)
Genome/genetics, Genomics/methods, Sequence Alignment/methods, Software, Vertebrates/genetics, Amnion, Animals, Computer Simulation, Genomics/standards, Haplotypes, Humans, Quality Control, Sequence Alignment/standards, Software/standards
12.
Nature; 585(7823): 79-84, 2020 09.
Article in English | MEDLINE | ID: mdl-32663838

ABSTRACT

After two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no single chromosome has been finished end to end, and hundreds of unresolved gaps persist. Here we present a human genome assembly that surpasses the continuity of GRCh38, along with a gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome, we reconstructed the centromeric satellite DNA array (approximately 3.1 Mb) and closed the 29 remaining gaps in the current reference, including new sequences from the human pseudoautosomal regions and from cancer-testis ampliconic gene families (CT-X and GAGE). These sequences will be integrated into future human reference genome releases. In addition, the complete chromosome X, combined with the ultra-long nanopore data, allowed us to map methylation patterns across complex tandem repeats and satellite arrays. Our results demonstrate that finishing the entire human genome is now within reach, and the data presented here will facilitate ongoing efforts to complete the other human chromosomes.


Subject(s)
Human X Chromosomes/genetics, Human Genome/genetics, Telomere/genetics, Centromere/genetics, CpG Islands/genetics, DNA Methylation, Satellite DNA/genetics, Female, Humans, Hydatidiform Mole/genetics, Male, Pregnancy, Reproducibility of Results, Testis/metabolism
13.
Nucleic Acids Res; 52(D1): D1082-D1088, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37953330

ABSTRACT

The UCSC Genome Browser (https://genome.ucsc.edu) is a web-based genomic visualization and analysis tool that serves data to over 7,000 distinct users per day worldwide. It provides annotation data on thousands of genome assemblies, ranging from human to SARS-CoV2. This year, we have introduced new data from the Human Pangenome Reference Consortium and on viral genomes including SARS-CoV2. We have added 1,200 new genomes to our GenArk genome system, increasing the overall diversity of our genomic representation. We have added support for nine new user-contributed track hubs to our public hub system. Additionally, we have released 29 new tracks on the human genome and 11 new tracks on the mouse genome. Collectively, these new features expand both the breadth and depth of the genomic knowledge that we share publicly with users worldwide.


Subject(s)
Genetic Databases, Genomics, Viral RNA, Animals, Humans, Mice, Human Genome, Viral Genome, Internet, Molecular Sequence Annotation, Software
14.
Genome Res; 32(5): 893-903, 2022 05.
Article in English | MEDLINE | ID: mdl-35483961

ABSTRACT

Methods that use a linear genome reference for genome sequencing data analysis are reference-biased. In the field of clinical genetics for rare diseases, a resulting reduction in genotyping accuracy in some regions has likely prevented the resolution of some cases. Pangenome graphs embed population variation into a reference structure. Although pangenome graphs have helped to reduce reference mapping bias, further performance improvements are possible. We introduce VG-Pedigree, a pedigree-aware workflow based on the pangenome-mapping tool of Giraffe and the variant calling tool DeepTrio using a specially trained model for Giraffe-based alignments. We demonstrate mapping and variant calling improvements in both single-nucleotide variants (SNVs) and insertion and deletion (indel) variants over those produced by alignments created using BWA-MEM to a linear-reference and Giraffe mapping to a pangenome graph containing data from the 1000 Genomes Project. We have also adapted and upgraded deleterious-variant (DV) detecting methods and programs into a streamlined workflow. We used these workflows in combination to detect small lists of candidate DVs among 15 family quartets and quintets of the Undiagnosed Diseases Program (UDP). All candidate DVs that were previously diagnosed using the Mendelian models covered by the previously published methods were recapitulated by these workflows. The results of these experiments indicate that a slightly greater absolute count of DVs is detected in the proband population than in their matched unaffected siblings.


Subject(s)
Genome, Single Nucleotide Polymorphism, High-Throughput Nucleotide Sequencing, INDEL Mutation, Pedigree, Software, Workflow
15.
Nat Methods; 19(6): 687-695, 2022 06.
Article in English | MEDLINE | ID: mdl-35361931

ABSTRACT

Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first telomere-to-telomere human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Although derived from highly accurate sequences, evaluation revealed evidence of small errors and structural misassemblies in the initial draft assembly. To correct these errors, we designed a new repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly quality value from 70.2 to 73.9 measured from PacBio high-fidelity and Illumina k-mers. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both high-fidelity and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.


Subject(s)
High-Throughput Nucleotide Sequencing, Nanopores, Female, Human Genome, High-Throughput Nucleotide Sequencing/methods, Humans, Pregnancy, DNA Sequence Analysis/methods, Telomere/genetics
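
The abstract of entry 15 reports an assembly quality value rising from 70.2 to 73.9 but does not define the scale. The sketch below assumes the common Phred-style convention, QV = -10 · log10(per-base error rate); that convention is an assumption here, not something the abstract states, and the function names are hypothetical.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability for a Phred-scaled quality value (assumed convention)."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Inverse mapping: Phred-scaled quality value for a per-base error probability."""
    return -10 * math.log10(error_rate)


# QVs quoted in the abstract, converted under the assumed convention.
before = qv_to_error_rate(70.2)
after = qv_to_error_rate(73.9)
fold = before / after  # how many times lower the implied error rate is

print(f"QV 70.2 -> about {before:.2e} errors per base")
print(f"QV 73.9 -> about {after:.2e} errors per base")
print(f"implied fold reduction: {fold:.2f}")
```

On a logarithmic scale like this, a gain of 10 QV units corresponds to a tenfold drop in the implied error rate, so a gain of a few units is already a substantial change.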
16.
Am J Hum Genet; 108(5): 809-824, 2021 05 06.
Article in English | MEDLINE | ID: mdl-33794196

ABSTRACT

Variable number tandem repeats (VNTRs) are composed of large tandemly repeated motifs, many of which are highly polymorphic in copy number. However, because of their large size and repetitive nature, they remain poorly studied. To investigate the regulatory potential of VNTRs, we used read-depth data from Illumina whole-genome sequencing to perform association analysis between copy number of ∼70,000 VNTRs (motif size ≥ 10 bp) with both gene expression (404 samples in 48 tissues) and DNA methylation (235 samples in peripheral blood), identifying thousands of VNTRs that are associated with local gene expression (eVNTRs) and DNA methylation levels (mVNTRs). Using an independent cohort, we validated 73%-80% of signals observed in the two discovery cohorts, while allelic analysis of VNTR length and CpG methylation in 30 Oxford Nanopore genomes gave additional support for mVNTR loci, thus providing robust evidence to support that these represent genuine associations. Further, conditional analysis indicated that many eVNTRs and mVNTRs act as QTLs independently of other local variation. We also observed strong enrichments of eVNTRs and mVNTRs for regulatory features such as enhancers and promoters. Using the Human Genome Diversity Panel, we define sets of VNTRs that show highly divergent copy numbers among human populations and show that these are enriched for regulatory effects and preferentially associate with genes that have been linked with human phenotypes through GWASs. Our study provides strong evidence supporting functional variation at thousands of VNTRs and defines candidate sets of VNTRs, copy number variation of which potentially plays a role in numerous human phenotypes.


Subject(s)
DNA Copy Number Variations/genetics, DNA Methylation, Gene Expression Regulation, Minisatellite Repeats/genetics, Quantitative Trait Loci/genetics, Adolescent, Adult, Algorithms, Child, Preschool Child, Human X Chromosomes/genetics, Cohort Studies, CpG Islands/genetics, Genetic Enhancer Elements/genetics, Female, Genome-Wide Association Study, Genotype, Humans, Infant, Newborn Infant, Male, Middle Aged, Phenotype, Genetic Promoter Regions/genetics, Young Adult
17.
Genome Res; 31(11): 2035-2049, 2021 11.
Article in English | MEDLINE | ID: mdl-34667117

ABSTRACT

Vocal learning, the ability to imitate sounds from conspecifics and the environment, is a key component of human spoken language and learned song in three independently evolved avian groups: oscine songbirds, parrots, and hummingbirds. Humans and each of these three bird clades exhibit specialized behavioral, neuroanatomical, and brain gene expression convergence related to vocal learning, speech, and song. To understand the evolutionary basis of vocal learning gene specializations and convergence, we searched for and identified accelerated genomic regions (ARs), a marker of positive selection, specific to vocal learning birds. We found avian vocal learner-specific ARs, and they were enriched in noncoding regions near genes with known speech functions or brain gene expression specializations in humans and vocal learning birds, including FOXP2, NEUROD6, ZEB2, and MEF2C, and near genes with major neurodevelopmental functions, including NR2F1, NRP2, and BCL11B. We also found enrichment near the SFARI class S genes associated with syndromic vocal communication forms of autism spectrum disorders. These findings reveal strong candidate noncoding regions near genes for the evolutionary adaptations that distinguish vocal learning species from their close vocal nonlearning relatives and provide further evidence of molecular convergence between birdsong and human spoken language.


Subject(s)
Songbirds, Speech, Animals, Brain/metabolism, Genomics, Humans, Learning, Repressor Proteins/metabolism, Songbirds/genetics, Tumor Suppressor Proteins/metabolism, Animal Vocalization
18.
Nat Methods; 18(11): 1322-1332, 2021 11.
Article in English | MEDLINE | ID: mdl-34725481

ABSTRACT

Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data have demonstrated a long read length, but current interpretation methods for their novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single-nucleotide-variant identification method at the whole-genome scale and produces high-quality single-nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with superior performance over the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio HiFi-polished).


Subject(s)
Genes, Haplotypes, High-Throughput Nucleotide Sequencing/methods, Nanopores, Single Nucleotide Polymorphism, DNA Sequence Analysis/methods, Software, Human Genome, Humans, Molecular Sequence Annotation
19.
Bioinformatics; 39(2), 2023 02 03.
Article in English | MEDLINE | ID: mdl-36749013

ABSTRACT

MOTIVATION: Pairwise sequence alignment remains a fundamental problem in computational biology and bioinformatics. Recent advances in genomics and sequencing technologies demand faster and scalable algorithms that can cope with the ever-increasing sequence lengths. Classical pairwise alignment algorithms based on dynamic programming are strongly limited by quadratic requirements in time and memory. The recently proposed wavefront alignment algorithm (WFA) introduced an efficient algorithm to perform exact gap-affine alignment in O(ns) time, where s is the optimal score and n is the sequence length. Notwithstanding these bounds, WFA's O(s^2) memory requirements become computationally impractical for genome-scale alignments, leading to a need for further improvement. RESULTS: In this article, we present the bidirectional WFA algorithm, the first gap-affine algorithm capable of computing optimal alignments in O(s) memory while retaining WFA's time complexity of O(ns). As a result, this work improves the lowest known memory bound O(n) to compute gap-affine alignments. In practice, our implementation never requires more than a few hundred MBs aligning noisy Oxford Nanopore Technologies reads up to 1 Mbp long while maintaining competitive execution times. AVAILABILITY AND IMPLEMENTATION: All code is publicly available at https://github.com/smarco/BiWFA-paper. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms, Genomics, Computational Biology, Genome, DNA Sequence Analysis, Software
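
To make the wavefront idea in entry 19 concrete, here is a deliberately simplified sketch. It computes plain unit-cost edit distance rather than the gap-affine score the abstract discusses, and it is neither the authors' WFA implementation nor the bidirectional variant. The function and variable names are hypothetical. For each score s it stores, per diagonal, the furthest position reached in the second sequence, and it extends runs of matching characters at no cost.

```python
def wavefront_edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via furthest-reaching points per diagonal."""
    n, m = len(a), len(b)
    target_diag = m - n  # diagonal k = h - v, with v indexing a and h indexing b

    def extend(k: int, h: int) -> int:
        # Slide along diagonal k for free while characters match.
        v = h - k
        while v < n and h < m and a[v] == b[h]:
            v += 1
            h += 1
        return h

    front = {0: extend(0, 0)}  # wavefront for score 0
    score = 0
    while front.get(target_diag, -1) < m:
        score += 1
        nxt = {}
        for k in range(min(front) - 1, max(front) + 2):
            candidates = []
            if k in front:
                candidates.append(front[k] + 1)      # mismatch: advance in both
            if k - 1 in front:
                candidates.append(front[k - 1] + 1)  # insertion: advance in b only
            if k + 1 in front:
                candidates.append(front[k + 1])      # deletion: advance in a only
            # Keep only offsets that stay inside both sequences.
            valid = [h for h in candidates if 0 <= h <= m and 0 <= h - k <= n]
            if valid:
                nxt[k] = extend(k, max(valid))
        front = nxt  # only the latest wavefront is needed to compute the score
    return score


print(wavefront_edit_distance("kitten", "sitting"))
```

Only the most recent wavefront is retained because this sketch returns the score alone; recovering the alignment itself would require keeping earlier wavefronts, which is the memory cost the abstract attributes to WFA.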
20.
Plant Physiol; 191(1): 35-46, 2023 01 02.
Article in English | MEDLINE | ID: mdl-36200899

ABSTRACT

We review how a data infrastructure for the Plant Cell Atlas might be built using existing infrastructure and platforms. The Human Cell Atlas has developed an extensive infrastructure for human and mouse single cell data, while the European Bioinformatics Institute has developed a Single Cell Expression Atlas, that currently houses several plant data sets. We discuss issues related to appropriate ontologies for describing a plant single cell experiment. We imagine how such an infrastructure will enable biologists and data scientists to glean new insights into plant biology in the coming decades, as long as such data are made accessible to the community in an open manner.


Subject(s)
Computational Biology, Plant Cells, Animals, Humans, Mice, Plants/genetics