Search | BVS - MINISTÉRIO DA SAÚDE

A critical comparison of technologies for a plant genome sequencing project.

Paajanen, Pirita; Kettleborough, George; López-Girona, Elena; Giolai, Michael; Heavens, Darren; Baker, David; Lister, Ashleigh; Cugliandolo, Fiorella; Wilde, Gail; Hein, Ingo; Macaulay, Iain; Bryan, Glenn J; Clark, Matthew D.

Gigascience ; 8(3), 2019 03 01.

Article in English | MEDLINE | ID: mdl-30624602

ABSTRACT

BACKGROUND: A high-quality genome sequence of any model organism is an essential starting point for genetic and other studies. Older clone-based methods are slow and expensive, whereas faster, cheaper short-read-only assemblies can be incomplete and highly fragmented, which minimizes their usefulness. The last few years have seen the introduction of many new technologies for genome assembly. These new technologies and associated new algorithms are typically benchmarked on microbial genomes or, if they scale appropriately, on larger (e.g., human) genomes. However, plant genomes can be much more repetitive and larger than the human genome, and plant biochemistry often makes obtaining high-quality DNA that is free from contaminants difficult. Reflecting their challenging nature, we observe that plant genome assembly statistics are typically poorer than for vertebrates. RESULTS: Here, we compare Illumina short read, Pacific Biosciences long read, 10x Genomics linked reads, Dovetail Hi-C, and BioNano Genomics optical maps, singly and combined, in producing high-quality long-range genome assemblies of the potato species Solanum verrucosum. We benchmark the assemblies for completeness and accuracy, as well as DNA, compute requirements, and sequencing costs. CONCLUSIONS: The field of genome sequencing and assembly is reaching maturity, and the differences we observe between assemblies are surprisingly small. We expect that our results will be helpful to other genome projects, and that these datasets will be used in benchmarking by assembly algorithm developers.

Subjects

Genome, Plant ; Genomics/methods ; Sequence Analysis, DNA/methods ; Contig Mapping ; Costs and Cost Analysis ; Genes, Plant ; Genomics/economics ; Humans ; RNA, Messenger/genetics ; RNA, Messenger/metabolism ; Sequence Analysis, DNA/economics ; Solanaceae/genetics

Independent assessment and improvement of wheat genome sequence assemblies using Fosill jumping libraries.

Lu, Fu-Hao; McKenzie, Neil; Kettleborough, George; Heavens, Darren; Clark, Matthew D; Bevan, Michael W.

Gigascience ; 7(5), 2018 05 01.

Article in English | MEDLINE | ID: mdl-29762659

ABSTRACT

Background: The accurate sequencing and assembly of very large, often polyploid, genomes remains a challenging task, limiting long-range sequence information and phased sequence variation for applications such as plant breeding. The 15-Gb hexaploid bread wheat (Triticum aestivum) genome has been particularly challenging to sequence, and several different approaches have recently generated long-range assemblies. Mapping and understanding the types of assembly errors are important for optimising future sequencing and assembly approaches and for comparative genomics. Results: Here we use a Fosill 38-kb jumping library to assess medium and longer-range order of different publicly available wheat genome assemblies. Modifications to the Fosill protocol generated longer Illumina sequences and enabled comprehensive genome coverage. Analyses of two independent Bacterial Artificial Chromosome (BAC)-based chromosome-scale assemblies, two independent Illumina whole genome shotgun assemblies, and a hybrid Single Molecule Real Time (SMRT-PacBio) and short read (Illumina) assembly were carried out. We revealed a surprising scale and variety of discrepancies using Fosill mate-pair mapping and validated several of each class. In addition, Fosill mate-pairs were used to scaffold a whole genome Illumina assembly, leading to a 3-fold increase in N50 values. Conclusions: Our analyses, using an independent means to validate different wheat genome assemblies, show that whole genome shotgun assemblies based solely on Illumina sequences are significantly more accurate by all measures compared to BAC-based chromosome-scale assemblies and hybrid SMRT-Illumina approaches. Although current whole genome assemblies are reasonably accurate and useful, additional improvements will be needed to generate complete assemblies of wheat genomes using open-source, computationally efficient, and cost-effective methods.

Subjects

Gene Library ; Genome, Plant ; Sequence Analysis, DNA/methods ; Triticum/genetics ; Chromosomes, Artificial, Bacterial/genetics ; Chromosomes, Plant/genetics ; Contig Mapping

An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations.

Clavijo, Bernardo J; Venturini, Luca; Schudoma, Christian; Accinelli, Gonzalo Garcia; Kaithakottil, Gemy; Wright, Jonathan; Borrill, Philippa; Kettleborough, George; Heavens, Darren; Chapman, Helen; Lipscombe, James; Barker, Tom; Lu, Fu-Hao; McKenzie, Neil; Raats, Dina; Ramirez-Gonzalez, Ricardo H; Coince, Aurore; Peel, Ned; Percival-Alwyn, Lawrence; Duncan, Owen; Trösch, Josua; Yu, Guotai; Bolser, Dan M; Namaati, Guy; Kerhornou, Arnaud; Spannagl, Manuel; Gundlach, Heidrun; Haberer, Georg; Davey, Robert P; Fosker, Christine; Palma, Federica Di; Phillips, Andrew L; Millar, A Harvey; Kersey, Paul J; Uauy, Cristobal; Krasileva, Ksenia V; Swarbreck, David; Bevan, Michael W; Clark, Matthew D.

Genome Res ; 27(5): 885-896, 2017 05.

Article in English | MEDLINE | ID: mdl-28420692

ABSTRACT

Advances in genome sequencing and assembly technologies are generating many high-quality genome sequences, but assemblies of large, repeat-rich polyploid genomes, such as that of bread wheat, remain fragmented and incomplete. We have generated a new wheat whole-genome shotgun sequence assembly using a combination of optimized data types and an assembly algorithm designed to deal with large and complex genomes. The new assembly represents >78% of the genome with a scaffold N50 of 88.8 kb that has a high fidelity to the input data. Our new annotation combines strand-specific Illumina RNA-seq and Pacific Biosciences (PacBio) full-length cDNAs to identify 104,091 high-confidence protein-coding genes and 10,156 noncoding RNA genes. We confirmed three known genome rearrangements and identified one novel rearrangement. Our approach enables the rapid and scalable assembly of wheat genomes, the identification of structural variants, and the definition of complete gene models, all powerful resources for trait analysis and breeding of this key global crop.
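Note on the N50 statistic quoted above: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled length. The following is a minimal sketch of that standard definition, not code from the paper; the function name `n50` and the toy lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Hypothetical helper for illustration: walks the lengths from longest
    to shortest and returns the length at which the running sum first
    reaches half of the total assembled length.
    """
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (made-up lengths, not data from the paper).
contigs = [2, 3, 4, 5, 6]
print(n50(contigs))

# Joining the two longest pieces into one scaffold raises the N50,
# which is why scaffolding steps are often reported as N50 gains.
scaffolds = [6 + 5, 4, 3, 2]
print(n50(scaffolds))
```

Because N50 depends only on the length distribution, it measures contiguity and says nothing about whether the joins are correct.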

Subjects

Contig Mapping/methods ; Genome, Plant ; Molecular Sequence Annotation/methods ; Plant Proteins/genetics ; Translocation, Genetic ; Triticum/genetics ; Algorithms ; Contig Mapping/standards ; Molecular Sequence Annotation/standards ; Polymorphism, Genetic ; Polyploidy

KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies.

Mapleson, Daniel; Garcia Accinelli, Gonzalo; Kettleborough, George; Wright, Jonathan; Clavijo, Bernardo J.

Bioinformatics ; 33(4): 574-576, 2017 02 15.

Article in English | MEDLINE | ID: mdl-27797770

ABSTRACT

Motivation: De novo assembly of whole genome shotgun (WGS) next-generation sequencing (NGS) data benefits from high-quality input with high coverage. However, in practice, determining the quality and quantity of useful reads quickly and in a reference-free manner is not trivial. Gaining a better understanding of the WGS data, and how that data is utilized by assemblers, provides useful insights that can inform the assembly process and result in better assemblies. Results: We present the K-mer Analysis Toolkit (KAT): a multi-purpose software toolkit for reference-free quality control (QC) of WGS reads and de novo genome assemblies, primarily via their k-mer frequencies and GC composition. KAT enables users to assess levels of errors, bias and contamination at various stages of the assembly process. In this paper we highlight KAT's ability to provide valuable insights into assembly composition and quality of genome assemblies through pairwise comparison of k-mers present in both input reads and the assemblies. Availability and Implementation: KAT is available under the GPLv3 license at: https://github.com/TGAC/KAT . Contact: bernardo.clavijo@earlham.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
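Note on the k-mer comparison described above: the idea of pairing each k-mer's frequency in the reads with its copy number in the assembly can be illustrated with a small sketch. This is not KAT's implementation or API; the function names are hypothetical, the sequences are toy data, and for simplicity the sketch does not merge a k-mer with its reverse complement.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer of length k across the given sequences.

    Hypothetical helper: k-mers containing characters other than
    A, C, G, T (for example N) are skipped.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def kmer_spectrum(counts):
    """Map each frequency to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


def compare_to_assembly(read_counts, assembly_counts):
    """Pair each read k-mer's frequency with its copy number in the assembly."""
    return {
        kmer: (freq, assembly_counts.get(kmer, 0))
        for kmer, freq in read_counts.items()
    }


# Toy data (made up for illustration).
reads = ["ACGTAC", "ACGTAC"]
assembly = ["ACGT"]

read_counts = count_kmers(reads, k=3)
assembly_counts = count_kmers(assembly, k=3)

print(kmer_spectrum(read_counts))
print(compare_to_assembly(read_counts, assembly_counts))
```

In such a comparison, k-mers that are frequent in the reads but have copy number 0 in the assembly point to missing content, while k-mers present in the assembly but absent from the reads point to possible errors or contamination.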

Subjects

Genome, Plant ; High-Throughput Nucleotide Sequencing/standards ; Quality Control ; Sequence Analysis, DNA/standards ; Software ; Fraxinus/genetics ; High-Throughput Nucleotide Sequencing/methods ; Sequence Analysis, DNA/methods

Reconstructing (super)trees from data sets with missing distances: not all is lost.

Kettleborough, George; Dicks, Jo; Roberts, Ian N; Huber, Katharina T.

Mol Biol Evol ; 32(6): 1628-42, 2015 Jun.

Article in English | MEDLINE | ID: mdl-25657329

ABSTRACT

The wealth of phylogenetic information accumulated over many decades of biological research, coupled with recent technological advances in molecular sequence generation, presents significant opportunities for researchers to investigate relationships across and within the kingdoms of life. However, to make best use of this data wealth, several problems must first be overcome. One key problem is finding effective strategies to deal with missing data. Here, we introduce Lasso, a novel heuristic approach for reconstructing rooted phylogenetic trees from distance matrices with missing values, for data sets where a molecular clock may be assumed. Contrary to other phylogenetic methods on partial data sets, Lasso possesses desirable properties such as its reconstructed trees being both unique and edge-weighted. These properties are achieved by Lasso restricting its leaf set to a large subset of all possible taxa, which in many practical situations is the entire taxa set. Furthermore, the Lasso approach is distance-based, rendering it very fast to run and suitable for data sets of all sizes, including large data sets such as those generated by modern Next Generation Sequencing technologies. To better understand the performance of Lasso, we assessed it by means of artificial and real biological data sets, showing its effectiveness in the presence of missing data. Furthermore, by formulating the supermatrix problem as a particular case of the missing data problem, we assessed Lasso's ability to reconstruct supertrees. We demonstrate that, although not specifically designed for such a purpose, Lasso performs better than or comparably with five leading supertree algorithms on a challenging biological data set. Finally, we make freely available a software implementation of Lasso so that researchers may, for the first time, perform both rooted tree and supertree reconstruction with branch lengths on their own partial data sets.

Subjects

Databases, Genetic ; Models, Genetic ; Phylogeny ; Algorithms ; High-Throughput Nucleotide Sequencing/methods ; Saccharomyces cerevisiae/classification ; Saccharomyces cerevisiae/genetics ; Software ; Triticum/classification ; Triticum/genetics
