Results 1 - 20 of 58
1.
Mol Cell ; 77(6): 1307-1321.e10, 2020 03 19.
Article in English | MEDLINE | ID: mdl-31954095

ABSTRACT

A comprehensive catalog of cancer driver mutations is essential for understanding tumorigenesis and developing therapies. Exome-sequencing studies have mapped many protein-coding drivers, yet few non-coding drivers are known because genome-wide discovery is challenging. We developed a driver discovery method, ActiveDriverWGS, and analyzed 120,788 cis-regulatory modules (CRMs) across 1,844 whole tumor genomes from the ICGC-TCGA PCAWG project. We found 30 CRMs with enriched SNVs and indels (FDR < 0.05). These frequently mutated regulatory elements (FMREs) were ubiquitously active in human tissues, showed long-range chromatin interactions and mRNA abundance associations with target genes, and were enriched in motif-rewiring mutations and structural variants. Genomic deletion of one FMRE in human cells caused proliferative deficiencies and transcriptional deregulation of cancer genes CCNB1IP1, CDH1, and CDKN2B, validating observations in FMRE-mutated tumors. Pathway analysis revealed further sub-significant FMREs at cancer genes and processes, indicating an unexplored landscape of infrequent driver mutations in the non-coding genome.


Subject(s)
Biomarkers, Tumor/genetics , Chromatin/metabolism , Gene Regulatory Networks , Mutation , Neoplasms/genetics , Neoplasms/pathology , Regulatory Sequences, Nucleic Acid , Cell Proliferation , Chromatin/genetics , Computational Biology/methods , DNA Mutational Analysis , Genome, Human , HEK293 Cells , Humans
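The abstract for entry 1 reports 30 CRMs at FDR < 0.05 but does not say how the false discovery rate was computed. As a minimal sketch, assuming the Benjamini-Hochberg procedure (an assumption, not confirmed by the abstract) and made-up p-values, a correction of this kind could look like:

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-element p-values, for illustration only
example_p = [0.001, 0.008, 0.039, 0.041, 0.6]
q = benjamini_hochberg(example_p)
significant = [i for i, value in enumerate(q) if value < 0.05]
```

Elements whose adjusted value falls below 0.05 would be reported as significant; the actual ActiveDriverWGS statistics are not reproduced here.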
2.
Hum Genet ; 142(2): 181-192, 2023 Feb.
Article in English | MEDLINE | ID: mdl-36331656

ABSTRACT

Rapid advancements of genome sequencing (GS) technologies have enhanced our understanding of the relationship between genes and human disease. To incorporate genomic information into the practice of medicine, new processes for the analysis, reporting, and communication of GS data are needed. Blood samples were collected from adults with a PCR-confirmed SARS-CoV-2 (COVID-19) diagnosis (target N = 1500). GS was performed. Data were filtered and analyzed using custom pipelines and gene panels. We developed unique patient-facing materials, including an online intake survey, group counseling presentation, and consultation letters in addition to a comprehensive GS report. The final report includes results generated from GS data: (1) monogenic disease risks; (2) carrier status; (3) pharmacogenomic variants; (4) polygenic risk scores for common conditions; (5) HLA genotype; (6) genetic ancestry; (7) blood group; and, (8) COVID-19 viral lineage. Participants complete pre-test genetic counseling and confirm preferences for secondary findings before receiving results. Counseling and referrals are initiated for clinically significant findings. We developed a genetic counseling, reporting, and return of results framework that integrates GS information across multiple areas of human health, presenting possibilities for the clinical application of comprehensive GS data in healthy individuals.


Subject(s)
COVID-19 , Genetic Counseling , Adult , Humans , COVID-19/epidemiology , COVID-19/genetics , SARS-CoV-2/genetics , Genomics/methods , Genotype
3.
Nat Methods ; 17(12): 1191-1199, 2020 12.
Article in English | MEDLINE | ID: mdl-33230324

ABSTRACT

Probing epigenetic features on DNA has tremendous potential to advance our understanding of the phased epigenome. In this study, we use nanopore sequencing to evaluate CpG methylation and chromatin accessibility simultaneously on long strands of DNA by applying GpC methyltransferase to exogenously label open chromatin. We performed nanopore sequencing of nucleosome occupancy and methylome (nanoNOMe) on four human cell lines (GM12878, MCF-10A, MCF-7 and MDA-MB-231). The single-molecule resolution allows footprinting of protein and nucleosome binding, and determination of the combinatorial promoter epigenetic signature on individual molecules. Long-read sequencing makes it possible to robustly assign reads to haplotypes, allowing us to generate a fully phased human epigenome, consisting of chromosome-level allele-specific profiles of CpG methylation and chromatin accessibility. We further apply this to a breast cancer model to evaluate differential methylation and accessibility between cancerous and noncancerous cells.


Subject(s)
Breast Neoplasms/genetics , Chromatin/genetics , DNA Methylation/genetics , Nanopore Sequencing/methods , Cell Line, Tumor , CpG Islands/genetics , DNA/metabolism , Epigenome/genetics , Female , Genome, Human/genetics , Humans , MCF-7 Cells , Methyltransferases/metabolism , Promoter Regions, Genetic/genetics , Sequence Analysis, DNA
5.
Nat Methods ; 16(5): 429-436, 2019 05.
Article in English | MEDLINE | ID: mdl-31011185

ABSTRACT

Replication of eukaryotic genomes is highly stochastic, making it difficult to determine the replication dynamics of individual molecules with existing methods. We report a sequencing method for the measurement of replication fork movement on single molecules by detecting nucleotide analog signal currents on extremely long nanopore traces (D-NAscent). Using this method, we detect 5-bromodeoxyuridine (BrdU) incorporated by Saccharomyces cerevisiae to reveal, at a genomic scale and on single molecules, the DNA sequences replicated during a pulse-labeling period. Under conditions of limiting BrdU concentration, D-NAscent detects the differences in BrdU incorporation frequency across individual molecules to reveal the location of active replication origins, fork direction, termination sites, and fork pausing/stalling events. We used sequencing reads of 20-160 kilobases to generate a whole-genome single-molecule map of DNA replication dynamics and discover a class of low-frequency stochastic origins in budding yeast. The D-NAscent software is available at https://github.com/MBoemo/DNAscent.git .


Subject(s)
DNA Replication , Genome, Fungal , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Nanopores , Saccharomyces cerevisiae/genetics , Bromodeoxyuridine/metabolism , DNA, Fungal/genetics , Genome , Software
6.
Nat Methods ; 16(12): 1297-1305, 2019 12.
Article in English | MEDLINE | ID: mdl-31740818

ABSTRACT

High-throughput complementary DNA sequencing technologies have advanced our understanding of transcriptome complexity and regulation. However, these methods lose information contained in biological RNA because the copied reads are often short and modifications are not retained. We address these limitations using a native poly(A) RNA sequencing strategy developed by Oxford Nanopore Technologies. Our study generated 9.9 million aligned sequence reads for the human cell line GM12878, using thirty MinION flow cells at six institutions. These native RNA reads had a median length of 771 bases, and a maximum aligned length of over 21,000 bases. Mitochondrial poly(A) reads provided an internal measure of read-length quality. We combined these long nanopore reads with higher accuracy short-reads and annotated GM12878 promoter regions to identify 33,984 plausible RNA isoforms. We describe strategies for assessing 3' poly(A) tail length, base modifications and transcript haplotypes.


Subject(s)
Nanopore Sequencing/methods , Poly A/genetics , Sequence Analysis, RNA/methods , Transcriptome , Cells, Cultured , Humans
7.
Nature ; 538(7625): 378-382, 2016 Oct 20.
Article in English | MEDLINE | ID: mdl-27732578

ABSTRACT

Pancreatic cancer, a highly aggressive tumour type with uniformly poor prognosis, exemplifies the classically held view of stepwise cancer development. The current model of tumorigenesis, based on analyses of precursor lesions, termed pancreatic intraepithelial neoplasm (PanINs) lesions, makes two predictions: first, that pancreatic cancer develops through a particular sequence of genetic alterations (KRAS, followed by CDKN2A, then TP53 and SMAD4); and second, that the evolutionary trajectory of pancreatic cancer progression is gradual because each alteration is acquired independently. A shortcoming of this model is that clonally expanded precursor lesions do not always belong to the tumour lineage, indicating that the evolutionary trajectory of the tumour lineage and precursor lesions can be divergent. This prevailing model of tumorigenesis has contributed to the clinical notion that pancreatic cancer evolves slowly and presents at a late stage. However, the propensity for this disease to rapidly metastasize and the inability to improve patient outcomes, despite efforts aimed at early detection, suggest that pancreatic cancer progression is not gradual. Here, using newly developed informatics tools, we tracked changes in DNA copy number and their associated rearrangements in tumour-enriched genomes and found that pancreatic cancer tumorigenesis is neither gradual nor follows the accepted mutation order. Two-thirds of tumours harbour complex rearrangement patterns associated with mitotic errors, consistent with punctuated equilibrium as the principal evolutionary trajectory. In a subset of cases, the consequence of such errors is the simultaneous, rather than sequential, knockout of canonical preneoplastic genetic drivers that are likely to set-off invasive cancer growth. These findings challenge the current progression model of pancreatic cancer and provide insights into the mutational processes that give rise to these aggressive tumours.


Subject(s)
Carcinogenesis/genetics , Carcinogenesis/pathology , Gene Rearrangement/genetics , Genome, Human/genetics , Models, Biological , Mutagenesis/genetics , Pancreatic Neoplasms/genetics , Pancreatic Neoplasms/pathology , Carcinoma in Situ/genetics , Chromothripsis , DNA Copy Number Variations/genetics , Disease Progression , Evolution, Molecular , Female , Genes, Neoplasm/genetics , Humans , Male , Mitosis/genetics , Mutation/genetics , Neoplasm Invasiveness/genetics , Neoplasm Invasiveness/pathology , Neoplasm Metastasis/genetics , Neoplasm Metastasis/pathology , Polyploidy , Precancerous Conditions/genetics
8.
Nature ; 530(7589): 228-232, 2016 Feb 11.
Article in English | MEDLINE | ID: mdl-26840485

ABSTRACT

The Ebola virus disease epidemic in West Africa is the largest on record, responsible for over 28,599 cases and more than 11,299 deaths. Genome sequencing in viral outbreaks is desirable to characterize the infectious agent and determine its evolutionary rate. Genome sequencing also allows the identification of signatures of host adaptation, identification and monitoring of diagnostic targets, and characterization of responses to vaccines and treatments. The Ebola virus (EBOV) genome substitution rate in the Makona strain has been estimated at between 0.87 × 10⁻³ and 1.42 × 10⁻³ mutations per site per year. This is equivalent to 16-27 mutations in each genome, meaning that sequences diverge rapidly enough to identify distinct sub-lineages during a prolonged epidemic. Genome sequencing provides a high-resolution view of pathogen evolution and is increasingly sought after for outbreak surveillance. Sequence data may be used to guide control measures, but only if the results are generated quickly enough to inform interventions. Genomic surveillance during the epidemic has been sporadic owing to a lack of local sequencing capacity coupled with practical difficulties transporting samples to remote sequencing facilities. To address this problem, here we devise a genomic surveillance system that utilizes a novel nanopore DNA sequencing instrument. In April 2015 this system was transported in standard airline luggage to Guinea and used for real-time genomic surveillance of the ongoing epidemic. We present sequence data and analysis of 142 EBOV samples collected during the period March to October 2015. We were able to generate results less than 24 h after receiving an Ebola-positive sample, with the sequencing process taking as little as 15-60 min. We show that real-time genomic surveillance is possible in resource-limited settings and can be established rapidly to monitor outbreaks.


Subject(s)
Ebolavirus/genetics , Epidemiological Monitoring , Genome, Viral/genetics , Hemorrhagic Fever, Ebola/epidemiology , Hemorrhagic Fever, Ebola/virology , Sequence Analysis, DNA/instrumentation , Sequence Analysis, DNA/methods , Aircraft , Disease Outbreaks/statistics & numerical data , Ebolavirus/classification , Ebolavirus/pathogenicity , Guinea/epidemiology , Humans , Mutagenesis/genetics , Mutation Rate , Time Factors
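The "16-27 mutations in each genome" figure in the abstract for entry 8 appears to follow from multiplying the per-site yearly rate by the genome length. The sketch below reproduces that arithmetic; the genome length of 18,959 nt is an assumption (the abstract does not state it), as is reading the result as a per-year count.

```python
# Assumed EBOV genome length in nucleotides (approximate; not given in the abstract)
GENOME_LENGTH_NT = 18_959


def expected_substitutions(rate_per_site_per_year, genome_length=GENOME_LENGTH_NT, years=1.0):
    """Expected number of substitutions per genome over the given time span."""
    return rate_per_site_per_year * genome_length * years


# Lower and upper substitution-rate estimates quoted in the abstract
low = expected_substitutions(0.87e-3)
high = expected_substitutions(1.42e-3)
print(f"{low:.1f} to {high:.1f} substitutions per genome per year")
```

Under these assumptions the two bounds round to 16 and 27, matching the range quoted in the abstract.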
9.
BMC Bioinformatics ; 21(1): 343, 2020 Aug 05.
Article in English | MEDLINE | ID: mdl-32758139

ABSTRACT

BACKGROUND: Nanopore sequencing enables portable, real-time sequencing applications, including point-of-care diagnostics and in-the-field genotyping. Achieving these outcomes requires efficient bioinformatic algorithms for the analysis of raw nanopore signal data. However, comparing raw nanopore signals to a biological reference sequence is a computationally complex task. The dynamic programming algorithm called Adaptive Banded Event Alignment (ABEA) is a crucial step in polishing sequencing data and identifying non-standard nucleotides, such as measuring DNA methylation. Here, we parallelise and optimise an implementation of the ABEA algorithm (termed f5c) to efficiently run on heterogeneous CPU-GPU architectures. RESULTS: By optimising memory, computations and load balancing between CPU and GPU, we demonstrate how f5c can perform ∼3-5 × faster than an optimised version of the original CPU-only implementation of ABEA in the Nanopolish software package. We also show that f5c enables DNA methylation detection on-the-fly using an embedded System on Chip (SoC) equipped with GPUs. CONCLUSIONS: Our work not only demonstrates that complex genomics analyses can be performed on lightweight computing systems, but also benefits High-Performance Computing (HPC). The associated source code for f5c along with GPU optimised ABEA is available at https://github.com/hasindu2008/f5c .


Subject(s)
Computer Graphics , Nanopores , Signal Processing, Computer-Assisted , Algorithms , Computational Biology , Databases as Topic , Genome, Human , Humans , Sequence Analysis
10.
Genome Res ; 27(2): 300-309, 2017 02.
Article in English | MEDLINE | ID: mdl-27986821

ABSTRACT

We are rapidly approaching the point where we have sequenced millions of human genomes. There is a pressing need for new data structures to store raw sequencing data and efficient algorithms for population scale analysis. Current reference-based data formats do not fully exploit the redundancy in population sequencing nor take advantage of shared genetic variation. In recent years, the Burrows-Wheeler transform (BWT) and FM-index have been widely employed as a full-text searchable index for read alignment and de novo assembly. We introduce the concept of a population BWT and use it to store and index the sequencing reads of 2705 samples from the 1000 Genomes Project. A key feature is that, as more genomes are added, identical read sequences are increasingly observed, and compression becomes more efficient. We assess the support in the 1000 Genomes read data for every base position of two human reference assembly versions, identifying that 3.2 Mbp with population support was lost in the transition from GRCh37 with 13.7 Mbp added to GRCh38. We show that the vast majority of variant alleles can be uniquely described by overlapping 31-mers and show how rapid and accurate SNP and indel genotyping can be carried out across the genomes in the population BWT. We use the population BWT to carry out nonreference queries to search for the presence of all known viral genomes and discover human T-lymphotropic virus 1 integrations in six samples in a recognized epidemiological distribution.


Subject(s)
Genome, Human/genetics , Genomics , Sequence Alignment/methods , Whole Genome Sequencing/methods , Alleles , Data Compression , Genotype , Humans , INDEL Mutation/genetics , Sequence Analysis, DNA , Software
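The abstract for entry 10 relies on the Burrows-Wheeler transform and FM-index as a full-text searchable index. As a toy illustration only (not the population BWT implementation, which is compressed and far more elaborate), the sketch below builds a BWT from sorted rotations and counts pattern occurrences by backward search. The function names and the example sequence are hypothetical.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (fine for toy inputs only)."""
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count occurrences of pattern in the original text using backward search."""
    counts = Counter(bwt_string)
    # first[c]: number of characters in the text that sort strictly before c
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    def occ(c, i):
        # Occurrences of c in bwt_string[:i]; a naive scan, kept for clarity
        return bwt_string[:i].count(c)

    lo, hi = 0, len(bwt_string)  # half-open interval of matching suffixes
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("ACGTACGTTACG")
hits = count_occurrences(index, "ACG")
```

A practical FM-index would replace the naive `occ` scan with sampled rank tables and store the BWT in compressed form; the interval-narrowing logic stays the same.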
11.
Genome Res ; 27(5): 849-864, 2017 05.
Article in English | MEDLINE | ID: mdl-28396521

ABSTRACT

The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.


Subject(s)
Contig Mapping/methods , Genome, Human , Genomics/methods , Sequence Analysis, DNA/methods , Software , Contig Mapping/standards , Genomics/standards , Haploidy , Haplotypes , Humans , Polymorphism, Genetic , Reference Standards , Sequence Analysis, DNA/standards
12.
Nat Methods ; 14(4): 407-410, 2017 Apr.
Article in English | MEDLINE | ID: mdl-28218898

ABSTRACT

In nanopore sequencing devices, electrolytic current signals are sensitive to base modifications, such as 5-methylcytosine (5-mC). Here we quantified the strength of this effect for the Oxford Nanopore Technologies MinION sequencer. By using synthetically methylated DNA, we were able to train a hidden Markov model to distinguish 5-mC from unmethylated cytosine. We applied our method to sequence the methylome of human DNA, without requiring special steps for library preparation.


Subject(s)
5-Methylcytosine/analysis , Cytosine/metabolism , DNA Methylation , Genome, Human , Cell Line, Tumor , CpG Islands , Cytosine/analysis , Escherichia coli/genetics , Humans , Markov Chains , Nanopores
13.
Annu Rev Genomics Hum Genet ; 16: 153-72, 2015.
Article in English | MEDLINE | ID: mdl-25939056

ABSTRACT

The current genomic revolution was made possible by joint advances in genome sequencing technologies and computational approaches for analyzing sequence data. The close interaction between biologists and computational scientists is perhaps most apparent in the development of approaches for sequencing entire genomes, a feat that would not be possible without sophisticated computational tools called genome assemblers (short for genome sequence assemblers). Here, we survey the key developments in algorithms for assembling genome sequences since the development of the first DNA sequencing methods more than 35 years ago.


Subject(s)
Algorithms , Genomics/methods , Sequence Analysis, DNA/methods , Chromosomes, Artificial, Bacterial , Cloning, Molecular , Computer Graphics , Genome , Humans
15.
Nat Methods ; 12(8): 733-5, 2015 Aug.
Article in English | MEDLINE | ID: mdl-26076426

ABSTRACT

We have assembled de novo the Escherichia coli K-12 MG1655 chromosome in a single 4.6-Mb contig using only nanopore data. Our method has three stages: (i) overlaps are detected between reads and then corrected by a multiple-alignment process; (ii) corrected reads are assembled using the Celera Assembler; and (iii) the assembly is polished using a probabilistic model of the signal-level data. The assembly reconstructs gene order and has 99.5% nucleotide identity.


Subject(s)
Computational Biology/methods , Escherichia coli K12/genetics , Genome, Bacterial , Nanopores , Nanotechnology/methods , Sequence Analysis, DNA/methods , Algorithms , Contig Mapping/methods , High-Throughput Nucleotide Sequencing/methods , Reproducibility of Results , Software
16.
Bioinformatics ; 33(1): 49-55, 2017 01 01.
Article in English | MEDLINE | ID: mdl-27614348

ABSTRACT

MOTIVATION: The highly portable Oxford Nanopore MinION sequencer has enabled new applications of genome sequencing directly in the field. However, the MinION currently relies on a cloud computing platform, Metrichor (metrichor.com), for translating locally generated sequencing data into basecalls. RESULTS: To allow offline and private analysis of MinION data, we created Nanocall. Nanocall is the first freely available, open-source basecaller for Oxford Nanopore sequencing data and does not require an internet connection. Using R7.3 chemistry, on two E. coli and two human samples, with natural as well as PCR-amplified DNA, Nanocall reads have ∼68% identity, directly comparable to Metrichor '1D' data. Further, Nanocall is efficient, processing ∼2500 Kbp of sequence per core hour using the fastest settings, and fully parallelized. Using a 4 core desktop computer, Nanocall could basecall a MinION sequencing run in real time. Metrichor provides the ability to integrate the '1D' sequencing of template and complement strands of a single DNA molecule, and create a '2D' read. Nanocall does not currently integrate this technology, and addition of this capability will be an important future development. In summary, Nanocall is the first open-source, freely available, off-line basecaller for Oxford Nanopore sequencing data. AVAILABILITY AND IMPLEMENTATION: Nanocall is available at github.com/mateidavid/nanocall, released under the MIT license. CONTACT: matei.david@oicr.on.ca SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
DNA/analysis , Sequence Analysis, DNA/methods , Software , Escherichia coli/genetics , Humans , Polymerase Chain Reaction
17.
Nature ; 483(7388): 169-75, 2012 Mar 07.
Article in English | MEDLINE | ID: mdl-22398555

ABSTRACT

Gorillas are humans' closest living relatives after chimpanzees, and are of comparable importance for the study of human origins and evolution. Here we present the assembly and analysis of a genome sequence for the western lowland gorilla, and compare the whole genomes of all extant great ape genera. We propose a synthesis of genetic and fossil evidence consistent with placing the human-chimpanzee and human-chimpanzee-gorilla speciation events at approximately 6 and 10 million years ago. In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other; this is rarer around coding genes, indicating pervasive selection throughout great ape evolution, and has functional consequences in gene expression. A comparison of protein coding genes reveals approximately 500 genes showing accelerated evolution on each of the gorilla, human and chimpanzee lineages, and evidence for parallel acceleration, particularly of genes involved in hearing. We also compare the western and eastern gorilla species, estimating an average sequence divergence time 1.75 million years ago, but with evidence for more recent genetic exchange and a population bottleneck in the eastern species. The use of the genome sequence in these and future analyses will promote a deeper understanding of great ape biology and evolution.


Subject(s)
Evolution, Molecular , Genetic Speciation , Genome/genetics , Gorilla gorilla/genetics , Animals , Female , Gene Expression Regulation , Genetic Variation/genetics , Genomics , Humans , Macaca mulatta/genetics , Molecular Sequence Data , Pan troglodytes/genetics , Phylogeny , Pongo/genetics , Proteins/genetics , Sequence Alignment , Species Specificity , Transcription, Genetic
19.
Mol Biol Evol ; 31(4): 872-88, 2014 Apr.
Article in English | MEDLINE | ID: mdl-24425782

ABSTRACT

The question of how genetic variation in a population influences phenotypic variation and evolution is of major importance in modern biology. Yet much is still unknown about the relative functional importance of different forms of genome variation and how they are shaped by evolutionary processes. Here we address these questions by population level sequencing of 42 strains from the budding yeast Saccharomyces cerevisiae and its closest relative S. paradoxus. We find that genome content variation, in the form of presence or absence as well as copy number of genetic material, is higher within S. cerevisiae than within S. paradoxus, despite genetic distances as measured in single-nucleotide polymorphisms being vastly smaller within the former species. This genome content variation, as well as loss-of-function variation in the form of premature stop codons and frameshifting indels, is heavily enriched in the subtelomeres, strongly reinforcing the relevance of these regions to functional evolution. Genes affected by these likely functional forms of variation are enriched for functions mediating interaction with the external environment (sugar transport and metabolism, flocculation, metal transport, and metabolism). Our results and analyses provide a comprehensive view of genomic diversity in budding yeast and expose surprising and pronounced differences between the variation within S. cerevisiae and that within S. paradoxus. We also believe that the sequence data and de novo assemblies will constitute a useful resource for further evolutionary and population genomics studies.


Subject(s)
Genes, Fungal , Saccharomyces cerevisiae/genetics , Arsenites/pharmacology , DNA Copy Number Variations , Drug Resistance, Fungal/genetics , Evolution, Molecular , Genetic Linkage , Genetic Speciation , Genome, Fungal , Molecular Sequence Annotation , Multigene Family , Phylogeny , Polymorphism, Single Nucleotide , Saccharomyces cerevisiae/drug effects , Saccharomyces cerevisiae/growth & development , Sequence Analysis, DNA , Sodium Compounds/pharmacology
20.
Genome Res ; 22(3): 549-56, 2012 Mar.
Article in English | MEDLINE | ID: mdl-22156294

ABSTRACT

De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error-correct, assemble, and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs <200 bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides the first practical assembler to our knowledge for a mammalian-sized genome on a low-end computing cluster.


Subject(s)
Genomics/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Animals , Computational Biology/methods , Data Compression , Humans , Internet , Reproducibility of Results