Results 1 - 20 of 82
1.
Nat Methods ; 21(1): 41-49, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38036856

ABSTRACT

Complete, telomere-to-telomere (T2T) genome assemblies promise improved analyses and the discovery of new variants, but many essential genomic resources remain associated with older reference genomes. Thus, there is a need to translate genomic features and read alignments between references. Here we describe a method called levioSAM2 that performs fast and accurate lift-over between assemblies using a whole-genome map. In addition to enabling the use of several references, we demonstrate that aligning reads to a high-quality reference (for example, T2T-CHM13) and lifting to an older reference (for example, Genome Reference Consortium (GRC)h38) improves the accuracy of the resulting variant calls on the old reference. By leveraging the quality improvements of T2T-CHM13, levioSAM2 reduces small and structural variant calling errors compared with GRC-based mapping using real short- and long-read datasets. Performance is especially improved for a set of complex medically relevant genes, where the GRC references are lower quality.


Subject(s)
Genome; Genomics; Sequence Analysis, DNA/methods; Genomics/methods; Chromosome Mapping; High-Throughput Nucleotide Sequencing
2.
Genome Res ; 33(7): 1218-1227, 2023 07.
Article in English | MEDLINE | ID: mdl-37414575

ABSTRACT

A genomic sketch is a small, probabilistic representation of the set of k-mers in a sequencing data set. Sketches are building blocks for large-scale analyses that consider similarities between many pairs of sequences or sequence collections. Although existing tools can easily compare tens of thousands of genomes, data sets can reach millions of sequences and beyond. Popular tools also fail to consider k-mer multiplicities, making them less applicable in quantitative settings. Here, we describe a method called Dashing 2 that builds on the SetSketch data structure. SetSketch is related to HyperLogLog (HLL) but discards use of leading zero count in favor of a truncated logarithm of adjustable base. Unlike HLL, SetSketch can perform multiplicity-aware sketching when combined with the ProbMinHash method. Dashing 2 integrates locality-sensitive hashing to scale all-pairs comparisons to millions of sequences. It achieves superior similarity estimates for the Jaccard coefficient and average nucleotide identity compared with the original Dashing, but in much less time while using the same-sized sketch. Dashing 2 is a free, open source software.
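
The sketch-and-compare idea in this abstract can be illustrated with a much simpler scheme. The code below is not Dashing 2 and does not implement SetSketch or ProbMinHash; it assumes a plain bottom-k MinHash over k-mers and ignores multiplicities. The function names, the k-mer length, the sketch size and the use of blake2b as the hash are all hypothetical choices made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (blake2b is a stand-in choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_k_sketch(seq, k=8, size=64):
    """Keep the `size` smallest distinct k-mer hashes as the sketch."""
    return sorted(hash64(km) for km in kmers(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size=64):
    """Estimate the Jaccard coefficient from two bottom-k sketches."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    # The `size` smallest hashes of the union act as a random sample of it.
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)
```

When the sketch size is at least the number of distinct k-mers, the estimate equals the exact Jaccard coefficient (barring hash collisions); smaller sketches trade accuracy for space.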


Subject(s)
Genomics; Software; Genomics/methods; Genome; Nucleotides; Algorithms; Sequence Analysis, DNA/methods
3.
Genome Res ; 33(7): 1069-1077, 2023 07.
Article in English | MEDLINE | ID: mdl-37258301

ABSTRACT

Tools that classify sequencing reads against a database of reference sequences require efficient index data-structures. The r-index is a compressed full-text index that answers substring presence/absence, count, and locate queries in space proportional to the amount of distinct sequence in the database: [Formula: see text] space, where r is the number of Burrows-Wheeler runs. To date, the r-index has lacked the ability to quickly classify matches according to which reference sequences (or sequence groupings, i.e., taxa) a match overlaps. We present new algorithms and methods for solving this problem. Specifically, given a collection D of d documents, [Formula: see text] over an alphabet of size σ, we extend the r-index with [Formula: see text] additional words to support document listing queries for a pattern [Formula: see text] that occurs in [Formula: see text] documents in D in [Formula: see text] time and [Formula: see text] space, where w is the machine word size. Applied in a bacterial mock community experiment, our method is up to three times faster than a comparable method that uses the standard r-index locate queries. We show that our method classifies both simulated and real nanopore reads at the strain level with higher accuracy compared with other approaches. Finally, we present strategies for compacting this structure in applications in which read lengths or match lengths can be bounded.
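
To make the quantity r concrete, the following toy sketch computes a Burrows-Wheeler transform and counts its runs. It is only an illustration of what "number of Burrows-Wheeler runs" means, not an r-index and not the document-listing structure described in the abstract. The function names are hypothetical, and the rotation-sorting construction is a deliberately naive choice that would not scale to real databases.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    s = text + "$"  # '$' is the terminator and sorts before A, C, G, T
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Count maximal runs of equal characters; for a BWT this is r."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


transformed = bwt("banana")
print(transformed, count_runs(transformed))  # prints: annb$aa 5
```

Repetitive collections tend to produce long runs in the transform, which is why space proportional to r can be far smaller than the text itself.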


Subject(s)
Algorithms; Bacteria; Sequence Analysis; Bacteria/genetics
4.
Bioinformatics ; 40(Suppl 1): i287-i296, 2024 06 28.
Article in English | MEDLINE | ID: mdl-38940135

ABSTRACT

SUMMARY: Improvements in nanopore sequencing necessitate efficient classification methods, including pre-filtering and adaptive sampling algorithms that enrich for reads of interest. Signal-based approaches circumvent the computational bottleneck of basecalling. But past methods for signal-based classification do not scale efficiently to large, repetitive references like pangenomes, limiting their utility to partial references or individual genomes. We introduce Sigmoni: a rapid, multiclass classification method based on the r-index that scales to references of hundreds of Gbps. Sigmoni quantizes nanopore signal into a discrete alphabet of picoamp ranges. It performs rapid, approximate matching using matching statistics, classifying reads based on distributions of picoamp matching statistics and co-linearity statistics, all in linear query time without the need for seed-chain-extend. Sigmoni is 10-100× faster than previous methods for adaptive sampling in host depletion experiments with improved accuracy, and can query reads against large microbial or human pangenomes. Sigmoni is the first signal-based tool to scale to a complete human genome and pangenome while remaining fast enough for adaptive sampling applications. AVAILABILITY AND IMPLEMENTATION: Sigmoni is implemented in Python, and is available open-source at https://github.com/vshiv18/sigmoni.
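
The quantization step mentioned in the abstract can be pictured as binning current values into a small alphabet. The sketch below is not Sigmoni's code: it assumes equal-width bins, and the function name, the 60 to 120 pA range and the six-symbol alphabet are hypothetical values chosen only to show the idea.

```python
def quantize(signal_pa, lo=60.0, hi=120.0, n_bins=6):
    """Map each current value (pA) to a symbol index in 0..n_bins-1.

    Bins have equal width between lo and hi; values outside the
    range are clamped to the first or last bin.
    """
    width = (hi - lo) / n_bins
    symbols = []
    for value in signal_pa:
        index = int((value - lo) // width)
        symbols.append(min(max(index, 0), n_bins - 1))
    return symbols


print(quantize([55, 60, 69.9, 70, 119.9, 130]))  # prints: [0, 0, 0, 1, 5, 5]
```

Once the signal is a string over a small alphabet, it can be matched against an index with the same exact-matching machinery used for nucleotide sequences.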


Subject(s)
Algorithms; Humans; Nanopore Sequencing/methods; Software; Nanopores; Genome, Human; Genomics/methods; Sequence Analysis, DNA/methods
5.
Nat Rev Genet ; 19(4): 208-219, 2018 04.
Article in English | MEDLINE | ID: mdl-29379135

ABSTRACT

Next-generation sequencing has made major strides in the past decade. Studies based on large sequencing data sets are growing in number, and public archives for raw sequencing data have been doubling in size every 18 months. Leveraging these data requires researchers to use large-scale computational resources. Cloud computing, a model whereby users rent computers and storage from large data centres, is a solution that is gaining traction in genomics research. Here, we describe how cloud computing is used in genomics for research and large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make it ideally suited for the large-scale reanalysis of publicly available archived data, including privacy-protected data.


Subject(s)
Cloud Computing; Genomics; High-Throughput Nucleotide Sequencing; Internet; Computational Biology; Humans
6.
Nat Rev Genet ; 19(5): 325, 2018 05.
Article in English | MEDLINE | ID: mdl-29430012

ABSTRACT

This corrects the article DOI: 10.1038/nrg.2017.113.

7.
Genome Res ; 30(7): 1073-1081, 2020 07.
Article in English | MEDLINE | ID: mdl-32079618

ABSTRACT

Long noncoding RNAs (lncRNAs) have emerged as key coordinators of biological and cellular processes. Characterizing lncRNA expression across cells and tissues is key to understanding their role in determining phenotypes, including human diseases. We present here FC-R2, a comprehensive expression atlas across a broadly defined human transcriptome, inclusive of over 109,000 coding and noncoding genes, as described in the FANTOM CAGE-Associated Transcriptome (FANTOM-CAT) study. This atlas greatly extends the gene annotation used in the original recount2 resource. We demonstrate the utility of the FC-R2 atlas by reproducing key findings from published large studies and by generating new results across normal and diseased human samples. In particular, we (a) identify tissue-specific transcription profiles for distinct classes of coding and noncoding genes, (b) perform differential expression analysis across thirteen cancer types, identifying novel noncoding genes potentially involved in tumor pathogenesis and progression, and (c) confirm the prognostic value for several enhancer lncRNAs expression in cancer. Our resource is instrumental for the systematic molecular characterization of lncRNA by the FANTOM6 Consortium. In conclusion, comprised of over 70,000 samples, the FC-R2 atlas will empower other researchers to investigate functions and biological roles of both known coding genes and novel lncRNAs.


Subject(s)
Transcriptome; Databases, Genetic; Enhancer Elements, Genetic; Gene Expression Profiling; Genome, Human; Humans; Neoplasms/genetics; Organ Specificity; Prognosis; RNA, Long Noncoding/genetics; RNA, Long Noncoding/metabolism; RNA, Messenger/metabolism
8.
Biostatistics ; 2022 Sep 05.
Article in English | MEDLINE | ID: mdl-36063544

ABSTRACT

A standard unsupervised analysis is to cluster observations into discrete groups using a dissimilarity measure, such as Euclidean distance. If there does not exist a ground-truth label for each observation necessary for external validity metrics, then internal validity metrics, such as the tightness or separation of the clusters, are often used. However, the interpretation of these internal metrics can be problematic when using different dissimilarity measures as they have different magnitudes and ranges of values that they span. To address this problem, previous work introduced the "scale-agnostic" $G_{+}$ discordance metric; however, this internal metric is slow to calculate for large data. Furthermore, in the setting of unsupervised clustering with $k$ groups, we show that $G_{+}$ varies as a function of the proportion of observations assigned to each of the groups (or clusters), referred to as the group balance, which is an undesirable property. To address this problem, we propose a modification of $G_{+}$, referred to as $H_{+}$, and demonstrate that $H_{+}$ does not vary as a function of group balance using a simulation study and with public single-cell RNA-sequencing data. Finally, we provide scalable approaches to estimate $H_{+}$, which are available in the $\mathtt{fasthplus}$ R package.
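
A brute-force sketch may help show why the normalization matters. It rests on assumptions not stated in the abstract: that a discordant pair is a within-cluster dissimilarity exceeding a between-cluster dissimilarity, that $G_{+}$ divides the discordant count $s$ by the number of all pairs of dissimilarities, $N_d(N_d-1)/2$, and that $H_{+}$ divides $s$ by $|W||B|$, the number of within-between pairs. The function name is hypothetical, and this quadratic loop is not the scalable estimator provided by the package.

```python
from itertools import combinations


def discordance(points, labels, dist):
    """Return (g_plus, h_plus) under the definitions assumed above."""
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        (within if labels[i] == labels[j] else between).append(d)
    # s: number of (within, between) pairs where the within distance is larger
    s = sum(1 for w in within for b in between if w > b)
    n_d = len(within) + len(between)
    g_plus = 2 * s / (n_d * (n_d - 1)) if n_d > 1 else 0.0
    h_plus = s / (len(within) * len(between)) if within and between else 0.0
    return g_plus, h_plus


def abs_diff(a, b):
    return abs(a - b)


values = [0, 1, 10, 11]
print(discordance(values, [0, 0, 1, 1], abs_diff))  # well-separated labels
print(discordance(values, [0, 1, 0, 1], abs_diff))  # poorly chosen labels
```

Under these assumed definitions, the denominator of $G_{+}$ depends only on the total number of observations, whereas $|W||B|$ changes with the cluster sizes, which is one way to see how group balance enters.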

9.
Biostatistics ; 23(4): 1200-1217, 2022 10 14.
Article in English | MEDLINE | ID: mdl-35358296

ABSTRACT

Integrative analysis of multiple data sets has the potential of fully leveraging the vast amount of high throughput biological data being generated. In particular such analysis will be powerful in making inference from publicly available collections of genetic, transcriptomic and epigenetic data sets which are designed to study shared biological processes, but which vary in their target measurements, biological variation, unwanted noise, and batch variation. Thus, methods that enable the joint analysis of multiple data sets are needed to gain insights into shared biological processes that would otherwise be hidden by unwanted intra-data set variation. Here, we propose a method called two-stage linked component analysis (2s-LCA) to jointly decompose multiple biologically related experimental data sets with biological and technological relationships that can be structured into the decomposition. The consistency of the proposed method is established and its empirical performance is evaluated via simulation studies. We apply 2s-LCA to jointly analyze four data sets focused on human brain development and identify meaningful patterns of gene expression in human neurogenesis that have shared structure across these data sets.


Subject(s)
Transcriptome; Computer Simulation; Humans
10.
Bioinformatics ; 37(22): 4243-4245, 2021 11 18.
Article in English | MEDLINE | ID: mdl-34037690

ABSTRACT

MOTIVATION: As more population genetics datasets and population-specific references become available, the task of translating ('lifting') read alignments from one reference coordinate system to another is becoming more common. Existing tools generally require a chain file, whereas VCF files are the more common way to represent variation. Existing tools also do not make effective use of threads, creating a post-alignment bottleneck. RESULTS: LevioSAM is a tool for lifting SAM/BAM alignments from one reference to another using a VCF file containing population variants. LevioSAM uses succinct data structures and scales efficiently to many threads. When run downstream of a read aligner, levioSAM is more than 7 times faster than an aligner when both are run with 16 threads. AVAILABILITY AND IMPLEMENTATION: Software Package: https://github.com/alshai/levioSAM, Experiments: https://github.com/langmead-lab/levioSAM-experiments. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
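
The coordinate arithmetic behind lifting can be sketched for a single position. This is not levioSAM's implementation, which works on whole SAM/BAM records with succinct data structures. The sketch assumes a sorted, non-overlapping list of VCF-style variants given as hypothetical `(pos, ref, alt)` tuples with 0-based positions, and it lifts from the original reference to the variant-containing sequence; the function name and the clamping rule for positions inside a deletion are illustrative choices.

```python
def lift(pos, variants):
    """Lift a 0-based position from the reference to the variant-containing sequence.

    variants: sorted, non-overlapping (pos, ref, alt) tuples in reference
    coordinates, where ref and alt share their first base (VCF style).
    """
    shift = 0
    for vpos, ref, alt in variants:
        if vpos + len(ref) <= pos:
            # Variant lies entirely to the left: its length change shifts pos.
            shift += len(alt) - len(ref)
        elif vpos <= pos:
            # pos is inside the variant's reference span: clamp to the alt allele.
            offset = min(pos - vpos, len(alt) - 1)
            return vpos + shift + offset
        else:
            break  # remaining variants are to the right of pos
    return pos + shift


variants = [(5, "A", "ACG"), (10, "TTT", "T")]  # one insertion, one deletion
print([lift(p, variants) for p in (3, 6, 10, 11, 13)])  # prints: [3, 8, 12, 12, 13]
```

Lifting an alignment also requires updating its CIGAR string and other fields, which this sketch leaves out.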


Subject(s)
High-Throughput Nucleotide Sequencing; Software; Sequence Analysis, DNA
11.
Bioinformatics ; 37(18): 3014-3016, 2021 09 29.
Article in English | MEDLINE | ID: mdl-33693500

ABSTRACT

MOTIVATION: A common way to summarize sequencing datasets is to quantify data lying within genes or other genomic intervals. This can be slow and can require different tools for different input file types. RESULTS: Megadepth is a fast tool for quantifying alignments and coverage for BigWig and BAM/CRAM input files, using substantially less memory than the next-fastest competitor. Megadepth can summarize coverage within all disjoint intervals of the Gencode V35 gene annotation for more than 19 000 GTExV8 BigWig files in approximately 1 h using 32 threads. Megadepth is available both as a command-line tool and as an R/Bioconductor package providing much faster quantification compared to the rtracklayer package. AVAILABILITY AND IMPLEMENTATION: https://github.com/ChristopherWilks/megadepth, https://bioconductor.org/packages/megadepth. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
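
The task of summarizing coverage within intervals can be shown on a toy per-base coverage vector. The sketch below is not how Megadepth works internally and does not read BigWig or BAM/CRAM files; it assumes coverage is already a Python list and uses prefix sums, a generic technique chosen here for illustration. The function name and the half-open interval convention are hypothetical.

```python
from itertools import accumulate


def interval_sums(coverage, intervals):
    """Sum per-base coverage over half-open [start, end) intervals."""
    # prefix[i] holds the total coverage of bases 0..i-1
    prefix = [0] + list(accumulate(coverage))
    return [prefix[end] - prefix[start] for start, end in intervals]


coverage = [0, 2, 2, 3, 1, 0]
print(interval_sums(coverage, [(1, 4), (0, 6), (4, 5)]))  # prints: [7, 8, 1]
```

After one pass to build the prefix array, each interval costs a single subtraction, so the number of intervals has little effect on running time.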


Subject(s)
Genome; Genomics; Software; Molecular Sequence Annotation
12.
Bioinformatics ; 36(12): 3712-3718, 2020 06 01.
Article in English | MEDLINE | ID: mdl-32321164

ABSTRACT

MOTIVATION: Read alignment is central to many aspects of modern genomics. Most aligners use heuristics to accelerate processing, but these heuristics can fail to find the optimal alignments of reads. Alignment accuracy is typically measured through simulated reads; however, the simulated location may not be the (only) location with the optimal alignment score. RESULTS: Vargas implements a heuristic-free algorithm guaranteed to find the highest-scoring alignment for real sequencing reads to a linear or graph genome. With semiglobal and local alignment modes and affine gap and quality-scaled mismatch penalties, it can implement the scoring functions of commonly used aligners to calculate optimal alignments. While this is computationally intensive, Vargas uses multi-core parallelization and vectorized (SIMD) instructions to make it practical to optimally align large numbers of reads, achieving a maximum speed of 456 billion cell updates per second. We demonstrate how these 'gold standard' Vargas alignments can be used to improve heuristic alignment accuracy by optimizing command-line parameters in Bowtie 2, BWA-MEM and vg to align more reads correctly. AVAILABILITY AND IMPLEMENTATION: Source code implemented in C++ and compiled binary releases are available at https://github.com/langmead-lab/vargas under the MIT license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
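
What "heuristic-free" means can be seen in a full dynamic-programming scan, where every cell of the read-by-reference matrix is updated. The sketch below is not Vargas: it assumes a linear reference only, local alignment mode, and the textbook Smith-Waterman recurrence with Gotoh's affine gaps, with no SIMD, no graph genome and no quality-scaled penalties. The function name and the scoring values are illustrative assumptions.

```python
def local_affine_score(read, ref, match=2, mismatch=-6, gap_open=-5, gap_extend=-3):
    """Best local alignment score; a gap of length L costs gap_open + L * gap_extend."""
    neg_inf = float("-inf")
    cols = len(ref) + 1
    h_prev = [0] * cols        # best score of an alignment ending at (i-1, j)
    f_prev = [neg_inf] * cols  # same, but ending with a gap that consumes the read
    best = 0
    for i in range(1, len(read) + 1):
        h_cur = [0] * cols
        f_cur = [neg_inf] * cols
        e = neg_inf            # ending with a gap that consumes the reference
        for j in range(1, cols):
            e = max(h_cur[j - 1] + gap_open + gap_extend, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_open + gap_extend, f_prev[j] + gap_extend)
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            h_cur[j] = max(0, h_prev[j - 1] + sub, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


print(local_affine_score("ACGT", "TTACGTTT"))  # prints: 8
```

Each inner-loop iteration is one "cell update" in the sense used by the abstract; the cost grows with the product of read and reference lengths, which is why vectorization and multi-core parallelism matter.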


Subject(s)
Heuristics; High-Throughput Nucleotide Sequencing; Algorithms; Genomics; Sequence Analysis, DNA; Software
13.
Nucleic Acids Res ; 47(19): e117, 2019 11 04.
Article in English | MEDLINE | ID: mdl-31392989

ABSTRACT

In the study of DNA methylation, genetic variation between species, strains or individuals can result in CpG sites that are exclusive to a subset of samples, and insertions and deletions can rearrange the spatial distribution of CpGs. How to account for this variation in an analysis of the interplay between sequence variation and DNA methylation is not well understood, especially when the number of CpG differences between samples is large. Here, we use whole-genome bisulfite sequencing data on two highly divergent mouse strains to study this problem. We show that alignment to personal genomes is necessary for valid methylation quantification. We introduce a method for including strain-specific CpGs in differential analysis, and show that this increases power. We apply our method to a human normal-cancer dataset, and show this improves accuracy and power, illustrating the broad applicability of our approach. Our method uses smoothing to impute methylation levels at strain-specific sites, thereby allowing strain-specific CpGs to contribute to the analysis, while accounting for differences in the spatial occurrences of CpGs. Our results have implications for joint analysis of genetic variation and DNA methylation using bisulfite-converted DNA, and unlock the use of personal genomes for addressing this question.


Subject(s)
Genetic Variation/genetics; High-Throughput Nucleotide Sequencing/methods; Whole Genome Sequencing/methods; Animals; CpG Islands/genetics; DNA Methylation/genetics; Epigenesis, Genetic; Genome, Human/genetics; Genotype; Humans; Mice; Sequence Analysis, DNA
14.
Bioinformatics ; 35(3): 421-432, 2019 02 01.
Article in English | MEDLINE | ID: mdl-30020410

ABSTRACT

Motivation: General-purpose processors can now contain many dozens of processor cores and support hundreds of simultaneous threads of execution. To make best use of these threads, genomics software must contend with new and subtle computer architecture issues. We discuss some of these and propose methods for improving thread scaling in tools that analyze each read independently, such as read aligners. Results: We implement these methods in new versions of Bowtie, Bowtie 2 and HISAT. We greatly improve thread scaling in many scenarios, including on the recent Intel Xeon Phi architecture. We also highlight how bottlenecks are exacerbated by variable-record-length file formats like FASTQ and suggest changes that enable superior scaling. Availability and implementation: Experiments for this study: https://github.com/BenLangmead/bowtie-scaling. Bowtie: http://bowtie-bio.sourceforge.net. Bowtie 2: http://bowtie-bio.sourceforge.net/bowtie2. HISAT: http://www.ccb.jhu.edu/software/hisat. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms; Genomics; Software; Computer Systems
15.
Proteomics ; 19(15): e1800315, 2019 08.
Article in English | MEDLINE | ID: mdl-30983154

ABSTRACT

Understanding the molecular profile of every human cell type is essential for understanding its role in normal physiology and disease. Technological advancements in DNA sequencing, mass spectrometry, and computational methods allow us to carry out multiomics analyses, although such approaches are not yet routine. Human umbilical vein endothelial cells (HUVECs) are a widely used model system to study pathological and physiological processes associated with the cardiovascular system. In this study, next-generation sequencing and high-resolution mass spectrometry are employed to profile the transcriptome and proteome of primary HUVECs. Analysis of 145 million paired-end reads from next-generation sequencing confirmed expression of 12 186 protein-coding genes (FPKM ≥0.1), 439 novel long non-coding RNAs, and revealed 6089 novel isoforms that were not annotated in GENCODE. Proteomics analysis identifies 6477 proteins including confirmation of N-termini for 1091 proteins, isoforms for 149 proteins, and 1034 phosphosites. A database search to specifically identify other post-translational modifications provides evidence for a number of modification sites on 117 proteins which include ubiquitylation, lysine acetylation, and mono-, di- and tri-methylation events. Evidence for 11 "missing proteins," which are proteins for which there was insufficient or no protein level evidence, is provided. Peptides supporting missing protein and novel events are validated by comparison of MS/MS fragmentation patterns with synthetic peptides. Finally, 245 variant peptides derived from 207 expressed proteins in addition to alternate translational start sites for seven proteins and evidence for novel proteoforms for five proteins resulting from alternative splicing are identified. Overall, it is believed that the integrated approach employed in this study is widely applicable to study any primary cell type for deeper molecular characterization.


Subject(s)
Proteomics/methods; Transcriptome/genetics; Alternative Splicing/genetics; Human Umbilical Vein Endothelial Cells; Humans
16.
Bioinformatics ; 34(1): 114-116, 2018 01 01.
Article in English | MEDLINE | ID: mdl-28968689

ABSTRACT

Motivation: As more and larger genomics studies appear, there is a growing need for comprehensive and queryable cross-study summaries. These enable researchers to leverage vast datasets that would otherwise be difficult to obtain. Results: Snaptron is a search engine for summarized RNA sequencing data with a query planner that leverages R-tree, B-tree and inverted indexing strategies to rapidly execute queries over 146 million exon-exon splice junctions from over 70 000 human RNA-seq samples. Queries can be tailored by constraining which junctions and samples to consider. Snaptron can score junctions according to tissue specificity or other criteria, and can score samples according to the relative frequency of different splicing patterns. We describe the software and outline biological questions that can be explored with Snaptron queries. Availability and implementation: Documentation is at http://snaptron.cs.jhu.edu. Source code is at https://github.com/ChristopherWilks/snaptron and https://github.com/ChristopherWilks/snaptron-experiments with a CC BY-NC 4.0 license. Contact: chris.wilks@jhu.edu or langmea@cs.jhu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
High-Throughput Nucleotide Sequencing/methods; RNA Splicing; Sequence Analysis, RNA/methods; Software; Exons; Humans
17.
Nucleic Acids Res ; 45(2): e9, 2017 01 25.
Article in English | MEDLINE | ID: mdl-27694310

ABSTRACT

Differential expression analysis of RNA sequencing (RNA-seq) data typically relies on reconstructing transcripts or counting reads that overlap known gene structures. We previously introduced an intermediate statistical approach called differentially expressed region (DER) finder that seeks to identify contiguous regions of the genome showing differential expression signal at single base resolution without relying on existing annotation or potentially inaccurate transcript assembly. We present the derfinder software that improves our annotation-agnostic approach to RNA-seq analysis by: (i) implementing a computationally efficient bump-hunting approach to identify DERs that permits genome-scale analyses in a large number of samples, (ii) introducing a flexible statistical modeling framework, including multi-group and time-course analyses and (iii) introducing a new set of data visualizations for expressed region analysis. We apply this approach to public RNA-seq data from the Genotype-Tissue Expression (GTEx) project and BrainSpan project to show that derfinder permits the analysis of hundreds of samples at base resolution in R, identifies expression outside of known gene boundaries and can be used to visualize expressed regions at base-resolution. In simulations, our base resolution approaches enable discovery in the presence of incomplete annotation and are nearly as powerful as feature-level methods when the annotation is complete. derfinder analysis using expressed region-level and single base-level approaches provides a compromise between full transcript reconstruction and feature-level analysis. The package is available from Bioconductor at www.bioconductor.org/packages/derfinder.


Subject(s)
Gene Expression Profiling/methods; Software; Gene Expression Regulation; Genomics/methods; High-Throughput Nucleotide Sequencing; Molecular Sequence Annotation; Organ Specificity/genetics; Transcriptome; Web Browser
18.
Hum Mol Genet ; 25(22): 4962-4982, 2016 11 15.
Article in English | MEDLINE | ID: mdl-28171598

ABSTRACT

We performed a thorough characterization of expressed repetitive element loci (RE) in the human orbitofrontal cortex (OFC) using directional RNA sequencing data. Considering only sequencing reads that map uniquely onto the human genome, we discovered that the overwhelming majority of intronic and exonic RE are expressed in the same orientation as the gene in which they reside. Our mapping approach enabled the identification of novel differentially expressed RE transcripts between the OFC and peripheral blood lymphocytes. Further analysis revealed that RE are extensively spliced into coding regions of gene transcripts yielding thousands of novel mRNA variants with altered coding potential. Lower frequency splicing of RE into untranslated regions of gene transcripts was also observed. The same pattern of RE splicing in the brain was also detected for Drosophila, zebrafish, mouse, rat, dog and rabbit. RE splicing occurs largely at canonical GT-AG splice junctions with LINE and SINE elements forming the most RE splice junctions in the human OFC. This type of splicing usually gives rise to a minor splice variant of the endogenous gene and in silico analysis suggests that RE splicing has the potential to introduce novel open reading frames. Reanalysis of previously published sequencing data performed in the mouse cerebellum revealed that thousands of RE splice variants are associated with translating ribosomes. Our results demonstrate that RE expression is more complex than previously envisioned and raise the possibility that RE splicing might generate functional protein isoforms.


Subject(s)
Interspersed Repetitive Sequences/genetics; RNA Splice Sites/genetics; RNA Splicing/genetics; Alternative Splicing/genetics; Animals; Base Sequence; Brain/metabolism; DNA/genetics; Exons; Gene Expression Profiling/methods; Genome/genetics; Humans; Introns; Open Reading Frames/genetics; Prefrontal Cortex/metabolism; Protein Isoforms/genetics; RNA, Messenger/genetics; Repetitive Sequences, Nucleic Acid/genetics; Sequence Analysis, RNA; Untranslated Regions/genetics
19.
Annu Rev Genomics Hum Genet ; 16: 133-51, 2015.
Article in English | MEDLINE | ID: mdl-25939052

ABSTRACT

High-throughput DNA sequencing has considerably changed the possibilities for conducting biomedical research by measuring billions of short DNA or RNA fragments. A central computational problem, and for many applications a first step, consists of determining where the fragments came from in the original genome. In this article, we review the main techniques for generating the fragments, the main applications, and the main algorithmic ideas for computing a solution to the read alignment problem. In addition, we describe pitfalls and difficulties connected to determining the correct positions of reads.


Subject(s)
Algorithms; High-Throughput Nucleotide Sequencing/methods; Sequence Alignment/methods; Sequence Analysis, DNA/methods; Genome; Polyploidy; Repetitive Sequences, Nucleic Acid; Software
20.
Nat Methods ; 12(4): 357-60, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25751142

ABSTRACT

HISAT (hierarchical indexing for spliced alignment of transcripts) is a highly efficient system for aligning reads from RNA sequencing experiments. HISAT uses an indexing scheme based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index, employing two types of indexes for alignment: a whole-genome FM index to anchor each alignment and numerous local FM indexes for very rapid extensions of these alignments. HISAT's hierarchical index for the human genome contains 48,000 local FM indexes, each representing a genomic region of ∼64,000 bp. Tests on real and simulated data sets showed that HISAT is the fastest system currently available, with equal or better accuracy than any other method. Despite its large number of indexes, HISAT requires only 4.3 gigabytes of memory. HISAT supports genomes of any size, including those larger than 4 billion bases.
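
The FM index that underlies this scheme supports exact matching by backward search. The toy sketch below shows that search on a single small index; it is not HISAT's hierarchical arrangement of one global and many local indexes, and it handles neither spliced alignment nor mismatches. The function names are hypothetical, and the naive suffix sorting and uncompressed occurrence table are simplifications that would not scale to a genome.

```python
def build_fm_index(text):
    """Build a toy FM index: first-column offsets and an occurrence table."""
    s = text + "$"
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in suffix_array)  # s[-1] is '$' when i == 0
    alphabet = sorted(set(s))
    # first[c]: number of characters in s that are strictly smaller than c
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += s.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first, occ, len(s)


def count_occurrences(pattern, index):
    """Count exact occurrences of pattern by backward search."""
    first, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array rows
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("GATTACAGATTACA")
print(count_occurrences("GATTACA", index))  # prints: 2
```

Each pattern character narrows the range with two table lookups, so query time depends on the pattern length rather than on the size of the indexed text.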


Subject(s)
Sequence Alignment/methods; Sequence Analysis, DNA/instrumentation; Humans; Limit of Detection; Pseudogenes/genetics; Sequence Analysis, RNA