Search | BVS Integralidade em Saúde

1.

AnchorWave: Sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism, and whole-genome duplication.

Song, Baoxing; Marco-Sola, Santiago; Moreto, Miquel; Johnson, Lynn; Buckler, Edward S; Stitzer, Michelle C.

Proc Natl Acad Sci U S A ; 119(1)2022 01 04.

Article in English | MEDLINE | ID: mdl-34934012

ABSTRACT

Millions of species are currently being sequenced, and their genomes are being compared. Many of them have more complex genomes than model systems and raise novel challenges for genome alignment. Widely used local alignment strategies often produce limited or incongruous results when applied to genomes with dispersed repeats, long indels, and highly diverse sequences. Moreover, alignment using many-to-many or reciprocal best hit approaches conflicts with well-studied patterns between species with different rounds of whole-genome duplication. Here, we introduce Anchored Wavefront alignment (AnchorWave), which performs whole-genome duplication-informed collinear anchor identification between genomes and performs base pair-resolved global alignment for collinear blocks using a two-piece affine gap cost strategy. This strategy enables AnchorWave to precisely identify multikilobase indels generated by transposable element (TE) presence/absence variants (PAVs). When aligning two maize genomes, AnchorWave successfully recalled 87% of previously reported TE PAVs. By contrast, other genome alignment tools showed low power for TE PAV recall. AnchorWave precisely aligns up to three times more of the genome as position matches or indels than the closest competitive approach when comparing diverse genomes. Moreover, AnchorWave recalls transcription factor-binding sites at a rate of 1.05- to 74.85-fold higher than other tools with significantly lower false-positive alignments. AnchorWave complements available genome alignment tools by showing obvious improvement when applied to genomes with dispersed repeats, active TEs, high sequence diversity, and whole-genome duplication variation.

Subjects

Genome, Plant; Polymorphism, Genetic; Sequence Alignment; Software; Zea mays/genetics
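
The two-piece affine gap cost mentioned in the AnchorWave abstract above can be illustrated with a short sketch. Under such a scheme a gap is charged the cheaper of two affine functions, so short gaps pay a low opening cost while multikilobase gaps (such as TE presence/absence variants) grow at the smaller extension rate. The function name and the penalty values below are illustrative assumptions only; they are not taken from the abstract and are not claimed to be AnchorWave's settings.

```python
def two_piece_gap_cost(length, open1=4, extend1=2, open2=24, extend2=1):
    """Cost of a gap of `length` bases under a two-piece affine model.

    All four penalty values are hypothetical defaults chosen for illustration.
    Piece 1 (cheap to open, costly to extend) wins for short gaps;
    piece 2 (costly to open, cheap to extend) wins for long gaps.
    """
    if length <= 0:
        return 0
    short_piece = open1 + extend1 * length
    long_piece = open2 + extend2 * length
    return min(short_piece, long_piece)


if __name__ == "__main__":
    for gap in (1, 10, 20, 5000):
        print(gap, two_piece_gap_cost(gap))
```

With these example values the two pieces cross at a gap length of 20, after which every additional gap base costs 1 instead of 2.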

2.

Optimal gap-affine alignment in O(s) space.

Marco-Sola, Santiago; Eizenga, Jordan M; Guarracino, Andrea; Paten, Benedict; Garrison, Erik; Moreto, Miquel.

Bioinformatics ; 39(2)2023 02 03.

Article in English | MEDLINE | ID: mdl-36749013

ABSTRACT

MOTIVATION: Pairwise sequence alignment remains a fundamental problem in computational biology and bioinformatics. Recent advances in genomics and sequencing technologies demand faster and scalable algorithms that can cope with the ever-increasing sequence lengths. Classical pairwise alignment algorithms based on dynamic programming are strongly limited by quadratic requirements in time and memory. The recently proposed wavefront alignment algorithm (WFA) introduced an efficient algorithm to perform exact gap-affine alignment in O(ns) time, where s is the optimal score and n is the sequence length. Notwithstanding these bounds, WFA's O(s^2) memory requirements become computationally impractical for genome-scale alignments, leading to a need for further improvement. RESULTS: In this article, we present the bidirectional WFA algorithm, the first gap-affine algorithm capable of computing optimal alignments in O(s) memory while retaining WFA's time complexity of O(ns). As a result, this work improves the lowest known memory bound O(n) to compute gap-affine alignments. In practice, our implementation never requires more than a few hundred MBs aligning noisy Oxford Nanopore Technologies reads up to 1 Mbp long while maintaining competitive execution times. AVAILABILITY AND IMPLEMENTATION: All code is publicly available at https://github.com/smarco/BiWFA-paper. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms; Genomics; Computational Biology; Genome; Sequence Analysis, DNA; Software

3.

WFA-GPU: gap-affine pairwise read-alignment using GPUs.

Aguado-Puig, Quim; Doblas, Max; Matzoros, Christos; Espinosa, Antonio; Moure, Juan Carlos; Marco-Sola, Santiago; Moreto, Miquel.

Bioinformatics ; 39(12)2023 12 01.

Article in English | MEDLINE | ID: mdl-37975878

ABSTRACT

MOTIVATION: Advances in genomics and sequencing technologies demand faster and more scalable analysis methods that can process longer sequences with higher accuracy. However, classical pairwise alignment methods, based on dynamic programming (DP), impose impractical computational requirements to align long and noisy sequences like those produced by PacBio and Nanopore technologies. The recently proposed wavefront alignment (WFA) algorithm paves the way for more efficient alignment tools, improving time and memory complexity over previous methods. However, high-performance computing (HPC) platforms require efficient parallel algorithms and tools to exploit the computing resources available on modern accelerator-based architectures. RESULTS: This paper presents WFA-GPU, a GPU (graphics processing unit)-accelerated tool to compute exact gap-affine alignments based on the WFA algorithm. We present the algorithmic adaptations and performance optimizations that allow exploiting the massively parallel capabilities of modern GPU devices to accelerate the alignment computations. In particular, we propose a CPU-GPU co-design capable of performing inter-sequence and intra-sequence parallel sequence alignment, combining a succinct WFA-data representation with an efficient GPU implementation. As a result, we demonstrate that our implementation outperforms the original multi-threaded WFA implementation by up to 4.3× and up to 18.2× when using heuristic methods on long and noisy sequences. Compared to other state-of-the-art tools and libraries, the WFA-GPU is up to 29× faster than other GPU implementations and up to four orders of magnitude faster than other CPU implementations. Furthermore, WFA-GPU is the only GPU solution capable of correctly aligning long reads using a commodity GPU. AVAILABILITY AND IMPLEMENTATION: WFA-GPU code and documentation are publicly available at https://github.com/quim0/WFA-GPU.

Subjects

Algorithms; Software; Sequence Analysis; Computing Methodologies; Genomics

4.

Fast gap-affine pairwise alignment using the wavefront algorithm.

Marco-Sola, Santiago; Moure, Juan Carlos; Moreto, Miquel; Espinosa, Antonio.

Bioinformatics ; 37(4): 456-463, 2021 05 01.

Article in English | MEDLINE | ID: mdl-32915952

ABSTRACT

MOTIVATION: Pairwise alignment of sequences is a fundamental method in modern molecular biology, implemented within multiple bioinformatics tools and libraries. Current advances in sequencing technologies press for the development of faster pairwise alignment algorithms that can scale with increasing read lengths and production yields. RESULTS: In this article, we present the wavefront alignment algorithm (WFA), an exact gap-affine algorithm that takes advantage of homologous regions between the sequences to accelerate the alignment process. As opposed to traditional dynamic programming algorithms that run in quadratic time, the WFA runs in time O(ns), proportional to the read length n and the alignment score s, using O(s^2) memory. Furthermore, our algorithm exhibits simple data dependencies that can be easily vectorized, even by the automatic features of modern compilers, for different architectures, without the need to adapt the code. We evaluate the performance of our algorithm, together with other state-of-the-art implementations. As a result, we demonstrate that the WFA runs 20-300× faster than other methods aligning short Illumina-like sequences, and 10-100× faster using long noisy reads like those produced by Oxford Nanopore Technologies. AVAILABILITY AND IMPLEMENTATION: The WFA algorithm is implemented within the wavefront-aligner library, and it is publicly available at https://github.com/smarco/WFA.

Subjects

Algorithms; Software; Computational Biology; High-Throughput Nucleotide Sequencing; Sequence Analysis, DNA
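
To give a feel for the wavefront idea described in the WFA abstract above, here is a minimal sketch that is deliberately simplified to unit-cost edit distance rather than gap-affine penalties. For each score it keeps only the furthest-reaching offset on every diagonal and extends runs of matches for free, which is why the work grows with the score rather than with the full dynamic programming matrix. The function and variable names are hypothetical, and this is not the API of the wavefront-aligner library.

```python
def wavefront_edit_distance(pattern, text):
    """Unit-cost edit distance computed with furthest-reaching wavefronts.

    Simplified illustration: real WFA uses gap-affine penalties and keeps
    separate wavefronts for matches, insertions and deletions.
    Diagonal k = h - v, where h indexes `text` and v indexes `pattern`.
    """
    n, m = len(pattern), len(text)
    target_k = m - n  # diagonal that contains the end cell (v=n, h=m)

    def extend(k, h):
        # Slide along diagonal k while characters match (zero cost).
        v = h - k
        while v < n and h < m and pattern[v] == text[h]:
            v += 1
            h += 1
        return h

    wavefront = {0: extend(0, 0)}  # score 0: furthest offset per diagonal
    score = 0
    while wavefront.get(target_k, -1) < m:
        score += 1
        next_wavefront = {}
        for k in range(min(wavefront) - 1, max(wavefront) + 2):
            candidates = []
            if k in wavefront:          # mismatch: h and v both advance
                candidates.append(wavefront[k] + 1)
            if k - 1 in wavefront:      # extra text character: h advances
                candidates.append(wavefront[k - 1] + 1)
            if k + 1 in wavefront:      # extra pattern character: v advances
                candidates.append(wavefront[k + 1])
            valid = [h for h in candidates if 0 <= h <= m and 0 <= h - k <= n]
            if valid:
                next_wavefront[k] = extend(k, max(valid))
        wavefront = next_wavefront
    return score


if __name__ == "__main__":
    print(wavefront_edit_distance("GATTACA", "GATCACA"))
```

Only the previous wavefront is kept here because the sketch returns the score alone; recovering the alignment itself requires storing or recomputing earlier wavefronts, which is the memory cost the abstracts in this listing discuss.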

5.

gemBS: high throughput processing for DNA methylation data from bisulfite sequencing.

Merkel, Angelika; Fernández-Callejo, Marcos; Casals, Eloi; Marco-Sola, Santiago; Schuyler, Ronald; Gut, Ivo G; Heath, Simon C.

Bioinformatics ; 35(5): 737-742, 2019 03 01.

Article in English | MEDLINE | ID: mdl-30137223

ABSTRACT

MOTIVATION: DNA methylation is essential for normal embryogenesis and development in mammals and can be captured at single base pair resolution by whole genome bisulfite sequencing (WGBS). Current available analysis tools are becoming rapidly outdated as they lack sensible functionality and efficiency to handle large amounts of data now commonly created. RESULTS: We developed gemBS, a fast high-throughput bioinformatics pipeline specifically designed for large scale BS-Seq analysis that combines a high performance BS-mapper (GEM3) and a variant caller specifically for BS-Seq data (BScall). gemBS provides genotype information and methylation estimates for all genomic cytosines in different contexts (CpG and non-CpG) and a set of quality reports for comprehensive and reproducible analysis. gemBS is highly modular and can be easily automated, while producing robust and accurate results. AVAILABILITY AND IMPLEMENTATION: gemBS is released under the GNU GPLv3+ license. Source code and documentation are freely available from www.statgen.cat/gemBS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

DNA Methylation; High-Throughput Nucleotide Sequencing; Animals; Sequence Analysis, DNA; Software; Sulfites

6.

ChimPipe: accurate detection of fusion genes and transcription-induced chimeras from RNA-seq data.

Rodríguez-Martín, Bernardo; Palumbo, Emilio; Marco-Sola, Santiago; Griebel, Thasso; Ribeca, Paolo; Alonso, Graciela; Rastrojo, Alberto; Aguado, Begoña; Guigó, Roderic; Djebali, Sarah.

BMC Genomics ; 18(1): 7, 2017 01 03.

Article in English | MEDLINE | ID: mdl-28049418

ABSTRACT

BACKGROUND: Chimeric transcripts are commonly defined as transcripts linking two or more different genes in the genome, and can be explained by various biological mechanisms such as genomic rearrangement, read-through or trans-splicing, but also by technical or biological artefacts. Several studies have shown their importance in cancer, cell pluripotency and motility. Many programs have recently been developed to identify chimeras from Illumina RNA-seq data (mostly fusion genes in cancer). However outputs of different programs on the same dataset can be widely inconsistent, and tend to include many false positives. Other issues relate to simulated datasets restricted to fusion genes, real datasets with limited numbers of validated cases, result inconsistencies between simulated and real datasets, and gene rather than junction level assessment. RESULTS: Here we present ChimPipe, a modular and easy-to-use method to reliably identify fusion genes and transcription-induced chimeras from paired-end Illumina RNA-seq data. We have also produced realistic simulated datasets for three different read lengths, and enhanced two gold-standard cancer datasets by associating exact junction points to validated gene fusions. Benchmarking ChimPipe together with four other state-of-the-art tools on this data showed ChimPipe to be the top program at identifying exact junction coordinates for both kinds of datasets, and the one showing the best trade-off between sensitivity and precision. Applied to 106 ENCODE human RNA-seq datasets, ChimPipe identified 137 high confidence chimeras connecting the protein coding sequence of their parent genes. In subsequent experiments, three out of four predicted chimeras, two of which recurrently expressed in a large majority of the samples, could be validated. Cloning and sequencing of the three cases revealed several new chimeric transcript structures, 3 of which with the potential to encode a chimeric protein for which we hypothesized a new role. 
Applying ChimPipe to human and mouse ENCODE RNA-seq data led to the identification of 131 recurrent chimeras common to both species, and therefore potentially conserved. CONCLUSIONS: ChimPipe combines discordant paired-end reads and split-reads to detect any kind of chimeras, including those originating from polymerase read-through, and shows an excellent trade-off between sensitivity and precision. The chimeras found by ChimPipe can be validated in-vitro with high accuracy.

Subjects

Oncogene Proteins, Fusion; Recombination, Genetic; Software; Transcription, Genetic; Animals; Computational Biology/methods; Computer Simulation; Genomics/methods; High-Throughput Nucleotide Sequencing; Humans; Mice; Reproducibility of Results; Sequence Analysis, RNA

7.

From Wet-Lab to Variations: Concordance and Speed of Bioinformatics Pipelines for Whole Genome and Whole Exome Sequencing.

Laurie, Steve; Fernandez-Callejo, Marcos; Marco-Sola, Santiago; Trotta, Jean-Remi; Camps, Jordi; Chacón, Alejandro; Espinosa, Antonio; Gut, Marta; Gut, Ivo; Heath, Simon; Beltran, Sergi.

Hum Mutat ; 37(12): 1263-1271, 2016 12.

Article in English | MEDLINE | ID: mdl-27604516

ABSTRACT

As whole genome sequencing becomes cheaper and faster, it will progressively substitute targeted next-generation sequencing as standard practice in research and diagnostics. However, computing cost-performance ratio is not advancing at an equivalent rate. Therefore, it is essential to evaluate the robustness of the variant detection process taking into account the computing resources required. We have benchmarked six combinations of state-of-the-art read aligners (BWA-MEM and GEM3) and variant callers (FreeBayes, GATK HaplotypeCaller, SAMtools) on whole genome and whole exome sequencing data from the NA12878 human sample. Results have been compared between them and against the NIST Genome in a Bottle (GIAB) variants reference dataset. We report differences in speed of up to 20 times in some steps of the process and have observed that SNV, and to a lesser extent InDel, detection is highly consistent in 70% of the genome. SNV, and especially InDel, detection is less reliable in 20% of the genome, and almost unfeasible in the remaining 10%. These findings will aid in choosing the appropriate tools bearing in mind objectives, workload, and computing infrastructure available.

Subjects

Computational Biology/methods; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Exome; Genetic Variation; Genome, Human; Humans; Software

8.

The GEM mapper: fast, accurate and versatile alignment by filtration.

Marco-Sola, Santiago; Sammeth, Michael; Guigó, Roderic; Ribeca, Paolo.

Nat Methods ; 9(12): 1185-8, 2012 Dec.

Article in English | MEDLINE | ID: mdl-23103880

ABSTRACT

Because of ever-increasing throughput requirements of sequencing data, most existing short-read aligners have been designed to focus on speed at the expense of accuracy. The Genome Multitool (GEM) mapper can leverage string matching by filtration to search the alignment space more efficiently, simultaneously delivering precision (performing fully tunable exhaustive searches that return all existing matches, including gapped ones) and speed (being several times faster than comparable state-of-the-art tools).

Subjects

Sequence Alignment/methods; Sequence Analysis, DNA/methods; Algorithms; Base Sequence; Computational Biology/methods; Genome; Genomics/methods; Humans; Software

9.

Building pangenome graphs.

Garrison, Erik; Guarracino, Andrea; Heumos, Simon; Villani, Flavia; Bao, Zhigui; Tattini, Lorenzo; Hagmann, Jörg; Vorbrugg, Sebastian; Marco-Sola, Santiago; Kubica, Christian; Ashbrook, David G; Thorell, Kaisa; Rusholme-Pilcher, Rachel L; Liti, Gianni; Rudbeck, Emilio; Nahnsen, Sven; Yang, Zuyu; Moses, Mwaniki N; Nobrega, Franklin L; Wu, Yi; Chen, Hao; de Ligt, Joep; Sudmant, Peter H; Soranzo, Nicole; Colonna, Vincenza; Williams, Robert W; Prins, Pjotr.

bioRxiv ; 2023 Apr 06.

Article in English | MEDLINE | ID: mdl-37066137

ABSTRACT

Pangenome graphs can represent all variation between multiple genomes, but existing methods for constructing them are biased due to reference-guided approaches. In response, we have developed PanGenome Graph Builder (PGGB), a reference-free pipeline for constructing unbiased pangenome graphs. PGGB uses all-to-all whole-genome alignments and learned graph embeddings to build and iteratively refine a model in which we can identify variation, measure conservation, detect recombination events, and infer phylogenetic relationships.

10.

Efficient Alignment of Illumina-Like High-Throughput Sequencing Reads with the GEnomic Multi-tool (GEM) Mapper.

Marco-Sola, Santiago; Ribeca, Paolo.

Curr Protoc Bioinformatics ; 50: 11.13.1-11.13.20, 2015 Jun 19.

Article in English | MEDLINE | ID: mdl-26094690

ABSTRACT

Modern Illumina-like high-throughput sequencing machines allow the cheap decoding of great amounts of DNA. The GEnomic Multi-tool (GEM) mapper is one of the fastest and most sensitive methods known to date to align such data to a known genomic reference. This unit explains how to use it effectively.

Subjects

Genomics; High-Throughput Nucleotide Sequencing; Software; Sequence Alignment

11.

Boosting the FM-Index on the GPU: Effective Techniques to Mitigate Random Memory Access.

Chacón, Alejandro; Marco-Sola, Santiago; Espinosa, Antonio; Ribeca, Paolo; Moure, Juan Carlos.

IEEE/ACM Trans Comput Biol Bioinform ; 12(5): 1048-59, 2015.

Article in English | MEDLINE | ID: mdl-26451818

ABSTRACT

The recent advent of high-throughput sequencing machines producing big amounts of short reads has boosted the interest in efficient string searching techniques. As of today, many mainstream sequence alignment software tools rely on a special data structure, called the FM-index, which allows for fast exact searches in large genomic references. However, such searches translate into a pseudo-random memory access pattern, thus making memory access the limiting factor of all computation-efficient implementations, both on CPUs and GPUs. Here, we show that several strategies can be put in place to remove the memory bottleneck on the GPU: more compact indexes can be implemented by having more threads work cooperatively on larger memory blocks, and a k-step FM-index can be used to further reduce the number of memory accesses. The combination of those and other optimisations yields an implementation that is able to process about two Gbases of queries per second on our test platform, being about 8× faster than a comparable multi-core CPU version, and about 3× to 5× faster than the FM-index implementation on the GPU provided by the recently announced Nvidia NVBIO bioinformatics library.

Subjects

Algorithms; Computer Graphics/instrumentation; Computer Storage Devices; High-Throughput Nucleotide Sequencing/instrumentation; Information Storage and Retrieval/methods; Signal Processing, Computer-Assisted/instrumentation; Equipment Design; Equipment Failure Analysis
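
The exact search on an FM-index that the abstract above refers to is usually done by backward search. The sketch below is a naive CPU-only illustration, assuming a small text that does not contain the sentinel character `$`. It builds the index with a quadratic suffix sort and a full occurrence table, which real implementations replace with sampled, compressed structures. All names are hypothetical. Each pattern character triggers two occurrence-table lookups at unrelated positions, which is the pseudo-random access pattern the abstract identifies as the bottleneck.

```python
def build_fm_index(text):
    """Build a toy FM-index (C table plus full Occ table) for `text`.

    Assumes `text` does not contain '$', which is appended as the sentinel.
    """
    text += "$"
    # Naive suffix array; fine for a demonstration, too slow for genomes.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # counts[c]: number of characters in the text strictly smaller than c.
    counts, total = {}, 0
    for c in alphabet:
        counts[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return counts, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count exact occurrences of `pattern` using backward search."""
    counts, occ, size = index
    lo, hi = 0, size  # half-open interval of matching suffix-array rows
    for c in reversed(pattern):
        if c not in counts:
            return 0
        # Two Occ lookups per character, at positions lo and hi.
        lo = counts[c] + occ[c][lo]
        hi = counts[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = build_fm_index("GATTACAGATTACA")
    print(count_occurrences(index, "TTA"))
```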

12.

Fast computation and applications of genome mappability.

Derrien, Thomas; Estellé, Jordi; Marco Sola, Santiago; Knowles, David G; Raineri, Emanuele; Guigó, Roderic; Ribeca, Paolo.

PLoS One ; 7(1): e30377, 2012.

Article in English | MEDLINE | ID: mdl-22276185

ABSTRACT

We present a fast mapping-based algorithm to compute the mappability of each region of a reference genome up to a specified number of mismatches. Knowing the mappability of a genome is crucial for the interpretation of massively parallel sequencing experiments. We investigate the properties of the mappability of eukaryotic DNA/RNA both as a whole and at the level of the gene family, providing for various organisms tracks which allow the mappability information to be visually explored. In addition, we show that mappability varies greatly between species and gene classes. Finally, we suggest several practical applications where mappability can be used to refine the analysis of high-throughput sequencing data (SNP calling, gene expression quantification and paired-end experiments). This work highlights mappability as an important concept which deserves to be taken into full account, in particular when massively parallel sequencing technologies are employed. The GEM mappability program belongs to the GEM (GEnome Multitool) suite of programs, which can be freely downloaded for any use from its website (http://gemlibrary.sourceforge.net).

Subjects

Algorithms; Computational Biology/methods; Genome, Human/genetics; Sequence Analysis, DNA/methods; High-Throughput Nucleotide Sequencing/methods; Humans
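
As a rough illustration of the mappability concept in the abstract above, the sketch below scores each k-mer start position as the reciprocal of how often that k-mer occurs in the sequence, so unique positions score 1.0 and repeated ones score lower. This is a simplifying assumption for illustration: it counts exact matches only, on a single strand, with brute-force counting, whereas the abstract describes a mapping-based algorithm that allows a specified number of mismatches. The function name is hypothetical and unrelated to the GEM programs.

```python
from collections import Counter


def kmer_mappability(genome, k):
    """Return one score per k-mer start position: 1 / (exact k-mer frequency).

    Toy definition: exact matches, forward strand only, no mismatches.
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    frequency = Counter(kmers)
    return [1.0 / frequency[kmer] for kmer in kmers]


if __name__ == "__main__":
    # 'AA' occurs three times, 'AC' once.
    print(kmer_mappability("AAAAC", 2))
```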
