Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 40
Filtrar
Más filtros

Banco de datos
País/Región como asunto
Tipo del documento
Intervalo de año de publicación
1.
Nat Methods ; 20(4): 550-558, 2023 04.
Artículo en Inglés | MEDLINE | ID: mdl-36550274

RESUMEN

Structural variants (SVs) account for a large amount of sequence variability across genomes and play an important role in human genomics and precision medicine. Despite intense efforts over the years, the discovery of SVs in individuals remains challenging due to the diploid and highly repetitive structure of the human genome, and by the presence of SVs that vastly exceed sequencing read lengths. However, the recent introduction of low-error long-read sequencing technologies such as PacBio HiFi may finally enable these barriers to be overcome. Here we present SV discovery with sample-specific strings (SVDSS)-a method for discovery of SVs from long-read sequencing technologies (for example, PacBio HiFi) that combines and effectively leverages mapping-free, mapping-based and assembly-based methodologies for overall superior SV discovery performance. Our experiments on several human samples show that SVDSS outperforms state-of-the-art mapping-based methods for discovery of insertion and deletion SVs in PacBio HiFi reads and achieves notable improvements in calling SVs in repetitive regions of the genome.


Asunto(s)
Genómica , Secuenciación de Nucleótidos de Alto Rendimiento , Humanos , Análisis de Secuencia de ADN/métodos , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Genómica/métodos , Genoma Humano , Secuencias Repetitivas de Ácidos Nucleicos
2.
Bioinformatics ; 40(5)2024 05 02.
Artículo en Inglés | MEDLINE | ID: mdl-38676570

RESUMEN

MOTIVATION: Bacterial genomes present more variability than human genomes, which requires important adjustments in computational tools that are developed for human data. In particular, bacteria exhibit a mosaic structure due to homologous recombinations, but this fact is not sufficiently captured by standard read mappers that align against linear reference genomes. The recent introduction of pangenomics provides some insights in that context, as a pangenome graph can represent the variability within a species. However, the concept of sequence-to-graph alignment that captures the presence of recombinations has not been previously investigated. RESULTS: In this paper, we present the extension of the notion of sequence-to-graph alignment to a variation graph that incorporates a recombination, so that the latter are explicitly represented and evaluated in an alignment. Moreover, we present a dynamic programming approach for the special case where there is at most a recombination-we implement this case as RecGraph. From a modelling point of view, a recombination corresponds to identifying a new path of the variation graph, where the new arc is composed of two halves, each extracted from an original path, possibly joined by a new arc. Our experiments show that RecGraph accurately aligns simulated recombinant bacterial sequences that have at most a recombination, providing evidence for the presence of recombination events. AVAILABILITY AND IMPLEMENTATION: Our implementation is open source and available at https://github.com/AlgoLab/RecGraph.


Asunto(s)
Algoritmos , Genoma Bacteriano , Recombinación Genética , Alineación de Secuencia , Alineación de Secuencia/métodos , Humanos , Programas Informáticos , Análisis de Secuencia de ADN/métodos , Genómica/métodos
3.
Bioinformatics ; 39(9)2023 09 02.
Artículo en Inglés | MEDLINE | ID: mdl-37688560

RESUMEN

MOTIVATION: The Positional Burrows-Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory. RESULTS: In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal matches (SMEMs) queries in haplotype sequences. We implement our method, which we refer to as µ-PBWT, and evaluate it on datasets of 1000 Genome Project and UK Biobank data. Our experiments demonstrate that the µ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, µ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. µ-PBWT is an adaptation of techniques for the run-length compressed BWT for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel. AVAILABILITY AND IMPLEMENTATION: Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.


Asunto(s)
Bancos de Muestras Biológicas , Haplotipos , Secuenciación Completa del Genoma , Reino Unido
4.
BMC Bioinformatics ; 22(Suppl 15): 625, 2022 Apr 19.
Artículo en Inglés | MEDLINE | ID: mdl-35439933

RESUMEN

BACKGROUND: Being able to efficiently call variants from the increasing amount of sequencing data daily produced from multiple viral strains is of the utmost importance, as demonstrated during the COVID-19 pandemic, in order to track the spread of the viral strains across the globe. RESULTS: We present MALVIRUS, an easy-to-install and easy-to-use application that assists users in multiple tasks required for the analysis of a viral population, such as the SARS-CoV-2. MALVIRUS allows to: (1) construct a variant catalog consisting in a set of variations (SNPs/indels) from the population sequences, (2) efficiently genotype and annotate variants of the catalog supported by a read sample, and (3) when the considered viral species is the SARS-CoV-2, assign the input sample to the most likely Pango lineages using the genotyped variations. CONCLUSIONS: Tests on Illumina and Nanopore samples proved the efficiency and the effectiveness of MALVIRUS in analyzing SARS-CoV-2 strain samples with respect to publicly available data provided by NCBI and the more complete dataset provided by GISAID. A comparison with state-of-the-art tools showed that MALVIRUS is always more precise and often have a better recall.


Asunto(s)
COVID-19 , Genoma Viral , Secuenciación de Nucleótidos de Alto Rendimiento , Humanos , Mutación , Pandemias , Filogenia , SARS-CoV-2/genética
5.
Bioinformatics ; 37(2): 178-184, 2021 04 19.
Artículo en Inglés | MEDLINE | ID: mdl-32730595

RESUMEN

MOTIVATION: The latest advances in cancer sequencing, and the availability of a wide range of methods to infer the evolutionary history of tumors, have made it important to evaluate, reconcile and cluster different tumor phylogenies. Recently, several notions of distance or similarities have been proposed in the literature, but none of them has emerged as the golden standard. Moreover, none of the known similarity measures is able to manage mutations occurring multiple times in the tree, a circumstance often occurring in real cases. RESULTS: To overcome these limitations, in this article, we propose MP3, the first similarity measure for tumor phylogenies able to effectively manage cases where multiple mutations can occur at the same time and mutations can occur multiple times. Moreover, a comparison of MP3 with other measures shows that it is able to classify correctly similar and dissimilar trees, both on simulated and on real data. AVAILABILITY AND IMPLEMENTATION: An open source implementation of MP3 is publicly available at https://github.com/AlgoLab/mp3treesim. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Algoritmos , Árboles , Evolución Biológica , Filogenia , Análisis de Secuencia , Programas Informáticos
6.
Bioinformatics ; 37(4): 464-472, 2021 05 01.
Artículo en Inglés | MEDLINE | ID: mdl-32926128

RESUMEN

MOTIVATION: Recent advances in high-throughput RNA-Seq technologies allow to produce massive datasets. When a study focuses only on a handful of genes, most reads are not relevant and degrade the performance of the tools used to analyze the data. Removing irrelevant reads from the input dataset leads to improved efficiency without compromising the results of the study. RESULTS: We introduce a novel computational problem, called gene assignment and we propose an efficient alignment-free approach to solve it. Given an RNA-Seq sample and a panel of genes, a gene assignment consists in extracting from the sample, the reads that most probably were sequenced from those genes. The problem becomes more complicated when the sample exhibits evidence of novel alternative splicing events. We implemented our approach in a tool called Shark and assessed its effectiveness in speeding up differential splicing analysis pipelines. This evaluation shows that Shark is able to significantly improve the performance of RNA-Seq analysis tools without having any impact on the final results. AVAILABILITY AND IMPLEMENTATION: The tool is distributed as a stand-alone module and the software is freely available at https://github.com/AlgoLab/shark. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Tiburones , Empalme Alternativo , Animales , RNA-Seq , Análisis de Secuencia de ARN , Tiburones/genética , Programas Informáticos
7.
Bioinformatics ; 37(3): 326-333, 2021 04 20.
Artículo en Inglés | MEDLINE | ID: mdl-32805010

RESUMEN

MOTIVATION: In recent years, the well-known Infinite Sites Assumption has been a fundamental feature of computational methods devised for reconstructing tumor phylogenies and inferring cancer progressions. However, recent studies leveraging single-cell sequencing (SCS) techniques have shown evidence of the widespread recurrence and, especially, loss of mutations in several tumor samples. While there exist established computational methods that infer phylogenies with mutation losses, there remain some advancements to be made. RESULTS: We present Simulated Annealing Single-Cell inference (SASC): a new and robust approach based on simulated annealing for the inference of cancer progression from SCS datasets. In particular, we introduce an extension of the model of evolution where mutations are only accumulated, by allowing also a limited amount of mutation loss in the evolutionary history of the tumor: the Dollo-k model. We demonstrate that SASC achieves high levels of accuracy when tested on both simulated and real datasets and in comparison with some other available methods. AVAILABILITY AND IMPLEMENTATION: The SASC tool is open source and available at https://github.com/sciccolella/sasc. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Algoritmos , Neoplasias , Análisis de la Célula Individual , Humanos , Mutación , Neoplasias/genética , Filogenia , Análisis de Secuencia , Programas Informáticos
8.
Nat Comput ; 21(1): 81-108, 2022 Mar.
Artículo en Inglés | MEDLINE | ID: mdl-36969737

RESUMEN

Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations-thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.

9.
Bioinformatics ; 36(5): 1622-1624, 2020 03 01.
Artículo en Inglés | MEDLINE | ID: mdl-31589304

RESUMEN

SUMMARY: Retroviruses and their vector derivatives integrate semi-randomly in the genome of host cells and are inherited by their progeny as stable genetic marks. The retrieval and mapping of the sequences flanking the virus-host DNA junctions allows the identification of insertion sites in gene therapy or virally infected patients, essential for monitoring the evolution of genetically modified cells in vivo. However, since ∼30% of insertions land in low complexity or repetitive regions of the host cell genome, they cannot be correctly assigned and are currently discarded, limiting the accuracy and predictive power of clonal tracking studies. Here, we present γ-TRIS, a new graph-based genome-free alignment tool for identifying insertion sites even if embedded in low complexity regions. By using γ-TRIS to reanalyze clinical studies, we observed improvements in clonal quantification and tracking. AVAILABILITY AND IMPLEMENTATION: Source code at https://bitbucket.org/bereste/g-tris. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Genoma , Genómica , Algoritmos , Secuencia de Bases , Humanos , Programas Informáticos
10.
BMC Bioinformatics ; 21(Suppl 1): 413, 2020 Dec 09.
Artículo en Inglés | MEDLINE | ID: mdl-33297943

RESUMEN

BACKGROUND: Cancer progression reconstruction is an important development stemming from the phylogenetics field. In this context, the reconstruction of the phylogeny representing the evolutionary history presents some peculiar aspects that depend on the technology used to obtain the data to analyze: Single Cell DNA Sequencing data have great specificity, but are affected by moderate false negative and missing value rates. Moreover, there has been some recent evidence of back mutations in cancer: this phenomenon is currently widely ignored. RESULTS: We present a new tool, gpps, that reconstructs a tumor phylogeny from Single Cell Sequencing data, allowing each mutation to be lost at most a fixed number of times. The General Parsimony Phylogeny from Single cell (gpps) tool is open source and available at https://github.com/AlgoLab/gpps . CONCLUSIONS: gpps provides new insights to the analysis of intra-tumor heterogeneity by proposing a new progression model to the field of cancer phylogeny reconstruction on Single Cell data.


Asunto(s)
Biología Computacional/métodos , Análisis Mutacional de ADN , Progresión de la Enfermedad , Mutación , Neoplasias/genética , Neoplasias/patología , Secuencia de Bases , Evolución Molecular , Humanos , Filogenia , Análisis de la Célula Individual
11.
BMC Bioinformatics ; 19(1): 252, 2018 07 03.
Artículo en Inglés | MEDLINE | ID: mdl-29970002

RESUMEN

BACKGROUND: Haplotype assembly is the process of assigning the different alleles of the variants covered by mapped sequencing reads to the two haplotypes of the genome of a human individual. Long reads, which are nowadays cheaper to produce and more widely available than ever before, have been used to reduce the fragmentation of the assembled haplotypes since their ability to span several variants along the genome. These long reads are also characterized by a high error rate, an issue which may be mitigated, however, with larger sets of reads, when this error rate is uniform across genome positions. Unfortunately, current state-of-the-art dynamic programming approaches designed for long reads deal only with limited coverages. RESULTS: Here, we propose a new method for assembling haplotypes which combines and extends the features of previous approaches to deal with long reads and higher coverages. In particular, our algorithm is able to dynamically adapt the estimated number of errors at each variant site, while minimizing the total number of error corrections necessary for finding a feasible solution. This allows our method to significantly reduce the required computational resources, allowing to consider datasets composed of higher coverages. The algorithm has been implemented in a freely available tool, HapCHAT: Haplotype Assembly Coverage Handling by Adapting Thresholds. An experimental analysis on sequencing reads with up to 60 × coverage reveals improvements in accuracy and recall achieved by considering a higher coverage with lower runtimes. CONCLUSIONS: Our method leverages the long-range information of sequencing reads that allows to obtain assembled haplotypes fragmented in a lower number of unphased haplotype blocks. At the same time, our method is also able to deal with higher coverages to better correct the errors in the original reads and to obtain more accurate haplotypes as a result. AVAILABILITY: HapCHAT is available at http://hapchat.algolab.eu under the GNU Public License (GPL).


Asunto(s)
Haplotipos/genética , Análisis de Secuencia de ADN/métodos , Algoritmos , Humanos
12.
BMC Bioinformatics ; 19(1): 444, 2018 Nov 20.
Artículo en Inglés | MEDLINE | ID: mdl-30458725

RESUMEN

BACKGROUND: While the reconstruction of transcripts from a sample of RNA-Seq data is a computationally expensive and complicated task, the detection of splicing events from RNA-Seq data and a gene annotation is computationally feasible. This latter task, which is adequate for many transcriptome analyses, is usually achieved by aligning the reads to a reference genome, followed by comparing the alignments with a gene annotation, often implicitly represented by a graph: the splicing graph. RESULTS: We present ASGAL (Alternative Splicing Graph ALigner): a tool for mapping RNA-Seq data to the splicing graph, with the specific goal of detecting novel splicing events, involving either annotated or unannotated splice sites. ASGAL takes as input the annotated transcripts of a gene and a RNA-Seq sample, and computes (1) the spliced alignments of each read in input, and (2) a list of novel events with respect to the gene annotation. CONCLUSIONS: An experimental analysis shows that ASGAL allows to enrich the annotation with novel alternative splicing events even when genes in an experiment express at most one isoform. Compared with other tools which use the spliced alignment of reads against a reference genome for differential analysis, ASGAL better predicts events that use splice sites which are novel with respect to a splicing graph, showing a higher accuracy. To the best of our knowledge, ASGAL is the first tool that detects novel alternative splicing events by directly aligning reads to a splicing graph. AVAILABILITY: Source code, documentation, and data are available for download at http://asgal.algolab.eu .


Asunto(s)
Empalme Alternativo/genética , Empalme del ARN/genética , ARN/genética , Análisis de Secuencia de ARN/métodos , Humanos
13.
Bioinformatics ; 32(11): 1610-7, 2016 06 01.
Artículo en Inglés | MEDLINE | ID: mdl-26315913

RESUMEN

MOTIVATION: Haplotype assembly is the computational problem of reconstructing haplotypes in diploid organisms and is of fundamental importance for characterizing the effects of single-nucleotide polymorphisms on the expression of phenotypic traits. Haplotype assembly highly benefits from the advent of 'future-generation' sequencing technologies and their capability to produce long reads at increasing coverage. Existing methods are not able to deal with such data in a fully satisfactory way, either because accuracy or performances degrade as read length and sequencing coverage increase or because they are based on restrictive assumptions. RESULTS: By exploiting a feature of future-generation technologies-the uniform distribution of sequencing errors-we designed an exact algorithm, called HapCol, that is exponential in the maximum number of corrections for each single-nucleotide polymorphism position and that minimizes the overall error-correction score. We performed an experimental analysis, comparing HapCol with the current state-of-the-art combinatorial methods both on real and simulated data. On a standard benchmark of real data, we show that HapCol is competitive with state-of-the-art methods, improving the accuracy and the number of phased positions. Furthermore, experiments on realistically simulated datasets revealed that HapCol requires significantly less computing resources, especially memory. Thanks to its computational efficiency, HapCol can overcome the limits of previous approaches, allowing to phase datasets with higher coverage and without the traditional all-heterozygous assumption. AVAILABILITY AND IMPLEMENTATION: Our source code is available under the terms of the GNU General Public License at http://hapcol.algolab.eu/ CONTACT: bonizzoni@disco.unimib.it SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Haplotipos , Algoritmos , Diploidia , Polimorfismo de Nucleótido Simple , Análisis de Secuencia de ADN , Programas Informáticos
14.
Bioinformatics ; 30(1): 17-23, 2014 Jan 01.
Artículo en Inglés | MEDLINE | ID: mdl-23645816

RESUMEN

MOTIVATION: TANGO is one of the most accurate tools for the taxonomic assignment of sequence reads. However, because of the differences in the taxonomy structures, performing a taxonomic assignment on different reference taxonomies will produce divergent results. RESULTS: We have improved the TANGO pipeline to be able to perform the taxonomic assignment of a metagenomic sample using alternative reference taxonomies, coming from different sources. We highlight the novel pre-processing step, necessary to accomplish this task, and describe the improvements in the assignment process. We present the new TANGO pipeline in details, and, finally, we show its performance on four real metagenomic datasets and also on synthetic datasets. AVAILABILITY: The new version of TANGO, including implementation improvements and novel developments to perform the assignment on different reference taxonomies, is freely available at http://sourceforge.net/projects/taxoassignment/.


Asunto(s)
Metagenómica/métodos , Programas Informáticos , Algoritmos , Metagenómica/clasificación
15.
BMC Genomics ; 15 Suppl 6: S10, 2014.
Artículo en Inglés | MEDLINE | ID: mdl-25572381

RESUMEN

BACKGROUND: The perfect phylogeny is an often used model in phylogenetics since it provides an efficient basic procedure for representing the evolution of genomic binary characters in several frameworks, such as for example in haplotype inference. The model, which is conceptually the simplest, is based on the infinite sites assumption, that is no character can mutate more than once in the whole tree. A main open problem regarding the model is finding generalizations that retain the computational tractability of the original model but are more flexible in modeling biological data when the infinite site assumption is violated because of e.g. back mutations. A special case of back mutations that has been considered in the study of the evolution of protein domains (where a domain is acquired and then lost) is persistency, that is the fact that a character is allowed to return back to the ancestral state. In this model characters can be gained and lost at most once. In this paper we consider the computational problem of explaining binary data by the Persistent Perfect Phylogeny model (referred as PPP) and for this purpose we investigate the problem of reconstructing an evolution where some constraints are imposed on the paths of the tree. RESULTS: We define a natural generalization of the PPP problem obtained by requiring that for some pairs (character, species), neither the species nor any of its ancestors can have the character. In other words, some characters cannot be persistent for some species. This new problem is called Constrained PPP (CPPP). Based on a graph formulation of the CPPP problem, we are able to provide a polynomial time solution for the CPPP problem for matrices whose conflict graph has no edges. Using this result, we develop a parameterized algorithm for solving the CPPP problem where the parameter is the number of characters. CONCLUSIONS: A preliminary experimental analysis shows that the constrained persistent perfect phylogeny model allows to explain efficiently data that do not conform with the classical perfect phylogeny model.


Asunto(s)
Evolución Molecular , Modelos Genéticos , Filogenia , Algoritmos
16.
Nucleic Acids Res ; 39(Database issue): D80-5, 2011 Jan.
Artículo en Inglés | MEDLINE | ID: mdl-21051348

RESUMEN

Alternative splicing is emerging as a major mechanism for the expansion of the transcriptome and proteome diversity, particularly in human and other vertebrates. However, the proportion of alternative transcripts and proteins actually endowed with functional activity is currently highly debated. We present here a new release of ASPicDB which now provides a unique annotation resource of human protein variants generated by alternative splicing. A total of 256,939 protein variants from 17,191 multi-exon genes have been extensively annotated through state of the art machine learning tools providing information of the protein type (globular and transmembrane), localization, presence of PFAM domains, signal peptides, GPI-anchor propeptides, transmembrane and coiled-coil segments. Furthermore, full-length variants can be now specifically selected based on the annotation of CAGE-tags and polyA signal and/or polyA sites, marking transcription initiation and termination sites, respectively. The retrieval can be carried out at gene, transcript, exon, protein or splice site level allowing the selection of data sets fulfilling one or more features settled by the user. The retrieval interface also enables the selection of protein variants showing specific differences in the annotated features. ASPicDB is available at http://www.caspur.it/ASPicDB/.


Asunto(s)
Empalme Alternativo , Bases de Datos Genéticas , Proteínas/química , Proteínas/genética , Exones , Variación Genética , Humanos , Isoformas de Proteínas/química , Isoformas de Proteínas/genética , ARN Mensajero/química , Análisis de Secuencia de Proteína , Interfaz Usuario-Computador
17.
Int Symp String Process Inf Retr ; 14240: 89-101, 2023 Sep.
Artículo en Inglés | MEDLINE | ID: mdl-39149146

RESUMEN

The positional Burrows-Wheeler Transform (PBWT) was presented as a means to find set-maximal exact matches (SMEMs) in haplotype data via the computation of the divergence array. Although run-length encoding the PBWT has been previously considered, storing the divergence array along with the PBWT in a compressed manner has not been as rigorously studied. We define two queries that can be used in combination to compute SMEMs, allowing us to define smaller data structures that support one or both of these queries. We combine these data structures, enabling the PBWT and the divergence array to be stored in a manner that allows for finding SMEMs. We estimate and compare the memory usage of these data structures, leading to one data structure that is most memory efficient. Lastly, we implement this data structure and compare its performance to prior methods using various datasets taken from the 1000 Genomes Project data.

18.
BMC Bioinformatics ; 13 Suppl 5: S2, 2012 Apr 12.
Artículo en Inglés | MEDLINE | ID: mdl-22537006

RESUMEN

BACKGROUND: A challenging issue in designing computational methods for predicting the gene structure into exons and introns from a cluster of transcript (EST, mRNA) sequences, is guaranteeing accuracy as well as efficiency in time and space, when large clusters of more than 20,000 ESTs and genes longer than 1 Mb are processed. Traditionally, the problem has been faced by combining different tools, not specifically designed for this task. RESULTS: We propose a fast method based on ad hoc procedures for solving the problem. Our method combines two ideas: a novel algorithm of proved small time complexity for computing spliced alignments of a transcript against a genome, and an efficient algorithm that exploits the inherent redundancy of information in a cluster of transcripts to select, among all possible factorizations of EST sequences, those allowing to infer splice site junctions that are largely confirmed by the input data. The EST alignment procedure is based on the construction of maximal embeddings, that are sequences obtained from paths of a graph structure, called embedding graph, whose vertices are the maximal pairings of a genomic sequence T and an EST P. The procedure runs in time linear in the length of P and T and in the size of the output.The method was implemented into the PIntron package. PIntron requires as input a genomic sequence or region and a set of EST and/or mRNA sequences. Besides the prediction of the full-length transcript isoforms potentially expressed by the gene, the PIntron package includes a module for the CDS annotation of the predicted transcripts. CONCLUSIONS: PIntron, the software tool implementing our methodology, is available at http://www.algolab.eu/PIntron under GNU AGPL. PIntron has been shown to outperform state-of-the-art methods, and to quickly process some critical genes. At the same time, PIntron exhibits high accuracy (sensitivity and specificity) when benchmarked with ENCODE annotations.


Asunto(s)
Algoritmos , Empalme Alternativo , Etiquetas de Secuencia Expresada , Animales , Exones , Genómica , Humanos , Intrones , Alineación de Secuencia , Programas Informáticos
19.
Gigascience ; 122022 12 28.
Artículo en Inglés | MEDLINE | ID: mdl-36576129

RESUMEN

BACKGROUND: Since the beginning of the coronavirus disease 2019 pandemic, there has been an explosion of sequencing of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus, making it the most widely sequenced virus in the history. Several databases and tools have been created to keep track of genome sequences and variants of the virus; most notably, the GISAID platform hosts millions of complete genome sequences, and it is continuously expanding every day. A challenging task is the development of fast and accurate tools that are able to distinguish between the different SARS-CoV-2 variants and assign them to a clade. RESULTS: In this article, we leverage the frequency chaos game representation (FCGR) and convolutional neural networks (CNNs) to develop an original method that learns how to classify genome sequences that we implement into CouGaR-g, a tool for the clade assignment problem on SARS-CoV-2 sequences. On a testing subset of the GISAID, CouGaR-g achieved an $96.29\%$ overall accuracy, while a similar tool, Covidex, obtained a $77,12\%$ overall accuracy. As far as we know, our method is the first using deep learning and FCGR for intraspecies classification. Furthermore, by using some feature importance methods, CouGaR-g allows to identify k-mers that match SARS-CoV-2 marker variants. CONCLUSIONS: By combining FCGR and CNNs, we develop a method that achieves a better accuracy than Covidex (which is based on random forest) for clade assignment of SARS-CoV-2 genome sequences, also thanks to our training on a much larger dataset, with comparable running times. Our method implemented in CouGaR-g is able to detect k-mers that capture relevant biological information that distinguishes the clades, known as marker variants. AVAILABILITY: The trained models can be tested online providing a FASTA file (with 1 or multiple sequences) at https://huggingface.co/spaces/BIASLab/sars-cov-2-classification-fcgr. CouGaR-g is also available at https://github.com/AlgoLab/CouGaR-g under the GPL.


Asunto(s)
COVID-19 , Aprendizaje Profundo , Puma , Animales , SARS-CoV-2/genética , Puma/genética , Genoma Viral
20.
J Comput Biol ; 29(2): 188-194, 2022 02.
Artículo en Inglés | MEDLINE | ID: mdl-35041518

RESUMEN

Efficiently finding maximal exact matches (MEMs) between a sequence read and a database of genomes is a key first step in read alignment. But until recently, it was unknown how to build a data structure in [Formula: see text] space that supports efficient MEM finding, where r is the number of runs in the Burrows-Wheeler Transform. In 2021, Rossi et al. showed how to build a small auxiliary data structure called thresholds in addition to the r-index in [Formula: see text] space. This addition enables efficient MEM finding using the r-index. In this article, we present the tool that implements this solution, which we call MONI. Namely, we give a high-level view of the main components of the data structure and show how the source code can be downloaded, compiled, and used to find MEMs between a set of sequence reads and a set of genomes.


Asunto(s)
Algoritmos , Alineación de Secuencia/estadística & datos numéricos , Programas Informáticos , Biología Computacional , Bases de Datos Genéticas/estadística & datos numéricos , Genoma Humano , Genómica/estadística & datos numéricos , Humanos , Análisis de Secuencia de ADN/estadística & datos numéricos
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA