Results 1 - 9 of 9
1.
J Comput Biol ; 30(4): 469-491, 2023 Apr.
Article in English | MEDLINE | ID: mdl-36730750

ABSTRACT

The massive amount of genomic data produced for SARS-CoV-2 since the beginning of the COVID-19 pandemic has challenged traditional methods for studying its dynamics. As a result, new methods such as Pangolin, which can scale to the millions of SARS-CoV-2 samples currently available, have appeared. Pangolin is tailored to take as input assembled, aligned, and curated full-length sequences, such as those found in the GISAID database. As high-throughput sequencing technologies continue to advance, such assembly, alignment, and curation may become a bottleneck, creating a need for methods that can process raw sequencing reads directly. In this article, we propose Reads2Vec, an alignment-free embedding approach that generates a fixed-length feature vector directly from raw sequencing reads, without requiring assembly. Furthermore, since such an embedding is a numerical representation, it can be used as input to highly optimized classification and clustering algorithms. Experiments on simulated data show that our proposed embedding obtains better classification results and better clustering properties than existing alignment-free baselines. In a study on real data, we show that alignment-free embeddings have better clustering properties than the Pangolin tool and that the spike region of the SARS-CoV-2 genome heavily informs the alignment-free clusterings, which is consistent with current biological knowledge of SARS-CoV-2.
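
The idea of turning raw reads into a fixed-length numerical vector can be illustrated with a plain k-mer frequency spectrum. This is a generic alignment-free baseline, not the Reads2Vec method itself, whose details are not given in the abstract; the function name `kmer_spectrum` and the default `k=3` are assumptions made for this sketch.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def kmer_spectrum(reads, k=3):
    """Return a fixed-length k-mer frequency vector for a set of raw reads.

    Hypothetical helper for illustration: the vector has 4**k coordinates
    regardless of how many reads are given or how long they are.
    """
    # One coordinate per possible k-mer, in lexicographic order.
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in index:  # skip k-mers containing N or other symbols
                counts[kmer] += 1
    total = sum(counts.values())
    vec = [0.0] * len(index)
    for kmer, c in counts.items():
        vec[index[kmer]] = c / total  # normalise counts to frequencies
    return vec

if __name__ == "__main__":
    sample = ["ACGTTGCA", "TTGCAAC", "ACGNNTG"]  # made-up reads
    print(len(kmer_spectrum(sample, k=3)))
```

Because every sample maps to a vector of the same length, such vectors can be passed directly to standard classifiers or clustering routines, which is the property the abstract relies on.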


Subject(s)
COVID-19; Pangolins; Humans; Animals; Pandemics; SARS-CoV-2/genetics; COVID-19/genetics; High-Throughput Nucleotide Sequencing/methods
2.
J Comput Biol ; 28(11): 1142-1155, 2021 Nov.
Article in English | MEDLINE | ID: mdl-34698531

ABSTRACT

In recent years, there has been an increasing number of single-cell sequencing studies, producing a considerable number of new data sets. This has particularly affected the field of cancer analysis, where more and more articles are published using this sequencing technique, which captures more detailed information about the specific genetic mutations in each individually sampled cell. As the amount of information increases, more sophisticated and faster tools are needed to analyze the samples. To this end, we developed plastic (PipeLine Amalgamating Single-cell Tree Inference Components), an easy-to-use and quick-to-adapt pipeline that integrates three steps: (1) simplifying the input data, (2) inferring tumor phylogenies, and (3) comparing the phylogenies. We have created a pipeline submodule for each of those steps and developed new in-memory data structures that allow for easy and transparent sharing of information across the tools implementing them. While we use existing open source tools for those steps, we have extended the tool used for simplifying the input data by incorporating two machine learning procedures, which greatly reduce the running time without affecting the quality of the downstream analysis. Moreover, we have added the ability to produce plots for quickly visualizing results.


Subject(s)
Computational Biology/methods; Mutation; Neoplasms/classification; Humans; Internet; Neoplasms/genetics; Phylogeny; Sequence Analysis, DNA; Single-Cell Analysis; Software
3.
IEEE/ACM Trans Comput Biol Bioinform ; 16(5): 1410-1423, 2019.
Article in English | MEDLINE | ID: mdl-31603766

ABSTRACT

Most evolutionary history reconstruction approaches are based on the infinite sites assumption, which states that each mutation appears exactly once in the evolutionary history and is never lost. The Perfect Phylogeny model follows from the infinite sites assumption and has been widely used to infer cancer evolution. Nonetheless, recent results show that recurrent and back mutations are present in the evolutionary history of tumors, hence the Perfect Phylogeny model might be too restrictive. We propose an approach that allows both the loss of previously acquired mutations and multiple acquisitions of a character. Moreover, we provide an ILP formulation for the evolutionary tree reconstruction problem. Our formulation allows us to tackle both the Incomplete Directed Phylogeny problem and the Clonal Reconstruction problem when general evolutionary models are considered. The latter problem is fundamental in cancer genomics: the goal is to study the evolutionary history of a tumor, taking as input the fraction of cells carrying a certain mutation in a set of cancer samples. For the Clonal Reconstruction problem, an experimental analysis shows the advantage of allowing mutation losses. Namely, on real and simulated datasets, our ILP approach provides a better interpretation of the evolutionary history than a Perfect Phylogeny. The software is available at https://github.com/AlgoLab/gppf.
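
To make the restriction imposed by the Perfect Phylogeny model concrete, the sketch below checks whether a binary sample-by-mutation matrix is compatible with it, assuming a mutation-free root. It uses the classical pairwise condition (two mutations conflict when their carrier sets overlap without one containing the other); it is not the ILP formulation of the paper, and the function names are hypothetical.

```python
from itertools import combinations

def conflicting_pairs(matrix):
    """Return pairs of columns (mutations) that cannot coexist in a
    perfect phylogeny rooted at the all-zero state.

    matrix: list of rows, one per sample, with entries 0 or 1.
    """
    n_cols = len(matrix[0]) if matrix else 0
    # For each mutation, the set of samples that carry it.
    carriers = [
        {r for r, row in enumerate(matrix) if row[c] == 1}
        for c in range(n_cols)
    ]
    conflicts = []
    for a, b in combinations(range(n_cols), 2):
        shared = carriers[a] & carriers[b]
        # Conflict: the sets overlap, but neither contains the other.
        if shared and shared != carriers[a] and shared != carriers[b]:
            conflicts.append((a, b))
    return conflicts

def admits_perfect_phylogeny(matrix):
    """True when no pair of mutations conflicts."""
    return not conflicting_pairs(matrix)

if __name__ == "__main__":
    # Made-up matrices, not data from the paper.
    nested = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]
    clash = [[1, 1], [1, 0], [0, 1]]
    print(admits_perfect_phylogeny(nested), conflicting_pairs(clash))
```

A matrix that fails this test can only be explained by relaxing the model, for example by allowing a mutation to be lost or acquired more than once, which is the kind of relaxation the abstract describes.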


Subject(s)
Genomics/methods; Neoplasms; Software; Algorithms; Humans; Mutation/genetics; Neoplasms/classification; Neoplasms/genetics; Neoplasms/metabolism; Phylogeny
4.
BMC Bioinformatics ; 19(1): 444, 2018 Nov 20.
Article in English | MEDLINE | ID: mdl-30458725

ABSTRACT

BACKGROUND: While the reconstruction of transcripts from a sample of RNA-Seq data is a computationally expensive and complicated task, the detection of splicing events from RNA-Seq data and a gene annotation is computationally feasible. This latter task, which is adequate for many transcriptome analyses, is usually achieved by aligning the reads to a reference genome and then comparing the alignments with a gene annotation, often implicitly represented by a graph: the splicing graph. RESULTS: We present ASGAL (Alternative Splicing Graph ALigner), a tool for mapping RNA-Seq data to the splicing graph, with the specific goal of detecting novel splicing events involving either annotated or unannotated splice sites. ASGAL takes as input the annotated transcripts of a gene and an RNA-Seq sample, and computes (1) the spliced alignment of each input read, and (2) a list of events that are novel with respect to the gene annotation. CONCLUSIONS: An experimental analysis shows that ASGAL can enrich the annotation with novel alternative splicing events even when the genes in an experiment express at most one isoform. Compared with other tools that use the spliced alignment of reads against a reference genome for differential analysis, ASGAL predicts events that use splice sites novel with respect to a splicing graph more accurately. To the best of our knowledge, ASGAL is the first tool that detects novel alternative splicing events by directly aligning reads to a splicing graph. AVAILABILITY: Source code, documentation, and data are available for download at http://asgal.algolab.eu .
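
As a rough illustration of what a splicing graph built from a gene annotation looks like, the sketch below takes exons as nodes and links exons that are consecutive in at least one annotated transcript. It is a simplified model under stated assumptions, not ASGAL's implementation: the exon coordinates are made up, and `build_splicing_graph` and `is_annotated_junction` are hypothetical names.

```python
def build_splicing_graph(transcripts):
    """Build a splicing graph from annotated transcripts.

    transcripts: dict mapping a transcript name to its exons, given as a
    list of (start, end) tuples sorted by genomic position.
    Returns (nodes, edges): the distinct exons, and the pairs of exons
    that are consecutive in at least one transcript.
    """
    nodes = set()
    edges = set()
    for exons in transcripts.values():
        nodes.update(exons)
        edges.update(zip(exons, exons[1:]))
    return nodes, edges

def is_annotated_junction(edges, donor_end, acceptor_start):
    """True if a junction joining these two positions is in the annotation."""
    return any(u[1] == donor_end and v[0] == acceptor_start for u, v in edges)

if __name__ == "__main__":
    annotation = {  # made-up coordinates
        "t1": [(1, 100), (201, 300), (401, 500)],
        "t2": [(1, 100), (401, 500)],
    }
    nodes, edges = build_splicing_graph(annotation)
    # A read spliced from position 100 to position 350 would use an
    # acceptor site absent from the annotation, i.e. a candidate novel event.
    print(is_annotated_junction(edges, 100, 350))
```

In this simplified view, a spliced read whose junction is not among the graph edges points to a candidate novel event, either a new combination of annotated splice sites or an unannotated site.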


Subject(s)
Alternative Splicing/genetics; RNA Splicing/genetics; RNA/genetics; Sequence Analysis, RNA/methods; Humans
5.
J Comput Biol ; 24(10): 953-968, 2017 Oct.
Article in English | MEDLINE | ID: mdl-28715269

ABSTRACT

The string graph for a collection of next-generation reads is a lossless data representation that is fundamental for de novo assemblers based on the overlap-layout-consensus paradigm. In this article, we explore a novel approach to computing the string graph, based on the FM-index and the Burrows-Wheeler transform. We describe a simple algorithm that uses only the FM-index representation of the collection of reads to construct the string graph, without accessing the input reads. Our algorithm has been integrated into the string graph assembler (SGA) as a standalone module for constructing the string graph. The new integrated assembler, fast string graph (FSG), has been assessed on a standard benchmark, showing that FSG is significantly faster than SGA while maintaining a moderate use of main memory, and that running FSG on multiple threads brings practical advantages. Moreover, we have studied the effect of coverage rates on the running times.
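
For readers unfamiliar with the FM-index, the sketch below shows its core operation, backward search, which counts the occurrences of a pattern using only the Burrows-Wheeler transform and two small tables. It is a textbook illustration on a single string, not the FSG algorithm; the construction by sorting rotations is quadratic and chosen only for brevity, and all function names are assumptions.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    s = text + "$"  # sentinel, assumed absent from text and smaller than all symbols
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def fm_index(bwt_str):
    """Build the two FM-index tables from the BWT.

    C[c]      : number of symbols in the text strictly smaller than c
    occ[c][i] : number of occurrences of c in bwt_str[:i]
    """
    symbols = sorted(set(bwt_str))
    C, total = {}, 0
    for c in symbols:
        C[c] = total
        total += bwt_str.count(c)
    occ = {c: [0] * (len(bwt_str) + 1) for c in symbols}
    for i, ch in enumerate(bwt_str):
        for c in symbols:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return C, occ

def count_occurrences(pattern, C, occ, n):
    """Backward search: count occurrences of a non-empty pattern."""
    lo, hi = 0, n  # half-open interval of sorted-suffix ranks
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo

if __name__ == "__main__":
    transformed = bwt("ACGTACGTTACG")  # made-up sequence
    C, occ = fm_index(transformed)
    print(count_occurrences("ACG", C, occ, len(transformed)))
```

Note that the search never touches the original text, only the transform and the tables, which is the property that makes it possible to work from the index alone.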


Subject(s)
Algorithms; Computational Biology/methods; Genomics/methods; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Genome, Human; Humans
6.
J Comput Biol ; 23(3): 137-49, 2016 Mar.
Article in English | MEDLINE | ID: mdl-26953874

ABSTRACT

The large amount of short read data that has to be assembled in future applications, such as metagenomics or cancer genomics, strongly motivates the investigation of disk-based approaches to index next-generation sequencing (NGS) data. Positive results in this direction stimulate the investigation of efficient external memory algorithms for de novo assembly from NGS data. Our article is also motivated by the open problem of designing a space-efficient algorithm to compute a string graph using an indexing procedure based on the Burrows-Wheeler transform (BWT). We have developed a disk-based algorithm for computing string graphs in external memory: the light string graph (LSG). LSG relies on a new representation of the FM-index, which is exploited so that the amount of main memory required is independent of the size of the data set. Moreover, we have developed a pipeline for genome assembly from NGS data that integrates LSG with the assembly step of SGA (Simpson and Durbin, 2012), a state-of-the-art string graph-based assembler, and uses BEETL for indexing the input data. LSG is open source software and is available online. We have analyzed our implementation on an 875-million read whole-genome dataset, on which LSG built the string graph using only 1 GB of main memory (reducing memory occupation by a factor of 50 with respect to SGA), while requiring slightly more than twice the time of SGA. The analysis of the entire pipeline shows an important decrease in memory usage with only a moderate increase in running time.
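
To show what a string graph contains, the sketch below builds one naively in main memory: it drops reads contained in other reads, computes all suffix-prefix overlaps, and removes transitive edges. It is a deliberately simplified, quadratic illustration and is unrelated to how LSG works in external memory; it ignores reverse complements and sequencing errors, removes an edge whenever a two-step path exists without checking that the overlaps are consistent, and uses hypothetical function names.

```python
def overlap_len(a, b, min_len):
    """Length of the longest proper suffix of a that equals a prefix of b,
    provided it is at least min_len; 0 otherwise."""
    longest = min(len(a), len(b)) - 1
    for k in range(longest, max(min_len, 1) - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def string_graph(reads, min_len=3):
    """Return {(a, b): overlap length} after removing contained reads
    and transitive edges (simplified)."""
    reads = sorted(set(reads))
    # A read that is a substring of another read carries no extra information.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap_len(a, b, min_len)
                if k:
                    edges[(a, b)] = k
    # Drop a -> c when some b gives a -> b and b -> c.
    return {
        (a, c): k
        for (a, c), k in edges.items()
        if not any((a, b) in edges and (b, c) in edges for b in reads)
    }

if __name__ == "__main__":
    made_up_reads = ["ACGTT", "CGTTG", "GTTGC", "TTGCA", "GTT"]
    for (a, b), k in sorted(string_graph(made_up_reads).items()):
        print(a, "->", b, k)
```

Even on this toy input, the number of candidate read pairs grows quadratically, which hints at why index-based and disk-based constructions are attractive for data sets with hundreds of millions of reads.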


Subject(s)
Contig Mapping/methods; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Software; Genome, Human; Humans
7.
Nucleic Acids Res ; 44(D1): D38-47, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26538599

ABSTRACT

Life sciences are yielding huge data sets that underpin scientific discoveries fundamental to improvement in human health, agriculture and the environment. In support of these discoveries, a plethora of databases and tools are deployed, in technically complex and diverse implementations, across a spectrum of scientific disciplines. The corpus of documentation of these resources is fragmented across the Web, with much redundancy, and has lacked a common standard of information. The outcome is that scientists must often struggle to find, understand, compare and use the best resources for the task at hand. Here we present a community-driven curation effort, supported by ELIXIR (the European infrastructure for biological information), that aspires to a comprehensive and consistent registry of information about bioinformatics resources. The sustainable upkeep of this Tools and Data Services Registry is assured by a curation effort driven by and tailored to local needs, and shared amongst a network of engaged partners. As of November 2015, the registry includes 1785 resources, with depositions from 126 individual registrations including 52 institutional providers and 74 individuals. With community support, the registry can become a standard for dissemination of information about bioinformatics resources: we welcome everyone to join us in this common endeavour. The registry is freely available at https://bio.tools.


Subject(s)
Computational Biology; Registries; Data Curation; Software
8.
BMC Gastroenterol ; 14: 5, 2014 Jan 07.
Article in English | MEDLINE | ID: mdl-24397769

ABSTRACT

BACKGROUND: Data on the effect of oral bisphosphonates (BPs) on the risk of upper gastrointestinal complications (UGIC) are conflicting. We conducted a large population-based study using a network of Italian healthcare utilization databases to assess the UGIC risk associated with use of BPs in the setting of secondary prevention of osteoporotic fractures. METHODS: A nested case-control study was carried out within a cohort of 68,970 patients aged 45 years or older who had been hospitalized for osteoporotic fracture from 2003 until 2005. Cases were the 804 patients who experienced hospitalization for UGIC until 2007. Up to 20 controls were randomly selected for each case. A conditional logistic regression model was used to estimate the odds ratio (OR) associated with current and past use of BPs (i.e. drug dispensation within 30 days and over 31 days prior to the outcome onset, respectively) after adjusting for several covariates. RESULTS: Compared with patients who did not use BPs, current and past users had ORs (with 95% confidence intervals) of 0.86 (0.60 to 1.22) and 1.07 (0.80 to 1.44), respectively. There was no difference in the ORs estimated according to BP type (alendronate or risedronate) and regimen (daily or weekly), nor according to co-therapies and comorbidities. CONCLUSIONS: This study supplies further evidence that BPs dispensed for secondary prevention of osteoporotic fractures are not associated with an increased risk of severe gastrointestinal complications. Further research is required to clarify the role of BPs and of other co-medication drugs in inducing UGIC.
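
For orientation on how an odds ratio and its 95% confidence interval are read, the sketch below computes the crude (unadjusted) odds ratio from a 2x2 table with a Wald interval on the log scale. This is not the method of the study, which used conditional logistic regression on matched sets with covariate adjustment; the counts in the example are invented for illustration and are not taken from the paper, and `odds_ratio_ci` is a hypothetical name.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Crude odds ratio with a Wald confidence interval.

    a: exposed cases      b: unexposed cases
    c: exposed controls   d: unexposed controls
    z: normal quantile (1.96 for a 95% interval)
    All four counts are assumed to be positive.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

if __name__ == "__main__":
    # Invented counts, for illustration only.
    estimate, lower, upper = odds_ratio_ci(10, 20, 30, 40)
    print(round(estimate, 2), round(lower, 2), round(upper, 2))
```

An interval that contains 1, as both intervals reported in the abstract do, means the data are compatible with no association between exposure and outcome.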


Subject(s)
Bone Density Conservation Agents/administration & dosage; Diphosphonates/administration & dosage; Gastrointestinal Diseases/epidemiology; Administration, Oral; Aged; Aged, 80 and over; Bone Density Conservation Agents/adverse effects; Calcium Channel Blockers/therapeutic use; Case-Control Studies; Comorbidity; Diphosphonates/adverse effects; Female; Gastrointestinal Diseases/chemically induced; Humans; Hydroxymethylglutaryl-CoA Reductase Inhibitors/therapeutic use; Incidence; Italy/epidemiology; Male; Middle Aged; Osteoporotic Fractures/prevention & control; Risk Factors; Secondary Prevention
9.
J Comput Biol ; 21(1): 16-40, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24200390

ABSTRACT

Next-generation sequencing (NGS) technologies need new methodologies for alternative splicing (AS) analysis. Current computational methods for AS analysis from NGS data are mainly based on aligning short reads against a reference genome, while methods that do not need a reference genome are mostly underdeveloped. In this context, the main tools developed for NGS data focus on de novo transcriptome assembly (Grabherr et al., 2011; Schulz et al., 2012). While these tools are extensively applied in biological investigations, and the results obtained often reveal intrinsic shortcomings, a theoretical investigation of the inherent computational limits of transcriptome analysis from NGS data, when a reference genome is unknown or highly unreliable, is still missing. Moreover, we still lack methods for computing the gene structures due to AS events under the above assumptions, a problem that we start to tackle in this article. More precisely, based on the notion of isoform graph (Lacroix et al., 2008), we define a compact representation of gene structures, called the splicing graph, and investigate the computational problem of building a splicing graph that is (i) compatible with NGS data and (ii) isomorphic to the isoform graph. We characterize when there is only one representative splicing graph compatible with the input data, and we propose an efficient algorithmic approach to compute this graph.


Subject(s)
Alternative Splicing; Models, Genetic; Algorithms; Computational Biology; Computer Graphics; Databases, Nucleic Acid/statistics & numerical data; Gene Expression Profiling/statistics & numerical data; High-Throughput Nucleotide Sequencing/statistics & numerical data; Polymorphism, Single Nucleotide; Repetitive Sequences, Nucleic Acid; Sequence Alignment/statistics & numerical data; Sequence Analysis, RNA/statistics & numerical data