Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 20
Filtrar
Más filtros










Base de datos
Intervalo de año de publicación
1.
Bioinformatics ; 40(5)2024 May 02.
Artículo en Inglés | MEDLINE | ID: mdl-38676570

RESUMEN

MOTIVATION: Bacterial genomes present more variability than human genomes, which requires important adjustments in computational tools that are developed for human data. In particular, bacteria exhibit a mosaic structure due to homologous recombinations, but this fact is not sufficiently captured by standard read mappers that align against linear reference genomes. The recent introduction of pangenomics provides some insights in that context, as a pangenome graph can represent the variability within a species. However, the concept of sequence-to-graph alignment that captures the presence of recombinations has not been previously investigated. RESULTS: In this paper, we present the extension of the notion of sequence-to-graph alignment to a variation graph that incorporates a recombination, so that the latter are explicitly represented and evaluated in an alignment. Moreover, we present a dynamic programming approach for the special case where there is at most a recombination-we implement this case as RecGraph. From a modelling point of view, a recombination corresponds to identifying a new path of the variation graph, where the new arc is composed of two halves, each extracted from an original path, possibly joined by a new arc. Our experiments show that RecGraph accurately aligns simulated recombinant bacterial sequences that have at most a recombination, providing evidence for the presence of recombination events. AVAILABILITY AND IMPLEMENTATION: Our implementation is open source and available at https://github.com/AlgoLab/RecGraph.


Asunto(s)
Algoritmos , Genoma Bacteriano , Recombinación Genética , Alineación de Secuencia , Alineación de Secuencia/métodos , Humanos , Programas Informáticos , Análisis de Secuencia de ADN/métodos , Genómica/métodos
2.
BMC Bioinformatics ; 22(Suppl 15): 625, 2022 Apr 19.
Artículo en Inglés | MEDLINE | ID: mdl-35439933

RESUMEN

BACKGROUND: Being able to efficiently call variants from the increasing amount of sequencing data daily produced from multiple viral strains is of the utmost importance, as demonstrated during the COVID-19 pandemic, in order to track the spread of the viral strains across the globe. RESULTS: We present MALVIRUS, an easy-to-install and easy-to-use application that assists users in multiple tasks required for the analysis of a viral population, such as the SARS-CoV-2. MALVIRUS allows to: (1) construct a variant catalog consisting in a set of variations (SNPs/indels) from the population sequences, (2) efficiently genotype and annotate variants of the catalog supported by a read sample, and (3) when the considered viral species is the SARS-CoV-2, assign the input sample to the most likely Pango lineages using the genotyped variations. CONCLUSIONS: Tests on Illumina and Nanopore samples proved the efficiency and the effectiveness of MALVIRUS in analyzing SARS-CoV-2 strain samples with respect to publicly available data provided by NCBI and the more complete dataset provided by GISAID. A comparison with state-of-the-art tools showed that MALVIRUS is always more precise and often have a better recall.


Asunto(s)
COVID-19 , Genoma Viral , Secuenciación de Nucleótidos de Alto Rendimiento , Humanos , Mutación , Pandemias , Filogenia , SARS-CoV-2/genética
3.
Nat Comput ; 21(1): 81-108, 2022 Mar.
Artículo en Inglés | MEDLINE | ID: mdl-36969737

RESUMEN

Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations-thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.

4.
Bioinformatics ; 37(4): 464-472, 2021 05 01.
Artículo en Inglés | MEDLINE | ID: mdl-32926128

RESUMEN

MOTIVATION: Recent advances in high-throughput RNA-Seq technologies allow to produce massive datasets. When a study focuses only on a handful of genes, most reads are not relevant and degrade the performance of the tools used to analyze the data. Removing irrelevant reads from the input dataset leads to improved efficiency without compromising the results of the study. RESULTS: We introduce a novel computational problem, called gene assignment and we propose an efficient alignment-free approach to solve it. Given an RNA-Seq sample and a panel of genes, a gene assignment consists in extracting from the sample, the reads that most probably were sequenced from those genes. The problem becomes more complicated when the sample exhibits evidence of novel alternative splicing events. We implemented our approach in a tool called Shark and assessed its effectiveness in speeding up differential splicing analysis pipelines. This evaluation shows that Shark is able to significantly improve the performance of RNA-Seq analysis tools without having any impact on the final results. AVAILABILITY AND IMPLEMENTATION: The tool is distributed as a stand-alone module and the software is freely available at https://github.com/AlgoLab/shark. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Tiburones , Empalme Alternativo , Animales , RNA-Seq , Análisis de Secuencia de ARN , Tiburones/genética , Programas Informáticos
5.
Bioinformatics ; 36(5): 1622-1624, 2020 03 01.
Artículo en Inglés | MEDLINE | ID: mdl-31589304

RESUMEN

SUMMARY: Retroviruses and their vector derivatives integrate semi-randomly in the genome of host cells and are inherited by their progeny as stable genetic marks. The retrieval and mapping of the sequences flanking the virus-host DNA junctions allows the identification of insertion sites in gene therapy or virally infected patients, essential for monitoring the evolution of genetically modified cells in vivo. However, since ∼30% of insertions land in low complexity or repetitive regions of the host cell genome, they cannot be correctly assigned and are currently discarded, limiting the accuracy and predictive power of clonal tracking studies. Here, we present γ-TRIS, a new graph-based genome-free alignment tool for identifying insertion sites even if embedded in low complexity regions. By using γ-TRIS to reanalyze clinical studies, we observed improvements in clonal quantification and tracking. AVAILABILITY AND IMPLEMENTATION: Source code at https://bitbucket.org/bereste/g-tris. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Genoma , Genómica , Algoritmos , Secuencia de Bases , Humanos , Programas Informáticos
6.
J Comput Biol ; 26(9): 948-961, 2019 09.
Artículo en Inglés | MEDLINE | ID: mdl-31140836

RESUMEN

Indexing huge collections of strings, such as those produced by the widespread sequencing technologies, heavily relies on multistring generalizations of the Burrows-Wheeler transform (BWT) and the longest common prefix (LCP) array, since solving efficiently both problems are essential ingredients of several algorithms on a collection of strings, such as those for genome assembly. In this article, we explore a multithread computational strategy for building the BWT and LCP array. Our algorithm applies a divide and conquer approach that leads to parallel computation of multistring BWT and LCP array.


Asunto(s)
Algoritmos , Biología Computacional/métodos , Análisis de Secuencia/métodos
7.
J Comput Biol ; 24(10): 953-968, 2017 Oct.
Artículo en Inglés | MEDLINE | ID: mdl-28715269

RESUMEN

The string graph for a collection of next-generation reads is a lossless data representation that is fundamental for de novo assemblers based on the overlap-layout-consensus paradigm. In this article, we explore a novel approach to compute the string graph, based on the FM-index and Burrows and Wheeler Transform. We describe a simple algorithm that uses only the FM-index representation of the collection of reads to construct the string graph, without accessing the input reads. Our algorithm has been integrated into the string graph assembler (SGA) as a standalone module to construct the string graph. The new integrated assembler has been assessed on a standard benchmark, showing that fast string graph (FSG) is significantly faster than SGA while maintaining a moderate use of main memory, and showing practical advantages in running FSG on multiple threads. Moreover, we have studied the effect of coverage rates on the running times.


Asunto(s)
Algoritmos , Biología Computacional/métodos , Genómica/métodos , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Análisis de Secuencia de ADN/métodos , Genoma Humano , Humanos
8.
BMC Genomics ; 17(1): 857, 2016 11 03.
Artículo en Inglés | MEDLINE | ID: mdl-27809787

RESUMEN

BACKGROUND: SNP (single nucleotide polymorphisms) genotype data are increasingly available in cattle populations and, among other things, can be used to predict carriers of specific mutations. It is therefore convenient to have a practical statistical method for the accurate classification of individuals into carriers and non-carriers. In this paper, we compared - through cross-validation- five classification models (Lasso-penalized logistic regression -Lasso, Support Vector Machines with either linear or radial kernel -SVML and SVMR, k-nearest neighbors -KNN, and multi-allelic gene prediction -MAG), for the identification of carriers of the TUBD1 recessive mutation on BTA19 (Bos taurus autosome 19), known to be associated with high calf mortality. A population of 3116 Fleckvieh and 392 Brown Swiss animals genotyped with the 54K SNP-chip was available for the analysis. RESULTS: In general, the use of SNP genotypes proved to be very effective for the identification of mutation carriers. The best predictive models were Lasso, SVML and MAG, with an average error rate, respectively, of 0.2 %, 0.4 % and 0.6 % in Fleckvieh, and 1.2 %, 0.9 % and 1.7 % in Brown Swiss. For the three models, the false positive rate was, respectively, 0.1 %, 0.1 % and 0.2 % in Fleckvieh, and 3.0 %, 2.4 % and 1.6 % in Brown Swiss; the false negative rate was 4.4 %, 7.6 %1.0 % in Fleckvieh, and 0.0 %, 0.1% and 0.8 % in Brown Swiss. MAG appeared to be more robust to sample size reduction: with 25 % of the data, the average error rate was 0.7 % and 2.2 % in Fleckvieh and Brown Swiss, compared to 2.1 % and 5.5 % with Lasso, and 2.6 % and 12.0 % with SVML. CONCLUSIONS: The use of SNP genotypes is a very effective and efficient technique for the identification of mutation carriers in cattle populations. Very few misclassifications were observed, overall and both in the carriers and non-carriers classes. This indicates that this is a very reliable approach for potential applications in cattle breeding.


Asunto(s)
Genes Recesivos , Genotipo , Heterocigoto , Mutación , Polimorfismo de Nucleótido Simple , Algoritmos , Animales , Bovinos , Femenino , Tamización de Portadores Genéticos , Masculino , Reproducibilidad de los Resultados , Máquina de Vectores de Soporte
9.
Genom Data ; 9: 100-4, 2016 Sep.
Artículo en Inglés | MEDLINE | ID: mdl-27486565

RESUMEN

Forskolin (FSK) induces activation of protein kinase A (PKA). This activation protects specifically some cancer cells from death induced by glucose starvation. Cell effects upon FSK treatment prompted us to investigate in detail the physiological role of PKA in the activation of pro-survival mechanisms in glucose starvation. In this regard we performed a microarray analysis of normal NIH3T3 and transformed NIH3T3-K-ras mouse fibroblasts cultured at 1 mM glucose and daily treated or not with 10 µM FSK until 72 h of growth, when the samples were collected. The microarray is deposited into Gene Expression Omnibus under Series GSE68266. The microarray data revealed that the activation of PKA regulates the expression of genes involved in metabolic, stress-response and pro-survival processes, like glutamine metabolism, autophagy and unfolded protein response, preventing cancer cell death in glucose starvation. Altogether these findings suggest that PKA activation, by inducing a complex transcriptional program, leads to cancer survival in nutrient stress, a typical feature of developing tumor. These transcriptional data, identifying this important role of PKA, will be useful to identify novel target in cancer therapy.

10.
J Comput Biol ; 23(9): 718-36, 2016 Sep.
Artículo en Inglés | MEDLINE | ID: mdl-27280382

RESUMEN

In diploid genomes, haplotype assembly is the computational problem of reconstructing the two parental copies, called haplotypes, of each chromosome starting from sequencing reads, called fragments, possibly affected by sequencing errors. Minimum error correction (MEC) is a prominent computational problem for haplotype assembly and, given a set of fragments, aims at reconstructing the two haplotypes by applying the minimum number of base corrections. MEC is computationally hard to solve, but some approximation-based or fixed-parameter approaches have been proved capable of obtaining accurate results on real data. In this work, we expand the current characterization of the computational complexity of MEC from the approximation and the fixed-parameter tractability point of view. In particular, we show that MEC is not approximable within a constant factor, whereas it is approximable within a logarithmic factor in the size of the input. Furthermore, we answer open questions on the fixed-parameter tractability for parameters of classical or practical interest: the total number of corrections and the fragment length. In addition, we present a direct 2-approximation algorithm for a variant of the problem that has also been applied in the framework of clustering data. Finally, since polyploid genomes, such as those of plants and fishes, are composed of more than two copies of the chromosomes, we introduce a novel formulation of MEC, namely the k-ploid MEC problem, that extends the traditional problem to deal with polyploid genomes. We show that the novel formulation is still both computationally hard and hard to approximate. Nonetheless, from the parameterized point of view, we prove that the problem is tractable for parameters of practical interest such as the number of haplotypes and the coverage, or the number of haplotypes and the fragment length.


Asunto(s)
Algoritmos , Diploidia , Genoma Humano , Haplotipos , Poliploidía , Análisis de Secuencia de ADN/métodos , Humanos , Modelos Genéticos , Polimorfismo de Nucleótido Simple
11.
J Comput Biol ; 23(3): 137-49, 2016 Mar.
Artículo en Inglés | MEDLINE | ID: mdl-26953874

RESUMEN

The large amount of short read data that has to be assembled in future applications, such as in metagenomics or cancer genomics, strongly motivates the investigation of disk-based approaches to index next-generation sequencing (NGS) data. Positive results in this direction stimulate the investigation of efficient external memory algorithms for de novo assembly from NGS data. Our article is also motivated by the open problem of designing a space-efficient algorithm to compute a string graph using an indexing procedure based on the Burrows-Wheeler transform (BWT). We have developed a disk-based algorithm for computing string graphs in external memory: the light string graph (LSG). LSG relies on a new representation of the FM-index that is exploited to use an amount of main memory requirement that is independent from the size of the data set. Moreover, we have developed a pipeline for genome assembly from NGS data that integrates LSG with the assembly step of SGA (Simpson and Durbin, 2012 ), a state-of-the-art string graph-based assembler, and uses BEETL for indexing the input data. LSG is open source software and is available online. We have analyzed our implementation on a 875-million read whole-genome dataset, on which LSG has built the string graph using only 1GB of main memory (reducing the memory occupation by a factor of 50 with respect to SGA), while requiring slightly more than twice the time than SGA. The analysis of the entire pipeline shows an important decrease in memory usage, while managing to have only a moderate increase in the running time.


Asunto(s)
Mapeo Contig/métodos , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Análisis de Secuencia de ADN/métodos , Programas Informáticos , Genoma Humano , Humanos
12.
PLoS Genet ; 12(3): e1005931, 2016 Mar.
Artículo en Inglés | MEDLINE | ID: mdl-26978032

RESUMEN

Cancer cells often rely on glycolysis to obtain energy and support anabolic growth. Several studies showed that glycolytic cells are susceptible to cell death when subjected to low glucose availability or to lack of glucose. However, some cancer cells, including glycolytic ones, can efficiently acquire higher tolerance to glucose depletion, leading to their survival and aggressiveness. Although increased resistance to glucose starvation has been shown to be a consequence of signaling pathways and compensatory metabolic routes activation, the full repertoire of the underlying molecular alterations remain elusive. Using omics and computational analyses, we found that cyclic adenosine monophosphate-Protein Kinase A (cAMP-PKA) axis activation is fundamental for cancer cell resistance to glucose starvation and anoikis. Notably, here we show that such a PKA-dependent survival is mediated by parallel activation of autophagy and glutamine utilization that in concert concur to attenuate the endoplasmic reticulum (ER) stress and to sustain cell anabolism. Indeed, the inhibition of PKA-mediated autophagy or glutamine metabolism increased the level of cell death, suggesting that the induction of autophagy and metabolic rewiring by PKA is important for cancer cellular survival under glucose starvation. Importantly, both processes actively participate to cancer cell survival mediated by suspension-activated PKA as well. In addition we identify also a PKA/Src mechanism capable to protect cancer cells from anoikis. Our results reveal for the first time the role of the versatile PKA in cancer cells survival under chronic glucose starvation and anoikis and may be a novel potential target for cancer treatment.


Asunto(s)
Autofagia/genética , Proteínas Quinasas Dependientes de AMP Cíclico/biosíntesis , AMP Cíclico/genética , Neoplasias/genética , Animales , Anoicis/genética , Línea Celular Tumoral , Supervivencia Celular/genética , AMP Cíclico/metabolismo , Proteínas Quinasas Dependientes de AMP Cíclico/genética , Estrés del Retículo Endoplásmico , Glucosa/deficiencia , Glucosa/metabolismo , Glutamina/metabolismo , Glucólisis , Humanos , Ratones , Neoplasias/metabolismo , Inanición , Transcriptoma
13.
Bioinformatics ; 32(11): 1610-7, 2016 06 01.
Artículo en Inglés | MEDLINE | ID: mdl-26315913

RESUMEN

MOTIVATION: Haplotype assembly is the computational problem of reconstructing haplotypes in diploid organisms and is of fundamental importance for characterizing the effects of single-nucleotide polymorphisms on the expression of phenotypic traits. Haplotype assembly highly benefits from the advent of 'future-generation' sequencing technologies and their capability to produce long reads at increasing coverage. Existing methods are not able to deal with such data in a fully satisfactory way, either because accuracy or performances degrade as read length and sequencing coverage increase or because they are based on restrictive assumptions. RESULTS: By exploiting a feature of future-generation technologies-the uniform distribution of sequencing errors-we designed an exact algorithm, called HapCol, that is exponential in the maximum number of corrections for each single-nucleotide polymorphism position and that minimizes the overall error-correction score. We performed an experimental analysis, comparing HapCol with the current state-of-the-art combinatorial methods both on real and simulated data. On a standard benchmark of real data, we show that HapCol is competitive with state-of-the-art methods, improving the accuracy and the number of phased positions. Furthermore, experiments on realistically simulated datasets revealed that HapCol requires significantly less computing resources, especially memory. Thanks to its computational efficiency, HapCol can overcome the limits of previous approaches, allowing to phase datasets with higher coverage and without the traditional all-heterozygous assumption. AVAILABILITY AND IMPLEMENTATION: Our source code is available under the terms of the GNU General Public License at http://hapcol.algolab.eu/ CONTACT: bonizzoni@disco.unimib.it SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Haplotipos , Algoritmos , Diploidia , Polimorfismo de Nucleótido Simple , Análisis de Secuencia de ADN , Programas Informáticos
14.
Methods Mol Biol ; 1269: 173-88, 2015.
Artículo en Inglés | MEDLINE | ID: mdl-25577379

RESUMEN

Alternative Splicing (AS) is the molecular phenomenon whereby multiple transcripts are produced from the same gene locus. As a consequence, it is responsible for the expansion of eukaryotic transcriptomes. Aberrant AS is involved in the onset and progression of several human diseases. Therefore, the characterization of exon-intron structure of a gene and the detection of corresponding transcript isoforms is an extremely relevant biological task. Nonetheless, the computational prediction of AS events and the repertoire of alternative transcripts is yet a challenging issue. Hereafter we introduce PIntron, a software package to predict the exon-intron structure and the full-length isoforms of a gene given a genomic region and a set of transcripts (ESTs and/or mRNAs). The software is open source and available at http://pintron.algolab.eu. PIntron has been designed for (and extensively tested on) a standard workstation without requiring dedicated expensive hardware. It easily manages large genomic regions and more than 20,000 ESTs, achieving good accuracy as shown in an experimental evaluation performed on 112 well-annotated genes selected from the ENCODE human regions used as training set in the EGASP competition.


Asunto(s)
Empalme Alternativo/genética , Programas Informáticos , Transcriptoma/genética
15.
J Comput Biol ; 21(1): 16-40, 2014 Jan.
Artículo en Inglés | MEDLINE | ID: mdl-24200390

RESUMEN

Next-generation sequencing (NGS) technologies need new methodologies for alternative splicing (AS) analysis. Current computational methods for AS analysis from NGS data are mainly based on aligning short reads against a reference genome, while methods that do not need a reference genome are mostly underdeveloped. In this context, the main developed tools for NGS data focus on de novo transcriptome assembly (Grabherr et al., 2011 ; Schulz et al., 2012). While these tools are extensively applied for biological investigations and often show intrinsic shortcomings from the obtained results, a theoretical investigation of the inherent computational limits of transcriptome analysis from NGS data, when a reference genome is unknown or highly unreliable, is still missing. On the other hand, we still lack methods for computing the gene structures due to AS events under the above assumptions--a problem that we start to tackle with this article. More precisely, based on the notion of isoform graph (Lacroix et al., 2008), we define a compact representation of gene structures--called splicing graph--and investigate the computational problem of building a splicing graph that is (i) compatible with NGS data and (ii) isomorphic to the isoform graph. We characterize when there is only one representative splicing graph compatible with input data, and we propose an efficient algorithmic approach to compute this graph.


Asunto(s)
Empalme Alternativo , Modelos Genéticos , Algoritmos , Biología Computacional , Gráficos por Computador , Bases de Datos de Ácidos Nucleicos/estadística & datos numéricos , Perfilación de la Expresión Génica/estadística & datos numéricos , Secuenciación de Nucleótidos de Alto Rendimiento/estadística & datos numéricos , Polimorfismo de Nucleótido Simple , Secuencias Repetitivas de Ácidos Nucleicos , Alineación de Secuencia/estadística & datos numéricos , Análisis de Secuencia de ARN/estadística & datos numéricos
16.
Artículo en Inglés | MEDLINE | ID: mdl-22848137

RESUMEN

The MINIMUM-RECOMBINANT HAPLOTYPE CONFIGURATION problem (MRHC) has been highly successful in providing a sound combinatorial formulation for the important problem of genotype phasing on pedigrees. Despite several algorithmic advances that have improved the efficiency, its applicability to real data sets has been limited since it does not take into account some important phenomena such as mutations, genotyping errors, and missing data. In this work, we propose the MINIMUM-RECOMBINANT HAPLOTYPE CONFIGURATION WITH BOUNDED ERRORS problem (MRHCE), which extends the original MRHC formulation by incorporating the two most common characteristics of real data: errors and missing genotypes (including untyped individuals). We describe a practical algorithm for MRHCE that is based on a reduction to the well-known Satisfiability problem (SAT) and exploits recent advances in the constraint programming literature. An experimental analysis demonstrates the biological soundness of the phasing model and the effectiveness (on both accuracy and performance) of the algorithm under several scenarios. The analysis on real data and the comparison with state-of-the-art programs reveals that our approach couples better scalability to large and complex pedigrees with the explicit inclusion of genotyping errors into the model.


Asunto(s)
Algoritmos , Biología Computacional/métodos , Genotipo , Modelos Genéticos , Linaje , Animales , Bovinos , Femenino , Haplotipos , Masculino , Mutación , Recombinación Genética
17.
BMC Bioinformatics ; 13 Suppl 5: S2, 2012 Apr 12.
Artículo en Inglés | MEDLINE | ID: mdl-22537006

RESUMEN

BACKGROUND: A challenging issue in designing computational methods for predicting the gene structure into exons and introns from a cluster of transcript (EST, mRNA) sequences, is guaranteeing accuracy as well as efficiency in time and space, when large clusters of more than 20,000 ESTs and genes longer than 1 Mb are processed. Traditionally, the problem has been faced by combining different tools, not specifically designed for this task. RESULTS: We propose a fast method based on ad hoc procedures for solving the problem. Our method combines two ideas: a novel algorithm of proved small time complexity for computing spliced alignments of a transcript against a genome, and an efficient algorithm that exploits the inherent redundancy of information in a cluster of transcripts to select, among all possible factorizations of EST sequences, those allowing to infer splice site junctions that are largely confirmed by the input data. The EST alignment procedure is based on the construction of maximal embeddings, that are sequences obtained from paths of a graph structure, called embedding graph, whose vertices are the maximal pairings of a genomic sequence T and an EST P. The procedure runs in time linear in the length of P and T and in the size of the output.The method was implemented into the PIntron package. PIntron requires as input a genomic sequence or region and a set of EST and/or mRNA sequences. Besides the prediction of the full-length transcript isoforms potentially expressed by the gene, the PIntron package includes a module for the CDS annotation of the predicted transcripts. CONCLUSIONS: PIntron, the software tool implementing our methodology, is available at http://www.algolab.eu/PIntron under GNU AGPL. PIntron has been shown to outperform state-of-the-art methods, and to quickly process some critical genes. At the same time, PIntron exhibits high accuracy (sensitivity and specificity) when benchmarked with ENCODE annotations.


Asunto(s)
Algoritmos , Empalme Alternativo , Etiquetas de Secuencia Expresada , Animales , Exones , Genómica , Humanos , Intrones , Alineación de Secuencia , Programas Informáticos
18.
Artículo en Inglés | MEDLINE | ID: mdl-21383417

RESUMEN

Haplotype Inference (HI) is a computational challenge of crucial importance in a range of genetic studies. Pedigrees allow to infer haplotypes from genotypes more accurately than population data, since Mendelian inheritance restricts the set of possible solutions. In this work, we define a new HI problem on pedigrees, called MINIMUM-CHANGE HAPLOTYPE CONFIGURATION (MCHC) problem, that allows two types of genetic variation events: recombinations and mutations. Our new formulation extends the MINIMUM-RECOMBINANT HAPLOTYPE CONFIGURATION (MRHC) problem, that has been proposed in the literature to overcome the limitations of classic statistical haplotyping methods. Our contribution is twofold. First, we prove that the MCHC problem is APX-hard under several restrictions. Second, we propose an efficient and accurate heuristic algorithm for MCHC based on an L-reduction to a well-known coding problem. Our heuristic can also be used to solve the original MRHC problem and can take advantage of additional knowledge about the input genotypes. Moreover, the L-reduction proves for the first time that MCHC and MRHC are O(nm/(log nm))-approximable on general pedigrees, where n is the pedigree size and m is the genotype length. Finally, we present an extensive experimental evaluation and comparison of our heuristic algorithm with several other state-of-the-art methods for HI on pedigrees.


Asunto(s)
Algoritmos , Haplotipos/genética , Modelos Genéticos , Linaje , Recombinación Genética/genética , Biología Computacional/métodos , Mutación
19.
Artículo en Inglés | MEDLINE | ID: mdl-20498511

RESUMEN

The haplotype resolution from xor-genotype data has been recently formulated as a new model for genetic studies. The xor-genotype data is a cheaply obtainable type of data distinguishing heterozygous from homozygous sites without identifying the homozygous alleles. In this paper, we propose a formulation based on a well-known model used in haplotype inference: pure parsimony. We exhibit exact solutions of the problem by providing polynomial time algorithms for some restricted cases and a fixed-parameter algorithm for the general case. These results are based on some interesting combinatorial properties of a graph representation of the solutions. Furthermore, we show that the problem has a polynomial time k-approximation, where k is the maximum number of xor-genotypes containing a given single nucleotide polymorphisms (SNP). Finally, we propose a heuristic and produce an experimental analysis showing that it scales to real-world large instances taken from the HapMap project.


Asunto(s)
Biología Computacional/métodos , Haplotipos , Algoritmos , Genotipo , Heterocigoto , Polimorfismo de Nucleótido Simple
20.
J Comput Biol ; 16(1): 43-66, 2009 Jan.
Artículo en Inglés | MEDLINE | ID: mdl-19119993

RESUMEN

Alternative splicing (AS) is currently considered as one of the main mechanisms able to explain the huge gap between the number of predicted genes and the high complexity of the proteome in humans. The rapid growth of Expressed Sequence Tag (EST) data has encouraged the development of computational methods to predict alternative splicing from the analysis of EST alignment to genome sequences. EST data are also a valuable source to reconstruct the different transcript isoforms that derive from the same gene structure as a consequence of AS, as indeed EST sequences are obtained by fragmenting mRNAs from the same gene. The most recent studies on alternative splice sites detection have revealed that this topic is a quite challenging computational problem, far from a solution. The main computational issues related to the problem of detecting alternative splicing are investigated in this paper, and we analyze algorithmic solutions for this problem. We first formalize an optimization problem related to the prediction of constitutive and alternative splicing sites from EST sequences, the Minimum Exons ESTs Factorization problem (in short, MEF), and show that it is Np-hard, even for restricted instances. This problem leads us to define sets of spliced EST, that is, a set of EST factorized into their constitutive exons with respect to a gene. Then we investigate the computational problem of predicting transcript isoforms from spliced EST sequences. We propose a graph algorithm for the problem that is linear in the number of predicted isoforms and size of the graph. Finally, an experimental analysis of the method is performed to assess the reliability of the predictions.


Asunto(s)
Empalme Alternativo , Biología Computacional/métodos , Etiquetas de Secuencia Expresada , Genes , Conformación de Ácido Nucleico , Algoritmos , Secuencia de Bases , Exones , Matemática , Datos de Secuencia Molecular
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA
...