Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 60
Filtrar
Mais filtros

Base de dados
Tipo de documento
Intervalo de ano de publicação
1.
Bioinformatics ; 40(2)2024 02 01.
Artigo em Inglês | MEDLINE | ID: mdl-38273664

RESUMO

MOTIVATION: Transcriptomic long-read (LR) sequencing is an increasingly cost-effective technology for probing various RNA features. Numerous tools have been developed to tackle various transcriptomic sequencing tasks (e.g. isoform and gene fusion detection). However, the lack of abundant gold-standard datasets hinders the benchmarking of such tools. Therefore, the simulation of LR sequencing is an important and practical alternative. While the existing LR simulators aim to imitate the sequencing machine noise and to target specific library protocols, they lack some important library preparation steps (e.g. PCR) and are difficult to modify to new and changing library preparation techniques (e.g. single-cell LRs). RESULTS: We present TKSM, a modular and scalable LR simulator, designed so that each RNA modification step is targeted explicitly by a specific module. This allows the user to assemble a simulation pipeline as a combination of TKSM modules to emulate a specific sequencing design. Additionally, the input/output of all the core modules of TKSM follows the same simple format (Molecule Description Format) allowing the user to easily extend TKSM with new modules targeting new library preparation steps. AVAILABILITY AND IMPLEMENTATION: TKSM is available as an open source software at https://github.com/vpc-ccg/tksm.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala , Software , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Simulação por Computador , RNA , Perfilação da Expressão Gênica
2.
Nucleic Acids Res ; 51(2): e11, 2023 01 25.
Artigo em Inglês | MEDLINE | ID: mdl-36478271

RESUMO

Alternative splicing (AS) is an important mechanism in the development of many cancers, as novel or aberrant AS patterns play an important role as an independent onco-driver. In addition, cancer-specific AS is potentially an effective target of personalized cancer therapeutics. However, detecting AS events remains a challenging task, especially if these AS events are novel. This is exacerbated by the fact that existing transcriptome annotation databases are far from being comprehensive, especially with regard to cancer-specific AS. Additionally, traditional sequencing technologies are severely limited by the short length of the generated reads, which rarely spans more than a single splice junction site. Given these challenges, transcriptomic long-read (LR) sequencing presents a promising potential for the detection and discovery of AS. We present Freddie, a computational annotation-independent isoform discovery and detection tool. Freddie takes as input transcriptomic LR sequencing of a sample alongside its genomic split alignment and computes a set of isoforms for the given sample. It then partitions the input reads into sets that can be processed independently and in parallel. For each partition, Freddie segments the genomic alignment of the reads into canonical exon segments. The goal of this segmentation is to be able to represent any potential isoform as a subset of these canonical exons. This segmentation is formulated as an optimization problem and is solved with a dynamic programming algorithm. Then, Freddie reconstructs the isoforms by jointly clustering and error-correcting the reads using the canonical segmentation as a succinct representation. The clustering and error-correcting step is formulated as an optimization problem-the Minimum Error Clustering into Isoforms (MErCi) problem-and is solved using integer linear programming (ILP). We compare the performance of Freddie on simulated datasets with other isoform detection tools with varying dependence on annotation databases. We show that Freddie outperforms the other tools in its accuracy, including those given the complete ground truth annotation. We also run Freddie on a transcriptomic LR dataset generated in-house from a prostate cancer cell line with a matched short-read RNA-seq dataset. Freddie results in isoforms with a higher short-read cross-validation rate than the other tested tools. Freddie is open source and available at https://github.com/vpc-ccg/freddie/.


Assuntos
Processamento Alternativo , Software , Transcriptoma , Perfilação da Expressão Gênica/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Isoformas de Proteínas/genética , Isoformas de Proteínas/metabolismo , RNA-Seq , Análise de Sequência de RNA/métodos
3.
Bioinformatics ; 39(39 Suppl 1): i288-i296, 2023 06 30.
Artigo em Inglês | MEDLINE | ID: mdl-37387134

RESUMO

MOTIVATION: The analysis of bacterial isolates to detect plasmids is important due to their role in the propagation of antimicrobial resistance. In short-read sequence assemblies, both plasmids and bacterial chromosomes are typically split into several contigs of various lengths, making identification of plasmids a challenging problem. In plasmid contig binning, the goal is to distinguish short-read assembly contigs based on their origin into plasmid and chromosomal contigs and subsequently sort plasmid contigs into bins, each bin corresponding to a single plasmid. Previous works on this problem consist of de novo approaches and reference-based approaches. De novo methods rely on contig features such as length, circularity, read coverage, or GC content. Reference-based approaches compare contigs to databases of known plasmids or plasmid markers from finished bacterial genomes. RESULTS: Recent developments suggest that leveraging information contained in the assembly graph improves the accuracy of plasmid binning. We present PlasBin-flow, a hybrid method that defines contig bins as subgraphs of the assembly graph. PlasBin-flow identifies such plasmid subgraphs through a mixed integer linear programming model that relies on the concept of network flow to account for sequencing coverage, while also accounting for the presence of plasmid genes and the GC content that often distinguishes plasmids from chromosomes. We demonstrate the performance of PlasBin-flow on a real dataset of bacterial samples. AVAILABILITY AND IMPLEMENTATION: https://github.com/cchauve/PlasBin-flow.


Assuntos
Algoritmos , Genoma Bacteriano , Plasmídeos/genética , Movimento Celular , Bases de Dados Factuais
4.
BMC Genomics ; 23(1): 129, 2022 Feb 14.
Artigo em Inglês | MEDLINE | ID: mdl-35164688

RESUMO

BACKGROUND: The advent of next-generation sequencing technologies empowered a wide variety of transcriptomics studies. A widely studied topic is gene fusion which is observed in many cancer types and suspected of having oncogenic properties. Gene fusions are the result of structural genomic events that bring two genes closely located and result in a fused transcript. This is different from fusion transcripts created during or after the transcription process. These chimeric transcripts are also known as read-through and trans-splicing transcripts. Gene fusion discovery with short reads is a well-studied problem, and many methods have been developed. But the sensitivity of these methods is limited by the technology, especially the short read length. Advances in long-read sequencing technologies allow the generation of long transcriptomics reads at a low cost. Transcriptomic long-read sequencing presents unique opportunities to overcome the shortcomings of short-read technologies for gene fusion detection while introducing new challenges. RESULTS: We present Genion, a sensitive and fast gene fusion detection method that can also detect read-through events. We compare Genion against a recently introduced long-read gene fusion discovery method, LongGF, both on simulated and real datasets. On simulated data, Genion accurately identifies the gene fusions and its clustering accuracy for detecting fusion reads is better than LongGF. Furthermore, our results on the breast cancer cell line MCF-7 show that Genion correctly identifies all the experimentally validated gene fusions. CONCLUSIONS: Genion is an accurate gene fusion caller. Genion is implemented in C++ and is available at https://github.com/vpc-ccg/genion .


Assuntos
Software , Transcriptoma , Fusão Gênica , Genômica , Sequenciamento de Nucleotídeos em Larga Escala
5.
Cytometry A ; 101(2): 177-184, 2022 02.
Artigo em Inglês | MEDLINE | ID: mdl-34559446

RESUMO

We introduce a new cell population score called SpecEnr (specific enrichment) and describe a method that discovers robust and accurate candidate biomarkers from flow cytometry data. Our approach identifies a new class of candidate biomarkers we define as driver cell populations, whose abundance is associated with a sample class (e.g., disease), but not as a result of a change in a related population. We show that the driver cell populations we find are also easily interpretable using a lattice-based visualization tool. Our method is implemented in the R package flowGraph, freely available on GitHub (github.com/aya49/flowGraph) and on BioConductor.


Assuntos
Software , Biomarcadores , Citometria de Fluxo/métodos
6.
BMC Biol ; 18(1): 1, 2020 01 02.
Artigo em Inglês | MEDLINE | ID: mdl-31898513

RESUMO

BACKGROUND: New sequencing technologies have lowered financial barriers to whole genome sequencing, but resulting assemblies are often fragmented and far from 'finished'. Updating multi-scaffold drafts to chromosome-level status can be achieved through experimental mapping or re-sequencing efforts. Avoiding the costs associated with such approaches, comparative genomic analysis of gene order conservation (synteny) to predict scaffold neighbours (adjacencies) offers a potentially useful complementary method for improving draft assemblies. RESULTS: We evaluated and employed 3 gene synteny-based methods applied to 21 Anopheles mosquito assemblies to produce consensus sets of scaffold adjacencies. For subsets of the assemblies, we integrated these with additional supporting data to confirm and complement the synteny-based adjacencies: 6 with physical mapping data that anchor scaffolds to chromosome locations, 13 with paired-end RNA sequencing (RNAseq) data, and 3 with new assemblies based on re-scaffolding or long-read data. Our combined analyses produced 20 new superscaffolded assemblies with improved contiguities: 7 for which assignments of non-anchored scaffolds to chromosome arms span more than 75% of the assemblies, and a further 7 with chromosome anchoring including an 88% anchored Anopheles arabiensis assembly and, respectively, 73% and 84% anchored assemblies with comprehensively updated cytogenetic photomaps for Anopheles funestus and Anopheles stephensi. CONCLUSIONS: Experimental data from probe mapping, RNAseq, or long-read technologies, where available, all contribute to successful upgrading of draft assemblies. Our evaluations show that gene synteny-based computational methods represent a valuable alternative or complementary approach. Our improved Anopheles reference assemblies highlight the utility of applying comparative genomics approaches to improve community genomic resources.


Assuntos
Anopheles/genética , Evolução Biológica , Cromossomos , Técnicas Genéticas/instrumentação , Genômica/métodos , Sintenia , Animais , Mapeamento Cromossômico
7.
Bioinformatics ; 35(21): 4436-4439, 2019 11 01.
Artigo em Inglês | MEDLINE | ID: mdl-31116364

RESUMO

MOTIVATION: Plasmids are ubiquituous in bacterial genomes, and have been shown to be involved in important evolutionary processes, in particular the acquisition of antimicrobial resistance. However separating chromosomal contigs from plasmid contigs and assembling the later is a challenging problem. RESULTS: We introduce HyAsP, a tool that identifies, bins and assembles plasmid contigs following a hybrid approach based on a database of known plasmids genes and a greedy assembly algorithm. We test HyAsP on a large sample of bacterial datasets and observe that it generally outperforms other tools. AVAILABILITY AND IMPLEMENTATION: https://github.com/cchauve/HyAsP. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Software , Algoritmos , Genoma Bacteriano , Plasmídeos , Análise de Sequência de DNA
8.
Bioinformatics ; 35(11): 1829-1836, 2019 06 01.
Artigo em Inglês | MEDLINE | ID: mdl-30351359

RESUMO

MOTIVATION: Next-Generation Sequencing has led to the availability of massive genomic datasets whose processing raises many challenges, including the handling of sequencing errors. This is especially pertinent in cancer genomics, e.g. for detecting low allele frequency variations from circulating tumor DNA. Barcode tagging of DNA molecules with unique molecular identifiers (UMI) attempts to mitigate sequencing errors; UMI tagged molecules are polymerase chain reaction (PCR) amplified, and the PCR copies of UMI tagged molecules are sequenced independently. However, the PCR and sequencing steps can generate errors in the sequenced reads that can be located in the barcode and/or the DNA sequence. Analyzing UMI tagged sequencing data requires an initial clustering step, with the aim of grouping reads sequenced from PCR duplicates of the same UMI tagged molecule into a single cluster, and the size of the current datasets requires this clustering process to be resource-efficient. RESULTS: We introduce Calib, a computational tool that clusters paired-end reads from UMI tagged sequencing experiments generated by substitution-error-dominant sequencing platforms such as Illumina. Calib clusters are defined as connected components of a graph whose edges are defined in terms of both barcode similarity and read sequence similarity. The graph is constructed efficiently using locality sensitive hashing and MinHashing techniques. Calib's default clustering parameters are optimized empirically, for different UMI and read lengths, using a simulation module that is packaged with Calib. Compared to other tools, Calib has the best accuracy on simulated data, while maintaining reasonable runtime and memory footprint. On a real dataset, Calib runs with far less resources than alignment-based methods, and its clusters reduce the number of tentative false positive in downstream variation calling. AVAILABILITY AND IMPLEMENTATION: Calib is implemented in C++ and its simulation module is implemented in Python. Calib is available at https://github.com/vpc-ccg/calib. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala , Software , Algoritmos , Análise por Conglomerados , DNA , Análise de Sequência de DNA
9.
J Math Biol ; 80(5): 1353-1388, 2020 04.
Artigo em Inglês | MEDLINE | ID: mdl-32060618

RESUMO

Given a set of species whose evolution is represented by a species tree, a gene family is a group of genes having evolved from a single ancestral gene. A gene family evolves along the branches of a species tree through various mechanisms, including-but not limited to-speciation ([Formula: see text]), gene duplication ([Formula: see text]), gene loss ([Formula: see text]), and horizontal gene transfer ([Formula: see text]). The reconstruction of a gene tree representing the evolution of a gene family constrained by a species tree is an important problem in phylogenomics. However, unlike in the multispecies coalescent evolutionary model that considers only speciation and incomplete lineage sorting events, very little is known about the search space for gene family histories accounting for gene duplication, gene loss and horizontal gene transfer (the [Formula: see text]-model). In this work, we introduce the notion of evolutionary histories defined as a binary ordered rooted tree describing the evolution of a gene family, constrained by a species tree in the [Formula: see text]-model. We provide formal grammars describing the set of all evolutionary histories that are compatible with a given species tree, whether it is ranked or unranked. These grammars allow us, using either analytic combinatorics or dynamic programming, to efficiently compute the number of histories of a given size, and also to generate random histories of a given size under the uniform distribution. We apply these tools to obtain exact asymptotics for the number of gene family histories for two species trees, the rooted caterpillar and complete binary tree, as well as estimates of the range of the exponential growth factor of the number of histories for random species trees of size up to 25. Our results show that including horizontal gene transfers induce a dramatic increase of the number of evolutionary histories. We also show that, within ranked species trees, the number of evolutionary histories in the [Formula: see text]-model is almost independent of the species tree topology. These results establish firm foundations for the development of ensemble methods for the prediction of reconciliations.


Assuntos
Evolução Molecular , Modelos Genéticos , Algoritmos , Biologia Computacional , Simulação por Computador , Deleção de Genes , Duplicação Gênica , Transferência Genética Horizontal , Especiação Genética , Conceitos Matemáticos , Família Multigênica , Filogenia
10.
BMC Bioinformatics ; 20(Suppl 20): 637, 2019 Dec 17.
Artigo em Inglês | MEDLINE | ID: mdl-31842753

RESUMO

BACKGROUND: Bacterial pathogens exhibit an impressive amount of genomic diversity. This diversity can be informative of evolutionary adaptations, host-pathogen interactions, and disease transmission patterns. However, capturing this diversity directly from biological samples is challenging. RESULTS: We introduce a framework for understanding the within-host diversity of a pathogen using multi-locus sequence types (MLST) from whole-genome sequencing (WGS) data. Our approach consists of two stages. First we process each sample individually by assigning it, for each locus in the MLST scheme, a set of alleles and a proportion for each allele. Next, we associate to each sample a set of strain types using the alleles and the strain proportions obtained in the first step. We achieve this by using the smallest possible number of previously unobserved strains across all samples, while using those unobserved strains which are as close to the observed ones as possible, at the same time respecting the allele proportions as closely as possible. We solve both problems using mixed integer linear programming (MILP). Our method performs accurately on simulated data and generates results on a real data set of Borrelia burgdorferi genomes suggesting a high level of diversity for this pathogen. CONCLUSIONS: Our approach can apply to any bacterial pathogen with an MLST scheme, even though we developed it with Borrelia burgdorferi, the etiological agent of Lyme disease, in mind. Our work paves the way for robust strain typing in the presence of within-host heterogeneity, overcoming an essential challenge currently not addressed by any existing methodology for pathogen genomics.


Assuntos
Variação Genética , Interações Hospedeiro-Patógeno/genética , Tipagem de Sequências Multilocus , Alelos , Borrelia burgdorferi/genética , Simulação por Computador , Bases de Dados Genéticas , Loci Gênicos , Modelos Biológicos
11.
Brief Bioinform ; 18(2): 306-311, 2017 03 01.
Artigo em Inglês | MEDLINE | ID: mdl-26984616

RESUMO

BRaliBase is a widely used benchmark for assessing the accuracy of RNA secondary structure alignment methods. In most case studies based on the BRaliBase benchmark, one can observe a puzzling drop in accuracy in the 40-60% sequence identity range, the so-called 'BRaliBase Dent'. In this article, we show this dent is owing to a bias in the composition of the BRaliBase benchmark, namely the inclusion of a disproportionate number of transfer RNAs, which exhibit a conserved secondary structure. Our analysis, aside of its interest regarding the specific case of the BRaliBase benchmark, also raises important questions regarding the design and use of benchmarks in computational biology.


Assuntos
Benchmarking , Algoritmos , Biologia Computacional , Conformação de Ácido Nucleico , Alinhamento de Sequência , Análise de Sequência de RNA , Software
12.
Bioinformatics ; 34(13): 2245-2253, 2018 07 01.
Artigo em Inglês | MEDLINE | ID: mdl-29462241

RESUMO

Motivation: Identification of cell populations in flow cytometry is a critical part of the analysis and lays the groundwork for many applications and research discovery. The current paradigm of manual analysis is time consuming and subjective. A common goal of users is to replace manual analysis with automated methods that replicate their results. Supervised tools provide the best performance in such a use case, however they require fine parameterization to obtain the best results. Hence, there is a strong need for methods that are fast to setup, accurate and interpretable. Results: flowLearn is a semi-supervised approach for the quality-checked identification of cell populations. Using a very small number of manually gated samples, through density alignments it is able to predict gates on other samples with high accuracy and speed. On two state-of-the-art datasets, our tool achieves median(F1)-measures exceeding 0.99 for 31%, and 0.90 for 80% of all analyzed populations. Furthermore, users can directly interpret and adjust automated gates on new sample files to iteratively improve the initial training. Availability and implementation: FlowLearn is available as an R package on https://github.com/mlux86/flowLearn. Evaluation data is publicly available online. Details can be found in the Supplementary Material. Supplementary information: Supplementary data are available at Bioinformatics online.


Assuntos
Biologia Computacional/métodos , Citometria de Fluxo/métodos , Software
13.
BMC Genomics ; 19(Suppl 2): 96, 2018 May 09.
Artigo em Inglês | MEDLINE | ID: mdl-29764366

RESUMO

BACKGROUND: Genomes rearrangements carry valuable information for phylogenetic inference or the elucidation of molecular mechanisms of adaptation. However, the detection of genome rearrangements is often hampered by current deficiencies in data and methods: Genomes obtained from short sequence reads have generally very fragmented assemblies, and comparing multiple gene orders generally leads to computationally intractable algorithmic questions. RESULTS: We present a computational method, ADSEQ, which, by combining ancestral gene order reconstruction, comparative scaffolding and de novo scaffolding methods, overcomes these two caveats. ADSEQ provides simultaneously improved assemblies and ancestral genomes, with statistical supports on all local features. Compared to previous comparative methods, it runs in polynomial time, it samples solutions in a probabilistic space, and it can handle a significantly larger gene complement from the considered extant genomes, with complex histories including gene duplications and losses. We use ADSEQ to provide improved assemblies and a genome history made of duplications, losses, gene translocations, rearrangements, of 18 complete Anopheles genomes, including several important malaria vectors. We also provide additional support for a differentiated mode of evolution of the sex chromosome and of the autosomes in these mosquito genomes. CONCLUSIONS: We demonstrate the method's ability to improve extant assemblies accurately through a procedure simulating realistic assembly fragmentation. We study a debated issue regarding the phylogeny of the Gambiae complex group of Anopheles genomes in the light of the evolution of chromosomal rearrangements, suggesting that the phylogenetic signal they carry can differ from the phylogenetic signal carried by gene sequences, more prone to introgression.


Assuntos
Anopheles/genética , Biologia Computacional/métodos , Rearranjo Gênico , Mosquitos Vetores/genética , Algoritmos , Animais , Evolução Molecular , Ordem dos Genes , Genoma de Inseto , Filogenia , Cromossomos Sexuais/genética
14.
Bioinformatics ; 33(22): 3652-3654, 2017 Nov 15.
Artigo em Inglês | MEDLINE | ID: mdl-29036421

RESUMO

MOTIVATION: Third-generation sequencing (TGS) platforms that generate long reads, such as PacBio and Oxford Nanopore technologies, have had a dramatic impact on genomics research. However, despite recent improvements, TGS reads suffer from high-error rates and the development of read correction methods is an active field of research. This motivates the need to develop tools that can evaluate the accuracy of noisy long reads correction tools. RESULTS: We introduce LRCstats, a tool that measures the accuracy of long reads correction tools. LRCstats takes advantage of long reads simulators that provide each simulated read with an alignment to the reference genome segment they originate from, and does not rely on a step of mapping corrected reads onto the reference genome. This allows for the measurement of the accuracy of the correction while being consistent with the actual errors introduced in the simulation process used to generate noisy reads. We illustrate the usefulness of LRCstats by analyzing the accuracy of four hybrid correction methods for PacBio long reads over three datasets. AVAILABILITY AND IMPLEMENTATION: https://github.com/cchauve/lrcstats. CONTACT: laseanl@sfu.ca or cedric.chauve@sfu.ca. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos , Software , Sequenciamento de Nucleotídeos em Larga Escala/normas
15.
Bioinformatics ; 32(13): 2056-8, 2016 07 01.
Artigo em Inglês | MEDLINE | ID: mdl-27153713

RESUMO

UNLABELLED: : A gene tree-species tree reconciliation explains the evolution of a gene tree within the species tree given a model of gene-family evolution. We describe ecceTERA, a program that implements a generic parsimony reconciliation algorithm, which accounts for gene duplication, loss and transfer (DTL) as well as speciation, involving sampled and unsampled lineages, within undated, fully dated or partially dated species trees. The ecceTERA reconciliation model and algorithm generalize or improve upon most published DTL parsimony algorithms for binary species trees and binary gene trees. Moreover, ecceTERA can estimate accurate species-tree aware gene trees using amalgamation. AVAILABILITY AND IMPLEMENTATION: ecceTERA is freely available under http://mbb.univ-montp2.fr/MBB/download_sources/16__ecceTERA and can be run online at http://mbb.univ-montp2.fr/MBB/subsection/softExec.php?soft=eccetera CONTACT: celine.scornavacca@umontpellier.fr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Biologia Computacional/métodos , Evolução Molecular , Duplicação Gênica , Família Multigênica , Filogenia , Algoritmos , Modelos Teóricos
16.
Bioinformatics ; 32(17): i545-i551, 2016 09 01.
Artigo em Inglês | MEDLINE | ID: mdl-27587673

RESUMO

MOTIVATION: Second generation sequencing technologies paved the way to an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. The recent developments in long reads sequencing methods offer a promising way to address this issue. However, so far long reads are characterized by a high error rate, and assembling from long reads require a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads. RESULTS: We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read and extending corrected regions by local assembly of unmapped mates of mapped short reads. Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods. AVAILABILITY AND IMPLEMENTATION: The source code of CoLoRMap is freely available for non-commercial use at https://github.com/sfu-compbio/colormap CONTACT: ehaghshe@sfu.ca or cedric.chauve@sfu.ca SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Análise de Sequência de DNA , Biologia Computacional , Genoma , Linguagens de Programação , Alinhamento de Sequência , Software
17.
BMC Bioinformatics ; 16 Suppl 19: S6, 2015.
Artigo em Inglês | MEDLINE | ID: mdl-26696141

RESUMO

CONTEXT: The reconstruction of evolutionary scenarios for whole genomes in terms of genome rearrangements is a fundamental problem in evolutionary and comparative genomics. The DeCo algorithm, recently introduced by Bérard et al., computes parsimonious evolutionary scenarios for gene adjacencies, from pairs of reconciled gene trees. However, as for many combinatorial optimization algorithms, there can exist many co-optimal, or slightly sub-optimal, evolutionary scenarios that deserve to be considered. CONTRIBUTION: We extend the DeCo algorithm to sample evolutionary scenarios from the whole solution space under the Boltzmann distribution, and also to compute Boltzmann probabilities for specific ancestral adjacencies. RESULTS: We apply our algorithms to a dataset of mammalian gene trees and adjacencies, and observe a significant reduction of the number of syntenic conflicts observed in the resulting ancestral gene adjacencies.


Assuntos
Algoritmos , Biologia Computacional/métodos , Evolução Molecular , Filogenia , Animais , Bases de Dados Genéticas , Genoma , Mamíferos/genética , Probabilidade , Temperatura
18.
BMC Genomics ; 16 Suppl 10: S11, 2015.
Artigo em Inglês | MEDLINE | ID: mdl-26450761

RESUMO

We exploit the methodological similarity between ancestral genome reconstruction and extant genome scaffolding. We present a method, called ARt-DeCo that constructs neighborhood relationships between genes or contigs, in both ancestral and extant genomes, in a phylogenetic context. It is able to handle dozens of complete genomes, including genes with complex histories, by using gene phylogenies reconciled with a species tree, that is, annotated with speciation, duplication and loss events. Reconstructed ancestral or extant synteny comes with a support computed from an exhaustive exploration of the solution space. We compare our method with a previously published one that follows the same goal on a small number of genomes with universal unicopy genes. Then we test it on the whole Ensembl database, by proposing partial ancestral genome structures, as well as a more complete scaffolding for many partially assembled genomes on 69 eukaryote species. We carefully analyze a couple of extant adjacencies proposed by our method, and show that they are indeed real links in the extant genomes, that were missing in the current assembly. On a reduced data set of 39 eutherian mammals, we estimate the precision and sensitivity of ARt-DeCo by simulating a fragmentation in some well assembled genomes, and measure how many adjacencies are recovered. We find a very high precision, while the sensitivity depends on the quality of the data and on the proximity of closely related genomes.


Assuntos
Evolução Molecular , Genoma , Mamíferos/genética , Filogenia , Algoritmos , Animais , Duplicação Gênica , Sintenia/genética
19.
Bioinformatics ; 30(17): i519-26, 2014 Sep 01.
Artigo em Inglês | MEDLINE | ID: mdl-25161242

RESUMO

MOTIVATION: Large-scale methods for inferring gene trees are error-prone. Correcting gene trees for weakly supported features often results in non-binary trees, i.e. trees with polytomies, thus raising the natural question of refining such polytomies into binary trees. A feature pointing toward potential errors in gene trees are duplications that are not supported by the presence of multiple gene copies. RESULTS: We introduce the problem of refining polytomies in a gene tree while minimizing the number of created non-apparent duplications in the resulting tree. We show that this problem can be described as a graph-theoretical optimization problem. We provide a bounded heuristic with guaranteed optimality for well-characterized instances. We apply our algorithm to a set of ray-finned fish gene trees from the Ensembl database to illustrate its ability to correct dubious duplications. AVAILABILITY AND IMPLEMENTATION: The C++ source code for the algorithms and simulations described in the article are available at http://www-ens.iro.umontreal.ca/~lafonman/software.php. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Genes , Filogenia , Algoritmos , Animais , Peixes/genética
20.
Bioinformatics ; 29(23): 2987-94, 2013 Dec 01.
Artigo em Inglês | MEDLINE | ID: mdl-24068034

RESUMO

MOTIVATIONS: Recent progress in ancient DNA sequencing technologies and protocols has lead to the sequencing of whole ancient bacterial genomes, as illustrated by the recent sequence of the Yersinia pestis strain that caused the Black Death pandemic. However, sequencing ancient genomes raises specific problems, because of the decay and fragmentation of ancient DNA among others, making the scaffolding of ancient contigs challenging. RESULTS: We show that computational paleogenomics methods aimed at reconstructing the organization of ancestral genomes from the comparison of extant genomes can be adapted to correct, order and orient ancient bacterial contigs. We describe the method FPSAC (fast phylogenetic scaffolding of ancient contigs) and apply it on a set of 2134 ancient contigs assembled from the recently sequenced Black Death agent genome. We obtain a unique scaffold for the whole chromosome of this ancient genome that allows to gain precise insights into the structural evolution of the Yersinia clade.


Assuntos
Mapeamento de Sequências Contíguas , Genoma Bacteriano , Filogenia , Análise de Sequência de DNA/métodos , Software , Yersinia pestis/genética , Simulação por Computador , Evolução Molecular , Peste/microbiologia , Yersinia pestis/classificação
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA