Search | Virtual Health Library

1.

AGO, a Framework for the Reconstruction of Ancestral Syntenies and Gene Orders.

Cribbie, Evan P; Doerr, Daniel; Chauve, Cedric.

Methods Mol Biol ; 2802: 247-265, 2024.

Article in English | MEDLINE | ID: mdl-38819563

ABSTRACT

Reconstructing ancestral gene orders from the genome data of extant species is an important problem in comparative and evolutionary genomics. In a phylogenomics setting that accounts for gene family evolution through gene duplication and gene loss, the reconstruction of ancestral gene orders involves several steps, including multiple sequence alignment, the inference of reconciled gene trees, and the inference of ancestral syntenies and gene adjacencies. For each step of such a process, several methods can be used, implemented in a growing corpus of often parameterized tools; in practice, interfacing such tools into an ancestral gene order reconstruction pipeline is far from trivial. This chapter introduces AGO, a Python-based framework for creating ancestral gene order reconstruction pipelines that interface and parameterize different bioinformatics tools. The authors illustrate the features of AGO by reconstructing ancestral gene orders for the X chromosome of three ancestral Anopheles species using three different pipelines. AGO is freely available at https://github.com/cchauve/AGO-pipeline.

Subjects

Molecular Evolution , Gene Order , Genomics , Phylogeny , Software , Animals , Genomics/methods , Computational Biology/methods , Synteny/genetics , Anopheles/genetics , X Chromosome/genetics , Sequence Alignment/methods

2.

TKSM: highly modular, user-customizable, and scalable transcriptomic sequencing long-read simulator.

Karaoglanoglu, Fatih; Orabi, Baraa; Flannigan, Ryan; Chauve, Cedric; Hach, Faraz.

Bioinformatics ; 40(2), 2024 02 01.

Article in English | MEDLINE | ID: mdl-38273664

ABSTRACT

MOTIVATION: Transcriptomic long-read (LR) sequencing is an increasingly cost-effective technology for probing various RNA features. Numerous tools have been developed to tackle various transcriptomic sequencing tasks (e.g. isoform and gene fusion detection). However, the lack of abundant gold-standard datasets hinders the benchmarking of such tools. Therefore, the simulation of LR sequencing is an important and practical alternative. While the existing LR simulators aim to imitate the sequencing machine noise and to target specific library protocols, they lack some important library preparation steps (e.g. PCR) and are difficult to modify to new and changing library preparation techniques (e.g. single-cell LRs). RESULTS: We present TKSM, a modular and scalable LR simulator, designed so that each RNA modification step is targeted explicitly by a specific module. This allows the user to assemble a simulation pipeline as a combination of TKSM modules to emulate a specific sequencing design. Additionally, the input/output of all the core modules of TKSM follows the same simple format (Molecule Description Format) allowing the user to easily extend TKSM with new modules targeting new library preparation steps. AVAILABILITY AND IMPLEMENTATION: TKSM is available as an open source software at https://github.com/vpc-ccg/tksm.

Subjects

High-Throughput Nucleotide Sequencing , Software , High-Throughput Nucleotide Sequencing/methods , Computer Simulation , RNA , Gene Expression Profiling

3.

plASgraph2: using graph neural networks to detect plasmid contigs from an assembly graph.

Sielemann, Janik; Sielemann, Katharina; Brejová, Brona; Vinar, Tomás; Chauve, Cedric.

Front Microbiol ; 14: 1267695, 2023.

Article in English | MEDLINE | ID: mdl-37869681

ABSTRACT

Identification of plasmids from sequencing data is an important and challenging problem related to antimicrobial resistance spread and other One-Health issues. We provide a new architecture for identifying plasmid contigs in fragmented genome assemblies built from short-read data. We employ graph neural networks (GNNs) and the assembly graph to propagate the information from nearby nodes, which leads to more accurate classification, especially for short contigs that are difficult to classify based on sequence features or database searches alone. We trained plASgraph2 on a data set of samples from the ESKAPEE group of pathogens. plASgraph2 either outperforms or performs on par with a wide range of state-of-the-art methods on testing sets of independent ESKAPEE samples and samples from related pathogens. On one hand, our study provides a new accurate and easy to use tool for contig classification in bacterial isolates; on the other hand, it serves as a proof-of-concept for the use of GNNs in genomics. Our software is available at https://github.com/cchauve/plasgraph2 and the training and testing data sets are available at https://github.com/fmfi-compbio/plasgraph2-datasets.
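The idea of propagating information from nearby nodes of the assembly graph can be illustrated with a toy smoothing step. This is only a sketch under stated assumptions, not the plASgraph2 architecture: the function name `propagate`, the fixed `self_weight`, and the per-contig scores are hypothetical, and a real GNN learns its aggregation weights instead of averaging with a fixed one. Every edge endpoint is assumed to have a score.

```python
from collections import defaultdict


def propagate(scores, edges, rounds=1, self_weight=0.5):
    """Blend each contig's score with the mean score of its graph neighbours.

    scores: dict contig -> initial plasmid score in [0, 1] (hypothetical input)
    edges:  iterable of (contig, contig) links from the assembly graph
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    current = dict(scores)
    for _ in range(rounds):
        updated = {}
        for node, own in current.items():
            adj = neighbours.get(node, ())
            if adj:
                mean_adj = sum(current[n] for n in adj) / len(adj)
                updated[node] = self_weight * own + (1 - self_weight) * mean_adj
            else:
                # A contig with no links keeps its own score.
                updated[node] = own
        current = updated
    return current


# A short, uninformative contig sits between two long, confidently scored ones.
scores = {"long_a": 0.9, "short_b": 0.5, "long_c": 0.8, "isolated_d": 0.1}
edges = [("long_a", "short_b"), ("short_b", "long_c")]
smoothed = propagate(scores, edges)
```

In this toy run the short contig's score moves toward that of its two long neighbours, which is the intuition behind using graph context for contigs that are hard to classify from sequence alone.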

4.

PlasBin-flow: a flow-based MILP algorithm for plasmid contigs binning.

Mane, Aniket; Faizrahnemoon, Mahsa; Vinar, Tomás; Brejová, Brona; Chauve, Cedric.

Bioinformatics ; 39(39 Suppl 1): i288-i296, 2023 06 30.

Article in English | MEDLINE | ID: mdl-37387134

ABSTRACT

MOTIVATION: The analysis of bacterial isolates to detect plasmids is important due to their role in the propagation of antimicrobial resistance. In short-read sequence assemblies, both plasmids and bacterial chromosomes are typically split into several contigs of various lengths, making identification of plasmids a challenging problem. In plasmid contig binning, the goal is to distinguish short-read assembly contigs based on their origin into plasmid and chromosomal contigs and subsequently sort plasmid contigs into bins, each bin corresponding to a single plasmid. Previous works on this problem consist of de novo approaches and reference-based approaches. De novo methods rely on contig features such as length, circularity, read coverage, or GC content. Reference-based approaches compare contigs to databases of known plasmids or plasmid markers from finished bacterial genomes. RESULTS: Recent developments suggest that leveraging information contained in the assembly graph improves the accuracy of plasmid binning. We present PlasBin-flow, a hybrid method that defines contig bins as subgraphs of the assembly graph. PlasBin-flow identifies such plasmid subgraphs through a mixed integer linear programming model that relies on the concept of network flow to account for sequencing coverage, while also accounting for the presence of plasmid genes and the GC content that often distinguishes plasmids from chromosomes. We demonstrate the performance of PlasBin-flow on a real dataset of bacterial samples. AVAILABILITY AND IMPLEMENTATION: https://github.com/cchauve/PlasBin-flow.

Subjects

Algorithms , Bacterial Genome , Plasmids/genetics , Cell Movement , Factual Databases

5.

Freddie: annotation-independent detection and discovery of transcriptomic alternative splicing isoforms using long-read sequencing.

Orabi, Baraa; Xie, Ning; McConeghy, Brian; Dong, Xuesen; Chauve, Cedric; Hach, Faraz.

Nucleic Acids Res ; 51(2): e11, 2023 01 25.

Article in English | MEDLINE | ID: mdl-36478271

ABSTRACT

Alternative splicing (AS) is an important mechanism in the development of many cancers, as novel or aberrant AS patterns play an important role as an independent onco-driver. In addition, cancer-specific AS is potentially an effective target of personalized cancer therapeutics. However, detecting AS events remains a challenging task, especially if these AS events are novel. This is exacerbated by the fact that existing transcriptome annotation databases are far from being comprehensive, especially with regard to cancer-specific AS. Additionally, traditional sequencing technologies are severely limited by the short length of the generated reads, which rarely spans more than a single splice junction site. Given these challenges, transcriptomic long-read (LR) sequencing presents a promising potential for the detection and discovery of AS. We present Freddie, a computational annotation-independent isoform discovery and detection tool. Freddie takes as input transcriptomic LR sequencing of a sample alongside its genomic split alignment and computes a set of isoforms for the given sample. It then partitions the input reads into sets that can be processed independently and in parallel. For each partition, Freddie segments the genomic alignment of the reads into canonical exon segments. The goal of this segmentation is to be able to represent any potential isoform as a subset of these canonical exons. This segmentation is formulated as an optimization problem and is solved with a dynamic programming algorithm. Then, Freddie reconstructs the isoforms by jointly clustering and error-correcting the reads using the canonical segmentation as a succinct representation. The clustering and error-correcting step is formulated as an optimization problem-the Minimum Error Clustering into Isoforms (MErCi) problem-and is solved using integer linear programming (ILP). 
We compare the performance of Freddie on simulated datasets with other isoform detection tools with varying dependence on annotation databases. We show that Freddie outperforms the other tools in its accuracy, including those given the complete ground truth annotation. We also run Freddie on a transcriptomic LR dataset generated in-house from a prostate cancer cell line with a matched short-read RNA-seq dataset. Freddie results in isoforms with a higher short-read cross-validation rate than the other tested tools. Freddie is open source and available at https://github.com/vpc-ccg/freddie/.

Subjects

Alternative Splicing , Software , Transcriptome , Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Protein Isoforms/genetics , Protein Isoforms/metabolism , RNA-Seq , RNA Sequence Analysis/methods

6.

Fast and accurate matching of cellular barcodes across short-reads and long-reads of single-cell RNA-seq experiments.

Ebrahimi, Ghazal; Orabi, Baraa; Robinson, Meghan; Chauve, Cedric; Flannigan, Ryan; Hach, Faraz.

iScience ; 25(7): 104530, 2022 Jul 15.

Article in English | MEDLINE | ID: mdl-35747387

ABSTRACT

Single-cell RNA sequencing allows for characterizing the gene expression landscape at the cell type level. However, because of its use of short-reads, it is severely limited in detecting full-length features of transcripts such as alternative splicing. New library preparation techniques attempt to extend single-cell sequencing by utilizing both long-reads and short-reads. These techniques split the library material, after it is tagged with cellular barcodes, into two pools: one for short-read sequencing and one for long-read sequencing. However, the challenge of utilizing these techniques is that they require matching the cellular barcodes sequenced by the erroneous long-reads to the cellular barcodes detected by the short-reads. To overcome this challenge, we introduce scTagger, a computational method to match cellular barcode data from long-reads and short-reads. We tested scTagger against another state-of-the-art tool on both real and simulated datasets, and we demonstrate that scTagger has both significantly better accuracy and time efficiency.
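The barcode matching task described above can be sketched as a nearest-neighbour search under edit distance. This is a brute-force illustration only and should not be read as scTagger's algorithm; the function names, the `max_dist` threshold, and the rule of rejecting ties are assumptions made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def match_barcode(lr_barcode, sr_barcodes, max_dist=2):
    """Return the unique closest short-read barcode within max_dist.

    Returns None when no candidate is close enough or when the best
    distance is shared by several candidates (ambiguous match).
    """
    best, best_dist, tie = None, max_dist + 1, False
    for cand in sr_barcodes:
        d = edit_distance(lr_barcode, cand)
        if d < best_dist:
            best, best_dist, tie = cand, d, False
        elif d == best_dist and d <= max_dist:
            tie = True
    return None if tie else best


short_read_barcodes = ["ACGTACGT", "TTTTCCCC", "GGGGAAAA"]
hit = match_barcode("ACGTTCGT", short_read_barcodes)        # one substitution away
ambiguous = match_barcode("AAAA", ["AAAT", "AAAC"], max_dist=1)
```

Comparing every long-read barcode to every short-read barcode is quadratic, so a practical tool would need an index or filtering step; the sketch only shows the matching criterion.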

7.

Genion, an accurate tool to detect gene fusion from long transcriptomics reads.

Karaoglanoglu, Fatih; Chauve, Cedric; Hach, Faraz.

BMC Genomics ; 23(1): 129, 2022 Feb 14.

Article in English | MEDLINE | ID: mdl-35164688

ABSTRACT

BACKGROUND: The advent of next-generation sequencing technologies empowered a wide variety of transcriptomics studies. A widely studied topic is gene fusion which is observed in many cancer types and suspected of having oncogenic properties. Gene fusions are the result of structural genomic events that bring two genes closely located and result in a fused transcript. This is different from fusion transcripts created during or after the transcription process. These chimeric transcripts are also known as read-through and trans-splicing transcripts. Gene fusion discovery with short reads is a well-studied problem, and many methods have been developed. But the sensitivity of these methods is limited by the technology, especially the short read length. Advances in long-read sequencing technologies allow the generation of long transcriptomics reads at a low cost. Transcriptomic long-read sequencing presents unique opportunities to overcome the shortcomings of short-read technologies for gene fusion detection while introducing new challenges. RESULTS: We present Genion, a sensitive and fast gene fusion detection method that can also detect read-through events. We compare Genion against a recently introduced long-read gene fusion discovery method, LongGF, both on simulated and real datasets. On simulated data, Genion accurately identifies the gene fusions and its clustering accuracy for detecting fusion reads is better than LongGF. Furthermore, our results on the breast cancer cell line MCF-7 show that Genion correctly identifies all the experimentally validated gene fusions. CONCLUSIONS: Genion is an accurate gene fusion caller. Genion is implemented in C++ and is available at https://github.com/vpc-ccg/genion .

Subjects

Software , Transcriptome , Gene Fusion , Genomics , High-Throughput Nucleotide Sequencing

8.

Automated identification of maximal differential cell populations in flow cytometry data.

Yue, Alice; Chauve, Cedric; Libbrecht, Maxwell W; Brinkman, Ryan R.

Cytometry A ; 101(2): 177-184, 2022 02.

Article in English | MEDLINE | ID: mdl-34559446

ABSTRACT

We introduce a new cell population score called SpecEnr (specific enrichment) and describe a method that discovers robust and accurate candidate biomarkers from flow cytometry data. Our approach identifies a new class of candidate biomarkers we define as driver cell populations, whose abundance is associated with a sample class (e.g., disease), but not as a result of a change in a related population. We show that the driver cell populations we find are also easily interpretable using a lattice-based visualization tool. Our method is implemented in the R package flowGraph, freely available on GitHub (github.com/aya49/flowGraph) and on BioConductor.

Subjects

Software , Biomarkers , Flow Cytometry/methods

9.

Small parsimony for natural genomes in the DCJ-indel model.

Doerr, Daniel; Chauve, Cedric.

J Bioinform Comput Biol ; 19(6): 2140009, 2021 12.

Article in English | MEDLINE | ID: mdl-34806948

ABSTRACT

The Small Parsimony Problem (SPP) aims at finding the gene orders at internal nodes of a given phylogenetic tree such that the overall genome rearrangement distance along the tree branches is minimized. This problem is intractable in most genome rearrangement models, especially when gene duplication and loss are considered. In this work, we describe an Integer Linear Program algorithm to solve the SPP for natural genomes, i.e. genomes that contain conserved, unique, and duplicated markers. The evolutionary model that we consider is the DCJ-indel model that includes the Double-Cut and Join rearrangement operation and the insertion and deletion of genome segments. We evaluate our algorithm on simulated data and show that it reconstructs ancestral gene orders efficiently and accurately in a comprehensive evolutionary model.

Subjects

Genome , Genetic Models , Algorithms , Biological Evolution , Molecular Evolution , Gene Rearrangement , Phylogeny

10.

HASLR: Fast Hybrid Assembly of Long Reads.

Haghshenas, Ehsan; Asghari, Hossein; Stoye, Jens; Chauve, Cedric; Hach, Faraz.

iScience ; 23(8): 101389, 2020 Aug 21.

Article in English | MEDLINE | ID: mdl-32781410

ABSTRACT

Third-generation sequencing technologies from companies such as Oxford Nanopore and Pacific Biosciences have paved the way for building more contiguous and potentially gap-free assemblies. The larger effective length of their reads has provided a means to overcome the challenges of short to mid-range repeats. Currently, accurate long read assemblers are computationally expensive, whereas faster methods are not as accurate. Moreover, despite recent advances in third-generation sequencing, researchers still tend to generate accurate short reads for many of the analysis tasks. Here, we present HASLR, a hybrid assembler that uses error-prone long reads together with high-quality short reads to efficiently generate accurate genome assemblies. Our experiments show that HASLR is not only the fastest assembler but also the one with the lowest number of misassemblies on most of the samples, while being on par with other assemblers in terms of contiguity and accuracy.

11.

The distance and median problems in the single-cut-or-join model with single-gene duplications.

Mane, Aniket C; Lafond, Manuel; Feijao, Pedro C; Chauve, Cedric.

Algorithms Mol Biol ; 15: 8, 2020.

Article in English | MEDLINE | ID: mdl-32391071

ABSTRACT

BACKGROUND: In the field of genome rearrangement algorithms, models accounting for gene duplication often lead to hard problems. For example, while computing the pairwise distance is tractable in most duplication-free models, the problem is NP-complete for most extensions of these models accounting for duplicated genes. Moreover, problems involving more than two genomes, such as the genome median and the Small Parsimony problem, are intractable for most duplication-free models, with some exceptions, for example the Single-Cut-or-Join (SCJ) model. RESULTS: We introduce a variant of the SCJ distance that accounts for duplicated genes, in the context of directed evolution from an ancestral genome to a descendant genome where orthology relations between ancestral genes and their descendants are known. Our model includes two duplication mechanisms: single-gene tandem duplication and the creation of single-gene circular chromosomes. We prove that in this model, computing the directed distance and a parsimonious evolutionary scenario in terms of SCJ and single-gene duplication events can be done in linear time. We also show that the directed median problem is tractable for this distance, while the rooted median problem, where we assume that one of the given genomes is ancestral to the median, is NP-complete. We also describe an Integer Linear Program for solving this problem. We evaluate the directed distance and rooted median algorithms on simulated data. CONCLUSION: Our results provide a simple genome rearrangement model, extending the SCJ model to account for single-gene duplications, for which we prove a mix of tractability and hardness results. For the NP-complete rooted median problem, we design a simple Integer Linear Program. Our publicly available implementation of these algorithms for the directed distance and median problems makes it possible to solve these problems efficiently on large instances.
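For orientation, the classical duplication-free SCJ distance that this work extends can be computed directly from adjacency sets: with equal gene content and no duplicates, the distance is the number of adjacencies present in exactly one of the two genomes. The sketch below covers only that baseline, not the duplication-aware variant of the paper. The genome encoding (a list of `(chromosome, is_circular)` pairs of signed integers) and all function names are assumptions chosen for illustration.

```python
def extremities(gene):
    """Return the (left, right) extremities of a signed gene read left to right."""
    name = abs(gene)
    tail, head = (name, "t"), (name, "h")
    return (tail, head) if gene > 0 else (head, tail)


def adjacencies(genome):
    """Collect the set of adjacencies of a genome.

    genome: list of (chromosome, is_circular), where a chromosome is a list
    of signed integers (hypothetical encoding, duplication-free).
    """
    adj = set()
    for chrom, circular in genome:
        for left_gene, right_gene in zip(chrom, chrom[1:]):
            adj.add(frozenset((extremities(left_gene)[1],
                               extremities(right_gene)[0])))
        if circular and chrom:
            # Close the circle: last gene's right end meets first gene's left end.
            adj.add(frozenset((extremities(chrom[-1])[1],
                               extremities(chrom[0])[0])))
    return adj


def scj_distance(genome_a, genome_b):
    """Cuts needed to remove A-only adjacencies plus joins to add B-only ones."""
    a, b = adjacencies(genome_a), adjacencies(genome_b)
    return len(a - b) + len(b - a)


ancestor = [([1, 2, 3], False)]
inverted = [([1, -2, 3], False)]   # gene 2 inverted: two cuts and two joins
fissioned = [([1, 2], False), ([3], False)]
```

Because the distance reduces to a symmetric difference of sets, it runs in time linear in the number of genes, which is what makes the SCJ model a convenient starting point for the harder median and small parsimony problems.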

12.

Counting and sampling gene family evolutionary histories in the duplication-loss and duplication-loss-transfer models.

Chauve, Cedric; Ponty, Yann; Wallner, Michael.

J Math Biol ; 80(5): 1353-1388, 2020 04.

Article in English | MEDLINE | ID: mdl-32060618

ABSTRACT

Given a set of species whose evolution is represented by a species tree, a gene family is a group of genes having evolved from a single ancestral gene. A gene family evolves along the branches of a species tree through various mechanisms, including, but not limited to, speciation (S), gene duplication (D), gene loss (L), and horizontal gene transfer (T). The reconstruction of a gene tree representing the evolution of a gene family constrained by a species tree is an important problem in phylogenomics. However, unlike in the multispecies coalescent evolutionary model that considers only speciation and incomplete lineage sorting events, very little is known about the search space for gene family histories accounting for gene duplication, gene loss and horizontal gene transfer (the DLT-model). In this work, we introduce the notion of evolutionary histories defined as a binary ordered rooted tree describing the evolution of a gene family, constrained by a species tree in the DLT-model. We provide formal grammars describing the set of all evolutionary histories that are compatible with a given species tree, whether it is ranked or unranked. These grammars allow us, using either analytic combinatorics or dynamic programming, to efficiently compute the number of histories of a given size, and also to generate random histories of a given size under the uniform distribution. We apply these tools to obtain exact asymptotics for the number of gene family histories for two species trees, the rooted caterpillar and complete binary tree, as well as estimates of the range of the exponential growth factor of the number of histories for random species trees of size up to 25. Our results show that including horizontal gene transfers induces a dramatic increase of the number of evolutionary histories.
We also show that, within ranked species trees, the number of evolutionary histories in the DLT-model is almost independent of the species tree topology. These results establish firm foundations for the development of ensemble methods for the prediction of reconciliations.

Subjects

Molecular Evolution , Genetic Models , Algorithms , Computational Biology , Computer Simulation , Gene Deletion , Gene Duplication , Horizontal Gene Transfer , Genetic Speciation , Mathematical Concepts , Multigene Family , Phylogeny

13.

Evolutionary superscaffolding and chromosome anchoring to improve Anopheles genome assemblies.

Waterhouse, Robert M; Aganezov, Sergey; Anselmetti, Yoann; Lee, Jiyoung; Ruzzante, Livio; Reijnders, Maarten J M F; Feron, Romain; Bérard, Sèverine; George, Phillip; Hahn, Matthew W; Howell, Paul I; Kamali, Maryam; Koren, Sergey; Lawson, Daniel; Maslen, Gareth; Peery, Ashley; Phillippy, Adam M; Sharakhova, Maria V; Tannier, Eric; Unger, Maria F; Zhang, Simo V; Alekseyev, Max A; Besansky, Nora J; Chauve, Cedric; Emrich, Scott J; Sharakhov, Igor V.

BMC Biol ; 18(1): 1, 2020 01 02.

Article in English | MEDLINE | ID: mdl-31898513

ABSTRACT

BACKGROUND: New sequencing technologies have lowered financial barriers to whole genome sequencing, but resulting assemblies are often fragmented and far from 'finished'. Updating multi-scaffold drafts to chromosome-level status can be achieved through experimental mapping or re-sequencing efforts. Avoiding the costs associated with such approaches, comparative genomic analysis of gene order conservation (synteny) to predict scaffold neighbours (adjacencies) offers a potentially useful complementary method for improving draft assemblies. RESULTS: We evaluated and employed 3 gene synteny-based methods applied to 21 Anopheles mosquito assemblies to produce consensus sets of scaffold adjacencies. For subsets of the assemblies, we integrated these with additional supporting data to confirm and complement the synteny-based adjacencies: 6 with physical mapping data that anchor scaffolds to chromosome locations, 13 with paired-end RNA sequencing (RNAseq) data, and 3 with new assemblies based on re-scaffolding or long-read data. Our combined analyses produced 20 new superscaffolded assemblies with improved contiguities: 7 for which assignments of non-anchored scaffolds to chromosome arms span more than 75% of the assemblies, and a further 7 with chromosome anchoring including an 88% anchored Anopheles arabiensis assembly and, respectively, 73% and 84% anchored assemblies with comprehensively updated cytogenetic photomaps for Anopheles funestus and Anopheles stephensi. CONCLUSIONS: Experimental data from probe mapping, RNAseq, or long-read technologies, where available, all contribute to successful upgrading of draft assemblies. Our evaluations show that gene synteny-based computational methods represent a valuable alternative or complementary approach. Our improved Anopheles reference assemblies highlight the utility of applying comparative genomics approaches to improve community genomic resources.

Subjects

Anopheles/genetics , Biological Evolution , Chromosomes , Genetic Techniques/instrumentation , Genomics/methods , Synteny , Animals , Chromosome Mapping

14.

Deconvoluting the diversity of within-host pathogen strains in a multi-locus sequence typing framework.

Gan, Guo Liang; Willie, Elijah; Chauve, Cedric; Chindelevitch, Leonid.

BMC Bioinformatics ; 20(Suppl 20): 637, 2019 Dec 17.

Article in English | MEDLINE | ID: mdl-31842753

ABSTRACT

BACKGROUND: Bacterial pathogens exhibit an impressive amount of genomic diversity. This diversity can be informative of evolutionary adaptations, host-pathogen interactions, and disease transmission patterns. However, capturing this diversity directly from biological samples is challenging. RESULTS: We introduce a framework for understanding the within-host diversity of a pathogen using multi-locus sequence types (MLST) from whole-genome sequencing (WGS) data. Our approach consists of two stages. First we process each sample individually by assigning it, for each locus in the MLST scheme, a set of alleles and a proportion for each allele. Next, we associate to each sample a set of strain types using the alleles and the strain proportions obtained in the first step. We achieve this by using the smallest possible number of previously unobserved strains across all samples, while using those unobserved strains which are as close to the observed ones as possible, at the same time respecting the allele proportions as closely as possible. We solve both problems using mixed integer linear programming (MILP). Our method performs accurately on simulated data and generates results on a real data set of Borrelia burgdorferi genomes suggesting a high level of diversity for this pathogen. CONCLUSIONS: Our approach can apply to any bacterial pathogen with an MLST scheme, even though we developed it with Borrelia burgdorferi, the etiological agent of Lyme disease, in mind. Our work paves the way for robust strain typing in the presence of within-host heterogeneity, overcoming an essential challenge currently not addressed by any existing methodology for pathogen genomics.

Subjects

Genetic Variation , Host-Pathogen Interactions/genetics , Multilocus Sequence Typing , Alleles , Borrelia burgdorferi/genetics , Computer Simulation , Genetic Databases , Genetic Loci , Biological Models

15.

HyAsP, a greedy tool for plasmids identification.

Müller, Robert; Chauve, Cedric.

Bioinformatics ; 35(21): 4436-4439, 2019 11 01.

Article in English | MEDLINE | ID: mdl-31116364

ABSTRACT

MOTIVATION: Plasmids are ubiquitous in bacterial genomes, and have been shown to be involved in important evolutionary processes, in particular the acquisition of antimicrobial resistance. However, separating chromosomal contigs from plasmid contigs and assembling the latter is a challenging problem. RESULTS: We introduce HyAsP, a tool that identifies, bins and assembles plasmid contigs following a hybrid approach based on a database of known plasmid genes and a greedy assembly algorithm. We test HyAsP on a large sample of bacterial datasets and observe that it generally outperforms other tools. AVAILABILITY AND IMPLEMENTATION: https://github.com/cchauve/HyAsP. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Software , Algorithms , Bacterial Genome , Plasmids , DNA Sequence Analysis

16.

The SCJ Small Parsimony Problem for Weighted Gene Adjacencies.

Luhmann, Nina; Lafond, Manuel; Thevenin, Annelyse; Ouangraoua, Aida; Wittler, Roland; Chauve, Cedric.

IEEE/ACM Trans Comput Biol Bioinform ; 16(4): 1364-1373, 2019.

Article in English | MEDLINE | ID: mdl-28166504

ABSTRACT

Reconstructing ancestral gene orders in a given phylogeny is a classical problem in comparative genomics. Most existing methods compare conserved features in extant genomes in the phylogeny to define potential ancestral gene adjacencies, and either try to reconstruct all ancestral genomes under a global evolutionary parsimony criterion, or, focusing on a single ancestral genome, use a scaffolding approach to select a subset of ancestral gene adjacencies, generally aiming at reducing the fragmentation of the reconstructed ancestral genome. In this paper, we describe an exact algorithm for the Small Parsimony Problem that combines both approaches. We consider that gene adjacencies at internal nodes of the species phylogeny are weighted, and we introduce an objective function defined as a convex combination of these weights and the evolutionary cost under the Single-Cut-or-Join (SCJ) model. The weights of ancestral gene adjacencies can, e.g., be obtained through the recent availability of ancient DNA sequencing data, which provide a direct hint at the genome structure of the considered ancestor, or through probabilistic analysis of gene adjacency evolution. We show the NP-hardness of our problem variant and propose a Fixed-Parameter Tractable algorithm based on the Sankoff-Rousseau dynamic programming algorithm that also allows sampling of co-optimal solutions. We apply our approach to mammalian and bacterial data providing different degrees of complexity. We show that including adjacency weights in the objective has a significant impact in reducing the fragmentation of the reconstructed ancestral gene orders. An implementation is available at http://github.com/nluhmann/PhySca.

Subjects

Algorithms , Computational Biology/methods , Bacterial Genome , Genomics/methods , Animals , Biological Evolution , Computer Simulation , Genetic Databases , Molecular Evolution , Gene Order , Genetic Markers/genetics , Genetic Models , Opossums/genetics , Phylogeny , Plasmids/metabolism , Probability , Reproducibility of Results , Swine/genetics , Yersinia/genetics
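The dynamic programming scheme that the abstract builds on can be illustrated on the simplest case: one binary character (presence or absence of a single gene adjacency) on a rooted tree, with unit cost for each gain (join) or loss (cut) along a branch. This is a generic Sankoff-style sketch under those assumptions; it ignores adjacency weights and the consistency constraints between adjacencies, so it is not the Fixed-Parameter Tractable algorithm of the paper. The tree encoding and names are hypothetical.

```python
import math


def sankoff(children, root, leaf_state, states=(0, 1)):
    """Bottom-up Sankoff tables for one character on a rooted tree.

    children:   dict internal node -> list of child nodes
    leaf_state: dict leaf -> observed state (1 = adjacency present, 0 = absent)
    Returns (minimum number of state changes, table), where table[node][s]
    is the cheapest cost of the subtree below node given node has state s.
    """
    table = {}

    def fill(node):
        kids = children.get(node, [])
        if not kids:
            # A leaf is free in its observed state and forbidden in any other.
            table[node] = {s: (0 if s == leaf_state[node] else math.inf)
                           for s in states}
            return
        for kid in kids:
            fill(kid)
        table[node] = {
            s: sum(min(table[kid][t] + (s != t) for t in states) for kid in kids)
            for s in states
        }

    fill(root)
    return min(table[root].values()), table


# Tree ((A,B)X,(C,D)Y)R; the adjacency is observed in A, B and D but not in C.
children = {"R": ["X", "Y"], "X": ["A", "B"], "Y": ["C", "D"]}
leaf_state = {"A": 1, "B": 1, "C": 0, "D": 1}
best_cost, table = sankoff(children, "R", leaf_state)
```

Here a single loss on the branch leading to C explains the data, so the minimum cost is 1. Keeping the full table rather than only the optimum is what allows a traceback to enumerate or sample co-optimal labellings.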

17.

Alignment-free clustering of UMI tagged DNA molecules.

Orabi, Baraa; Erhan, Emre; McConeghy, Brian; Volik, Stanislav V; Le Bihan, Stephane; Bell, Robert; Collins, Colin C; Chauve, Cedric; Hach, Faraz.

Bioinformatics ; 35(11): 1829-1836, 2019 06 01.

Article in English | MEDLINE | ID: mdl-30351359

ABSTRACT

MOTIVATION: Next-Generation Sequencing has led to the availability of massive genomic datasets whose processing raises many challenges, including the handling of sequencing errors. This is especially pertinent in cancer genomics, e.g. for detecting low allele frequency variations from circulating tumor DNA. Barcode tagging of DNA molecules with unique molecular identifiers (UMI) attempts to mitigate sequencing errors; UMI tagged molecules are polymerase chain reaction (PCR) amplified, and the PCR copies of UMI tagged molecules are sequenced independently. However, the PCR and sequencing steps can generate errors in the sequenced reads that can be located in the barcode and/or the DNA sequence. Analyzing UMI tagged sequencing data requires an initial clustering step, with the aim of grouping reads sequenced from PCR duplicates of the same UMI tagged molecule into a single cluster, and the size of the current datasets requires this clustering process to be resource-efficient. RESULTS: We introduce Calib, a computational tool that clusters paired-end reads from UMI tagged sequencing experiments generated by substitution-error-dominant sequencing platforms such as Illumina. Calib clusters are defined as connected components of a graph whose edges are defined in terms of both barcode similarity and read sequence similarity. The graph is constructed efficiently using locality sensitive hashing and MinHashing techniques. Calib's default clustering parameters are optimized empirically, for different UMI and read lengths, using a simulation module that is packaged with Calib. Compared to other tools, Calib has the best accuracy on simulated data, while maintaining reasonable runtime and memory footprint. On a real dataset, Calib runs with far fewer resources than alignment-based methods, and its clusters reduce the number of tentative false positives in downstream variation calling.
AVAILABILITY AND IMPLEMENTATION: Calib is implemented in C++ and its simulation module is implemented in Python. Calib is available at https://github.com/vpc-ccg/calib. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

High-Throughput Nucleotide Sequencing , Software , Algorithms , Cluster Analysis , DNA , Sequence Analysis, DNA
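
The clustering idea stated in the Calib abstract (clusters are the connected components of a graph whose edges require both barcode similarity and read sequence similarity) can be illustrated with a minimal Python sketch. This is not Calib's implementation: it compares all pairs of reads rather than using locality sensitive hashing or MinHashing, it measures sequence similarity as the exact Jaccard index of k-mer sets, and the function names, thresholds, and the assumption of equal-length barcodes are all hypothetical choices made for illustration.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))

def kmers(seq, k=4):
    """Set of all k-mers in a sequence (illustrative stand-in for a MinHash sketch)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cluster_reads(reads, max_barcode_dist=1, min_seq_sim=0.5, k=4):
    """Cluster (barcode, sequence) pairs into connected components.

    Two reads are joined by an edge when their barcodes differ in at most
    max_barcode_dist positions AND their k-mer sets have Jaccard similarity
    of at least min_seq_sim. All parameters are hypothetical defaults.
    """
    parent = list(range(len(reads)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    kmer_sets = [kmers(seq, k) for _, seq in reads]
    for i, j in combinations(range(len(reads)), 2):
        if (hamming(reads[i][0], reads[j][0]) <= max_barcode_dist
                and jaccard(kmer_sets[i], kmer_sets[j]) >= min_seq_sim):
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(reads)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

The all-pairs loop is quadratic in the number of reads, which is exactly the cost that hashing-based graph construction is meant to avoid on large datasets; the sketch only shows how the two similarity conditions combine into edges.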

18.

Scaffolding of Ancient Contigs and Ancestral Reconstruction in a Phylogenetic Framework.

Luhmann, Nina; Chauve, Cedric; Stoye, Jens; Wittler, Roland.

IEEE/ACM Trans Comput Biol Bioinform ; 15(6): 2094-2100, 2018.

Article in English | MEDLINE | ID: mdl-29993816

ABSTRACT

Ancestral genome reconstruction is an important task in analyzing the evolution of genomes. Recent progress in sequencing ancient DNA has led to the publication of so-called paleogenomes and allows the integration of this sequencing data in genome evolution analysis. However, the de novo assembly of ancient genomes is usually fragmented due, among other factors, to DNA degradation over time. Integrated phylogenetic assembly addresses the issue of genome fragmentation in the ancient DNA assembly while aiming to improve the reconstruction of all ancient genomes in the phylogeny simultaneously. The fragmented assembly of the ancient genome can be represented as an assembly graph, indicating contradicting ordering information of contigs. In this setting, our approach is to compare the ancient data with extant finished genomes. We generalize a reconstruction approach minimizing the Single-Cut-or-Join rearrangement distance towards multifurcating trees and include edge lengths to improve the reconstruction in practice. This results in a polynomial time algorithm that includes additional ancient DNA data at one node in the tree, resulting in consistent reconstructions of ancestral genomes.

Subjects

DNA, Ancient/analysis , DNA , Genomics/methods , Sequence Analysis, DNA/methods , Algorithms , Animals , DNA/analysis , DNA/classification , DNA/genetics , Evolution, Molecular , History, Ancient , History, Medieval , Humans , Models, Genetic , Paleontology , Phylogeny , Plague/history , Plague/microbiology , Rats , Sequence Alignment/methods , Yersinia pestis/classification , Yersinia pestis/genetics
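
As background for the Single-Cut-or-Join (SCJ) distance named in the abstract above: for two genomes with the same gene content, the SCJ distance is the number of adjacencies present in one genome but not the other. The Python sketch below illustrates only this pairwise distance, not the tree-based reconstruction algorithm of the paper. It assumes linear chromosomes given as lists of signed integers, and the function names and the head/tail encoding of gene extremities are illustrative choices.

```python
def adjacencies(chromosomes):
    """Set of adjacencies of a genome given as linear chromosomes.

    Each chromosome is a list of signed integers; gene g has a tail (g, "t")
    and a head (g, "h"), and +g is read tail-to-head. An adjacency is the
    unordered pair of extremities that touch between consecutive genes.
    """
    adj = set()
    for chrom in chromosomes:
        for a, b in zip(chrom, chrom[1:]):
            right_of_a = (abs(a), "h" if a > 0 else "t")
            left_of_b = (abs(b), "t" if b > 0 else "h")
            adj.add(frozenset((right_of_a, left_of_b)))
    return adj

def scj_distance(genome1, genome2):
    """SCJ distance: adjacencies to cut plus adjacencies to join."""
    a1, a2 = adjacencies(genome1), adjacencies(genome2)
    return len(a1 - a2) + len(a2 - a1)
```

For example, `scj_distance([[1, 2, 3]], [[1, -2, 3]])` counts two cuts and two joins, because inverting gene 2 destroys both of its adjacencies and creates two new ones. Reading a chromosome in the opposite direction (`[[-3, -2, -1]]`) yields the same adjacency set, so the distance to `[[1, 2, 3]]` is zero.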

19.

Phylogenetic signal from rearrangements in 18 Anopheles species by joint scaffolding extant and ancestral genomes.

Anselmetti, Yoann; Duchemin, Wandrille; Tannier, Eric; Chauve, Cedric; Bérard, Sèverine.

BMC Genomics ; 19(Suppl 2): 96, 2018 May 09.

Article in English | MEDLINE | ID: mdl-29764366

ABSTRACT

BACKGROUND: Genome rearrangements carry valuable information for phylogenetic inference or the elucidation of molecular mechanisms of adaptation. However, the detection of genome rearrangements is often hampered by current deficiencies in data and methods: genomes obtained from short sequence reads generally have very fragmented assemblies, and comparing multiple gene orders generally leads to computationally intractable algorithmic questions. RESULTS: We present a computational method, ADSEQ, which, by combining ancestral gene order reconstruction, comparative scaffolding and de novo scaffolding methods, overcomes these two caveats. ADSEQ simultaneously provides improved assemblies and ancestral genomes, with statistical supports on all local features. Compared to previous comparative methods, it runs in polynomial time, it samples solutions in a probabilistic space, and it can handle a significantly larger gene complement from the considered extant genomes, with complex histories including gene duplications and losses. We use ADSEQ to provide improved assemblies and a genome history made of duplications, losses, gene translocations, and rearrangements for 18 complete Anopheles genomes, including several important malaria vectors. We also provide additional support for a differentiated mode of evolution of the sex chromosome and of the autosomes in these mosquito genomes. CONCLUSIONS: We demonstrate the method's ability to improve extant assemblies accurately through a procedure simulating realistic assembly fragmentation. We study a debated issue regarding the phylogeny of the Gambiae complex group of Anopheles genomes in the light of the evolution of chromosomal rearrangements, suggesting that the phylogenetic signal they carry can differ from the phylogenetic signal carried by gene sequences, which are more prone to introgression.

Subjects

Anopheles/genetics , Computational Biology/methods , Gene Rearrangement , Mosquito Vectors/genetics , Algorithms , Animals , Evolution, Molecular , Gene Order , Genome, Insect , Phylogeny , Sex Chromosomes/genetics

20.

Beaver Fever: Whole-Genome Characterization of Waterborne Outbreak and Sporadic Isolates To Study the Zoonotic Transmission of Giardiasis.

Tsui, Clement K-M; Miller, Ruth; Uyaguari-Diaz, Miguel; Tang, Patrick; Chauve, Cedric; Hsiao, William; Isaac-Renton, Judith; Prystajecky, Natalie.

mSphere ; 3(2)2018 04 25.

Article in English | MEDLINE | ID: mdl-29695621

ABSTRACT

Giardia causes the diarrheal disease known as giardiasis; transmission through contaminated surface water is common. The protozoan parasite's genetic diversity has major implications for human health and epidemiology. To determine the extent of transmission from wildlife through surface water, we performed whole-genome sequencing (WGS) to characterize 89 Giardia duodenalis isolates from both outbreak and sporadic infections: 29 isolates from raw surface water, 38 from humans, and 22 from veterinary sources. Using single nucleotide variants (SNVs), combined with epidemiological data, relationships contributing to zoonotic transmission were described. Two assemblages, A and B, were identified in surface water, human, and veterinary isolates. Mixes of zoonotic assemblages A and B were seen in all the community waterborne outbreaks in British Columbia (BC), Canada, studied. Assemblage A was further subdivided into assemblages A1 and A2 based on the genetic variation observed. The A1 assemblage was highly clonal; isolates of surface water, human, and veterinary origins from Canada, United States, and New Zealand clustered together with minor variation, consistent with this being a panglobal zoonotic lineage. In contrast, assemblage B isolates were variable and consisted of several clonal lineages relating to waterborne outbreaks and geographic locations. Most human infection isolates in waterborne outbreaks clustered with isolates from surface water and beavers implicated to be outbreak sources by public health. In-depth outbreak analysis demonstrated that beavers can act as amplification hosts for human infections and can act as sources of surface water contamination. It is also known that other wild and domesticated animals, as well as humans, can be sources of waterborne giardiasis. 
This study demonstrates the utility of WGS in furthering our understanding of Giardia transmission dynamics at the water-human-animal interface. IMPORTANCE: Giardia duodenalis causes a large number of gastrointestinal illnesses in humans. Its transmission through the contaminated surface water/wildlife intersect is significant, and beavers, which are water-dwelling rodents, have been implicated as one important reservoir. To trace human infections to their source, we used genome techniques to characterize genetic relationships among 89 Giardia isolates from surface water, humans, and animals. Our study showed the presence of two previously described genetic assemblages, A and B, with mixed infections detected from isolates collected during outbreaks. Study findings also showed that while assemblage A could be divided into A1 and A2, A1 showed little genetic variation among animal and human hosts in isolates collected from across the globe. Assemblage B, the most common type found in the study surface water samples, was shown to be highly variable. Our study demonstrates that the beaver is a possible source of human infections from contaminated surface water, while acknowledging that theirs is only one role in the complex cycle of zoonotic spread. Mixes of parasite groups have been detected in waterborne outbreaks. More information on Giardia diversity and its evolution using genomics will further the understanding of the epidemiology of spread of this disease-causing protozoan.

Subjects

Giardia lamblia/genetics , Giardiasis/veterinary , Rodentia/parasitology , Water/parasitology , Zoonoses/transmission , Animals , British Columbia/epidemiology , Disease Outbreaks , Feces/parasitology , Genetic Variation , Genotype , Giardia lamblia/classification , Giardia lamblia/isolation & purification , Giardiasis/epidemiology , Giardiasis/transmission , Humans , New Zealand/epidemiology , Phylogeny , Polymorphism, Single Nucleotide , Public Health , United States/epidemiology , Whole Genome Sequencing , Zoonoses/epidemiology , Zoonoses/parasitology
