1.
Anal Chem ; 96(5): 1825-1833, 2024 02 06.
Article in English | MEDLINE | ID: mdl-38275837

ABSTRACT

Cancer onset and progression are known to be regulated by genetic and epigenetic events, including RNA modifications (a.k.a. epitranscriptomics). So far, more than 150 chemical modifications have been described in all RNA subtypes, including messenger, ribosomal, and transfer RNAs. RNA modifications and their regulators are known to be implicated in all steps of post-transcriptional regulation. The dysregulation of this complex yet delicate balance can contribute to disease evolution, particularly in the context of carcinogenesis, where cells are subjected to various stresses. We sought to discover RNA modifications involved in cancer cell adaptation to inhospitable environments, a peculiar feature of cancer stem cells (CSCs). We were particularly interested in the RNA marks that help the adaptation of cancer cells to suspension culture, which is often used as a surrogate to evaluate the tumorigenic potential. For this purpose, we designed an experimental pipeline consisting of four steps: (1) cell culture in different growth conditions to favor CSC survival; (2) simultaneous RNA subtype (mRNA, rRNA, tRNA) enrichment and RNA hydrolysis; (3) the multiplex analysis of nucleosides by LC-MS/MS followed by statistical/bioinformatic analysis; and (4) the functional validation of identified RNA marks. This study demonstrates that the RNA modification landscape evolves along with the cancer cell phenotype under growth constraints. Remarkably, we discovered a short epitranscriptomic signature, conserved across colorectal cancer cell lines and associated with enrichment in CSCs. Functional tests confirmed the importance of selected marks in the process of adaptation to suspension culture, confirming the validity of our approach and opening up interesting prospects in the field.


Subjects
Neoplasms, Post-Transcriptional RNA Processing, Liquid Chromatography, Tandem Mass Spectrometry, RNA/genetics, RNA/metabolism, Transfer RNA/genetics, Transfer RNA/metabolism, Neoplasms/genetics
2.
Bioinformatics ; 39(4)2023 04 03.
Article in English | MEDLINE | ID: mdl-37010504

ABSTRACT

MOTIVATION: Seeking probabilistic motifs in a sequence is a common task to annotate putative transcription factor binding sites or other RNA/DNA binding sites. Useful motif representations include position weight matrices (PWMs), dinucleotide PWMs (di-PWMs), and hidden Markov models (HMMs). Dinucleotide PWMs not only combine the simplicity of PWMs (a matrix form and a cumulative scoring function) but also incorporate dependency between adjacent positions in the motif (unlike PWMs, which disregard any dependency). For instance, to represent binding sites, the HOCOMOCO database provides di-PWM motifs derived from experimental data. Currently, two programs, SPRy-SARUS and MOODS, can search for occurrences of di-PWMs in sequences. RESULTS: We propose a Python package called dipwmsearch, which provides an original and efficient algorithm for this task (it first enumerates matching words for the di-PWM, and then searches these all at once in the sequence, even if the latter contains IUPAC codes). The user benefits from an easy installation via PyPI or conda, comprehensive documentation, and executable scripts that facilitate the use of di-PWMs. AVAILABILITY AND IMPLEMENTATION: dipwmsearch is available at https://pypi.org/project/dipwmsearch/ and https://gite.lirmm.fr/rivals/dipwmsearch/ under the CeCILL license.
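
The cumulative scoring function of a di-PWM can be illustrated with a minimal Python sketch. It is not the dipwmsearch algorithm or its API: it scans every window naively instead of enumerating matching words, it ignores IUPAC codes, and the names `score_word`, `search`, and the toy matrix values are all assumptions made for illustration.

```python
from typing import Dict, List

# A di-PWM is modelled here as a list of columns; each column maps one of the
# 16 dinucleotides to a score. A motif of length m + 1 has m columns.
DiPWM = List[Dict[str, float]]


def score_word(dipwm: DiPWM, word: str) -> float:
    """Cumulative score: sum of the scores of each adjacent dinucleotide."""
    if len(word) != len(dipwm) + 1:
        raise ValueError("a di-PWM with m columns scores words of length m + 1")
    return sum(column[word[i:i + 2]] for i, column in enumerate(dipwm))


def search(dipwm: DiPWM, sequence: str, threshold: float) -> List[int]:
    """Return the start positions whose window scores at or above the threshold."""
    width = len(dipwm) + 1
    return [
        start
        for start in range(len(sequence) - width + 1)
        if score_word(dipwm, sequence[start:start + width]) >= threshold
    ]


# Hypothetical motif of length 3 (two dinucleotide columns); scores are made up.
BASES = "ACGT"
toy = [{a + b: -1.0 for a in BASES for b in BASES} for _ in range(2)]
toy[0]["AC"] = 2.0
toy[1]["CG"] = 3.0

hits = search(toy, "TTACGACGT", threshold=5.0)
print(hits)  # start positions of the two ACG occurrences
```

Because each column is indexed by a dinucleotide, the score of position i depends on the base at position i + 1, which is the adjacent-position dependency that a plain PWM cannot express.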


Subjects
Algorithms, Computational Biology, Binding Sites, Protein Binding, Position-Specific Scoring Matrices
3.
Bioinformatics ; 39(12)2023 12 01.
Article in English | MEDLINE | ID: mdl-37975872

ABSTRACT

MOTIVATION: Phylogenetic placement enables phylogenetic analysis of massive collections of newly sequenced DNA, when de novo tree inference is too unreliable or inefficient. Assuming that a high-quality reference tree is available, the idea is to seek the correct placement of the new sequences in that tree. Recently, alignment-free approaches to phylogenetic placement have emerged, both to circumvent the need to align the new sequences and to avoid the calculations that typically follow the alignment step. A promising approach is based on the inference of k-mers that can be potentially related to the reference sequences, also called phylo-k-mers. However, its usage is limited by the time- and memory-consuming stage of reference data preprocessing and the large numbers of k-mers to consider. RESULTS: We suggest a filtering method for selecting informative phylo-k-mers based on mutual information, which can significantly improve the efficiency of placement, at the cost of a small loss in placement accuracy. This method is implemented in IPK, a new tool for computing phylo-k-mers that significantly outperforms the software previously available. We also present EPIK, new software for phylogenetic placement that supports filtered phylo-k-mer databases. Our experiments on real-world data show that EPIK is the fastest phylogenetic placement tool available when placing hundreds of thousands and millions of queries, while still providing accurate placements. AVAILABILITY AND IMPLEMENTATION: IPK and EPIK are freely available at https://github.com/phylo42/IPK and https://github.com/phylo42/EPIK. Both are implemented in C++ and Python and supported on Linux and macOS.


Subjects
Algorithms, Software, Phylogeny, DNA Sequence Analysis, Base Sequence
4.
PLoS Comput Biol ; 19(10): e1011522, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37862386

ABSTRACT

Gene expression is the synthesis of proteins from the information encoded on DNA. One of the two main steps of gene expression is the translation of messenger RNA (mRNA) into polypeptide sequences of amino acids. Here, by taking into account mRNA degradation, we model the motion of ribosomes along mRNA with a ballistic model where particles advance along a filament without excluded volume interactions. Unidirectional models of transport have previously been used to fit the average density of ribosomes obtained by the experimental ribo-sequencing (Ribo-seq) technique in order to obtain the kinetic rates. In those fits, however, the degradation rate is not accounted for, and experimental data from different experiments are needed to have enough parameters for the fit. Here, we propose an entirely novel experimental setup and theoretical framework consisting of splitting the mRNAs into categories depending on the number of ribosomes, from one to four. We solve analytically the ballistic model for a fixed number of ribosomes per mRNA, study the different regimes of degradation, and propose a criterion for the quality of the inverse fit. The proposed method provides a high sensitivity to the mRNA degradation rate. The additional equations coming from using the monosome (single ribosome) and polysome (arbitrary number) Ribo-seq profiles enable us to determine all the kinetic rates in terms of the experimentally accessible mRNA degradation rate.


Subjects
Protein Biosynthesis, Ribosome Profiling, Messenger RNA/metabolism, Protein Biosynthesis/genetics, Ribosomes/genetics, Ribosomes/metabolism, Proteins/metabolism
5.
Malar J ; 22(1): 27, 2023 Jan 25.
Article in English | MEDLINE | ID: mdl-36698187

ABSTRACT

BACKGROUND: Protozoan parasites are known to attach a specific and diverse group of proteins to their plasma membrane via a GPI anchor. In malaria parasites, GPI-anchored proteins (GPI-APs) have been shown to play an important role in host-pathogen interactions and a key function in host cell invasion and immune evasion. Because of their immunogenic properties, some of these proteins have been considered as malaria vaccine candidates. However, identification of all possible GPI-APs encoded by these parasites remains challenging due to their sequence diversity and limitations of the tools used for their characterization. METHODS: The FT-GPI software was developed to detect GPI-APs based on the presence of a hydrophobic helix at both ends of the premature peptide. FT-GPI was implemented in C++ and applied to study the GPI-proteome of 46 isolates of the order Haemosporida. Using the GPI-proteome of Plasmodium falciparum strain 3D7 and Plasmodium vivax strain Sal-1, a heuristic method was defined to select the most sensitive and specific FT-GPI software parameters. RESULTS: FT-GPI enabled revision of the GPI-proteome of P. falciparum and P. vivax, including the identification of novel GPI-APs. Orthology- and synteny-based analyses showed that 19 of the 37 GPI-APs found in the order Haemosporida are conserved among Plasmodium species. Our analyses suggest that gene duplication and deletion events may have contributed significantly to the evolution of the GPI-proteome, and its composition correlates with speciation. CONCLUSION: FT-GPI-based prediction is a useful tool for mining GPI-APs and gaining further insights into their evolution and sequence diversity. This resource may also help identify new protein candidates for the development of vaccines for malaria and other parasitic diseases.
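
The detection criterion described in METHODS (a hydrophobic stretch at both ends of the peptide) can be sketched in Python as follows. This is only an illustration and not the FT-GPI implementation: the Kyte-Doolittle hydropathy scale, the sliding-window mean, and the parameters `end_len`, `window`, and `cutoff` are assumptions chosen for the example, and the abstract does not say which scale or thresholds FT-GPI uses.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(segment: str) -> float:
    """Average hydropathy of a non-empty peptide segment."""
    return sum(KD[aa] for aa in segment) / len(segment)


def has_hydrophobic_stretch(region: str, window: int, cutoff: float) -> bool:
    """True if any window of the region has a mean hydropathy >= cutoff."""
    return any(
        mean_hydropathy(region[i:i + window]) >= cutoff
        for i in range(len(region) - window + 1)
    )


def looks_gpi_anchored(protein: str, end_len: int = 30,
                       window: int = 10, cutoff: float = 1.6) -> bool:
    """Hypothetical filter: hydrophobic stretch near both the N- and C-terminus."""
    return (has_hydrophobic_stretch(protein[:end_len], window, cutoff)
            and has_hydrophobic_stretch(protein[-end_len:], window, cutoff))


# Made-up sequences: a polar core with or without hydrophobic ends.
hydrophobic = "LLVALLLAVLIL"
polar_core = "DKESNQRDKE" * 5
candidate = "MK" + hydrophobic + polar_core + hydrophobic
n_term_only = "MK" + hydrophobic + polar_core
soluble = "MK" + polar_core

print(looks_gpi_anchored(candidate), looks_gpi_anchored(n_term_only))
```

Requiring both ends is what separates the candidate from a protein that carries only an N-terminal signal-like segment, which is why `n_term_only` is rejected in this toy example.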


Subjects
GPI-Linked Proteins, Plasmodium falciparum, Plasmodium vivax, Proteome, Protozoan Proteins, GPI-Linked Proteins/genetics, Plasmodium falciparum/genetics, Plasmodium vivax/genetics, Proteome/analysis, Protozoan Proteins/genetics
6.
Bioinformatics ; 36(21): 5264-5266, 2021 01 29.
Article in English | MEDLINE | ID: mdl-32697844

ABSTRACT

MOTIVATION: Phylogenetic placement (PP) is a process of taxonomic identification for which several tools are now available. However, it remains difficult to assess which tool is more adapted to particular genomic data or a particular reference taxonomy. We developed Placement Evaluation WOrkflows (PEWO), the first benchmarking tool dedicated to PP assessment. Its automated workflows can evaluate PP at many levels, from parameter optimization for a particular tool, to the selection of the most appropriate genetic marker when PP-based species identifications are targeted. Our goal is that PEWO will become a community effort and a standard support for future developments and applications of PP. AVAILABILITY AND IMPLEMENTATION: https://github.com/phylo42/PEWO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Benchmarking, Software, Genome, Phylogeny, Workflow
7.
Bioinformatics ; 36(22-23): 5351-5360, 2021 Apr 01.
Article in English | MEDLINE | ID: mdl-33331849

ABSTRACT

MOTIVATION: Novel recombinant viruses may have important medical and evolutionary significance, as they sometimes display new traits not present in the parental strains. This is particularly concerning when the new viruses combine fragments coming from phylogenetically distinct viral types. Here, we consider the task of screening large collections of sequences for such novel recombinants. A number of methods already exist for this task. However, these methods rely on complex models and heavy computations that are not always practical for a quick scan of a large number of sequences. RESULTS: We have developed SHERPAS, a new program to detect novel recombinants and provide a first estimate of their parental composition. Our approach is based on the precomputation of a large database of 'phylogenetically-informed k-mers', an idea recently introduced in the context of phylogenetic placement in metagenomics. Our experiments show that SHERPAS is hundreds to thousands of times faster than existing software, and enables the analysis of thousands of whole genomes, or long-sequencing reads, within minutes or seconds, and with limited loss of accuracy. AVAILABILITY AND IMPLEMENTATION: The source code is freely available for download at https://github.com/phylo42/sherpas. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

8.
RNA Biol ; 19(1): 132-142, 2022.
Article in English | MEDLINE | ID: mdl-35067178

ABSTRACT

The last decade has seen mRNA modification emerge as a new layer of gene expression regulation. The Fat mass and obesity-associated protein (FTO) was the first identified eraser of N6-methyladenosine (m6A) adducts, the most widespread modification in eukaryotic messenger RNA. This discovery, of a reversible and dynamic RNA modification, aided by recent technological advances in RNA mass spectrometry and sequencing has led to the birth of the field of epitranscriptomics. FTO crystallized much of the attention of epitranscriptomics researchers and resulted in the publication of numerous, yet contradictory, studies describing the regulatory role of FTO in gene expression and central biological processes. These incongruities may be explained by a wide spectrum of FTO substrates and RNA sequence preferences: FTO binds multiple RNA species (mRNA, snRNA and tRNA) and can demethylate internal m6A in mRNA and snRNA, N6,2'-O-dimethyladenosine (m6Am) adjacent to the mRNA cap, and N1-methyladenosine (m1A) in tRNA. Here, we review current knowledge related to FTO function in healthy and cancer cells. In particular, we emphasize the divergent role(s) attributed to FTO in different tissues and subcellular and molecular contexts.


Subjects
Adipose Tissue/metabolism, Alpha-Ketoglutarate-Dependent Dioxygenase FTO/genetics, Alpha-Ketoglutarate-Dependent Dioxygenase FTO/metabolism, Gene Expression Regulation, Neoplasms/etiology, Neoplasms/metabolism, Adenosine/analogs & derivatives, Adipose Tissue/anatomy & histology, Adiposity, Catalysis, Disease Susceptibility, Genetic Epigenesis, Homeostasis, Humans, Neoplasms/pathology, Organ Specificity, Post-Transcriptional RNA Processing, Messenger RNA/genetics, Small Nuclear RNA/genetics, Transfer RNA/genetics, RNA-Binding Proteins, Substrate Specificity
9.
Genome Res ; 27(5): 835-848, 2017 05.
Article in English | MEDLINE | ID: mdl-28396522

ABSTRACT

A viral quasispecies, the ensemble of viral strains populating an infected person, can be highly diverse. For optimal assessment of virulence, pathogenesis, and therapy selection, determining the haplotypes of the individual strains can play a key role. As many viruses are subject to high mutation and recombination rates, high-quality reference genomes are often not available at the time of a new disease outbreak. We present SAVAGE, a computational tool for reconstructing individual haplotypes of intra-host virus strains without the need for a high-quality reference genome. SAVAGE makes use of either FM-index-based data structures or ad hoc consensus reference sequence for constructing overlap graphs from patient sample data. In this overlap graph, nodes represent reads and/or contigs, while edges reflect that two reads/contigs, based on sound statistical considerations, represent identical haplotypic sequence. Following an iterative scheme, a new overlap assembly algorithm that is based on the enumeration of statistically well-calibrated groups of reads/contigs then efficiently reconstructs the individual haplotypes from this overlap graph. In benchmark experiments on simulated and on real deep-coverage data, SAVAGE drastically outperforms generic de novo assemblers as well as the only specialized de novo viral quasispecies assembler available so far. When run on ad hoc consensus reference sequence, SAVAGE performs very favorably in comparison with state-of-the-art reference genome-guided tools. We also apply SAVAGE on two deep-coverage samples of patients infected by the Zika and the hepatitis C virus, respectively, which sheds light on the genetic structures of the respective viral quasispecies.


Subjects
Contig Mapping/methods, Viral Genome, Genomics/methods, DNA Sequence Analysis/methods, Software, Contig Mapping/standards, Genomics/standards, Haplotypes, Hepacivirus/genetics, Genetic Polymorphism, Reference Standards, DNA Sequence Analysis/standards, Zika Virus/genetics
10.
Bioinformatics ; 35(17): 3163-3165, 2019 09 01.
Article in English | MEDLINE | ID: mdl-30649190

ABSTRACT

MOTIVATION: The visualization and interpretation of evolutionary spatiotemporal scenarios is broadly and increasingly used in infectious disease research, ecology or agronomy. Using probabilistic frameworks, well-known tools can infer from molecular data ancestral traits for internal nodes in a phylogeny, and numerous phylogenetic rendering tools can display such evolutionary trees. However, visualizing such ancestral information and its uncertainty on the tree remains tedious. For instance, ancestral nodes can be associated with several geographical annotations with close probabilities, and thus several migration or transmission scenarios exist. RESULTS: We present a web-based tool, named AQUAPONY, that facilitates such operations. Given an evolutionary tree with ancestral (e.g. geographical) annotations, the user can easily control the display of ancestral information on the entire tree or a subtree, and can view alternative phylogeographic scenarios along a branch according to a chosen uncertainty threshold. AQUAPONY interactively visualizes the tree and eases the objective interpretation of evolutionary scenarios. AQUAPONY's implementation makes it highly responsive to user interaction, and instantaneously updates the tree visualizations even for large trees (which can be exported as image files). AVAILABILITY AND IMPLEMENTATION: AQUAPONY is coded in JavaScript/HTML, available under the CeCILL license, and can be freely used at http://www.atgc-montpellier.fr/aquapony/.


Subjects
Phylogeny, Software, Phenotype, Phylogeography
11.
Bioinformatics ; 33(6): 799-806, 2017 03 15.
Article in English | MEDLINE | ID: mdl-27273673

ABSTRACT

Motivation: New long read sequencing technologies, like PacBio SMRT and Oxford Nanopore, can produce sequencing reads up to 50 000 bp long but with an error rate of at least 15%. Reducing the error rate is necessary for subsequent utilization of the reads in, e.g. de novo genome assembly. The error correction problem has been tackled either by aligning the long reads against each other or by a hybrid approach that uses the more accurate short reads produced by second generation sequencing technologies to correct the long reads. Results: We present an error correction method that uses long reads only. The method consists of two phases: first, we use an iterative alignment-free correction method based on de Bruijn graphs with increasing length of k-mers, and second, the corrected reads are further polished using long-distance dependencies that are found using multiple alignments. According to our experiments, the proposed method is the most accurate one relying on long reads only for read sets with high coverage. Furthermore, when the coverage of the read set is at least 75×, the throughput of the new method is at least 20% higher. Availability and Implementation: LoRMA is freely available at http://www.cs.helsinki.fi/u/lmsalmel/LoRMA/. Contact: leena.salmela@cs.helsinki.fi.
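
The first phase relies on k-mer abundance at several values of k. The Python sketch below shows only that ingredient and is not LoRMA itself: it counts k-mers across a read set, keeps the abundant ("solid") ones, and flags the read positions that no solid k-mer covers. It does not build the graph or correct anything, and the function names, the `min_count` threshold, and the toy reads are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set


def solid_kmers(reads: Iterable[str], k: int, min_count: int) -> Set[str]:
    """k-mers seen at least min_count times across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count >= min_count}


def weak_positions(read: str, solid: Set[str], k: int) -> List[int]:
    """Positions of the read that are not covered by any solid k-mer."""
    covered = [False] * len(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in solid:
            for j in range(i, i + k):
                covered[j] = True
    return [i for i, ok in enumerate(covered) if not ok]


# Toy read set: three error-free copies and one copy with a substitution
# at position 4 (A replaced by T).
reads = ["ACGTACGGA"] * 3 + ["ACGTTCGGA"]
noisy = reads[-1]

# Repeat the analysis with an increasing k, as the iterative scheme suggests.
flagged = {k: weak_positions(noisy, solid_kmers(reads, k, 2), k) for k in (3, 4)}
print(flagged)
```

In this toy case both values of k isolate the substituted base, because every k-mer spanning it occurs only once in the read set.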


Subjects
DNA Sequence Analysis/methods, Software, Algorithms, Escherichia coli/genetics, Genome, High-Throughput Nucleotide Sequencing/methods, Saccharomyces cerevisiae/genetics
12.
Malar J ; 16(1): 493, 2017 Dec 19.
Article in English | MEDLINE | ID: mdl-29258508

ABSTRACT

BACKGROUND: Plasmodium falciparum malaria is one of the most widespread parasitic infections in humans and remains a leading global health concern. Malaria elimination efforts are threatened by the emergence and spread of resistance to artemisinin-based combination therapy, the first-line treatment of malaria. Promising molecular markers and pathways associated with artemisinin drug resistance have been identified, but the underlying molecular mechanisms of resistance remain unknown. Genomic data from the early period of emergence of artemisinin resistance (2008-2011) were evaluated with the aim of defining the k13-associated genetic background in Cambodia, the country identified as the epicentre of anti-malarial drug resistance, through characterization of 167 parasite isolates using a panel of 21,257 SNPs. RESULTS: Eight subpopulations were identified, suggesting a process of acquisition of artemisinin resistance consistent with an emergence-selection-diffusion model, supported by the shifting balance theory. Identification of population-specific mutations facilitated the characterization of a core set of 57 background genes associated with artemisinin resistance and associated pathways. The analysis indicates that the background of artemisinin resistance was not acquired after drug pressure, but rather is the result of fixation followed by selection on the daughter subpopulations derived from the ancestral population. CONCLUSIONS: Functional analysis of artemisinin resistance subpopulations illustrates the strong interplay between ubiquitination and cell division or differentiation in artemisinin-resistant parasites. The relationship of these pathways with the P. falciparum resistant subpopulation and the presence of drug resistance markers in addition to k13 highlight the major role of the admixed parasite population in the diffusion of the artemisinin-resistant background. The diffusion of resistant genes in the Cambodian admixed population after selection resulted from mating of gametocytes of sensitive and resistant parasite populations.


Subjects
Artemisinins/pharmacology, Drug Resistance, Falciparum Malaria/epidemiology, Plasmodium falciparum/drug effects, Plasmodium falciparum/genetics, Antimalarials/pharmacology, Cambodia/epidemiology, Genotype, Humans, Falciparum Malaria/parasitology, Mutation, Plasmodium falciparum/classification, Plasmodium falciparum/metabolism, Single Nucleotide Polymorphism, Protozoan Proteins/genetics
13.
BMC Bioinformatics ; 17(1): 237, 2016 Jun 16.
Article in English | MEDLINE | ID: mdl-27306641

ABSTRACT

BACKGROUND: Next Generation Sequencing (NGS) has dramatically enhanced our ability to sequence genomes, but not to assemble them. In practice, many published genome sequences remain in the state of a large set of contigs. Each contig describes the sequence found along some path of the assembly graph; however, the set of contigs does not record all the sequence information contained in that graph. Although many subsequent analyses can be performed with the set of contigs, one may ask whether mapping reads on the contigs is as informative as mapping them on the paths of the assembly graph. Currently, one lacks practical tools to perform mapping on such graphs. RESULTS: Here, we propose a formal definition of mapping on a de Bruijn graph, analyse the problem complexity, which turns out to be NP-complete, and provide a practical solution. We propose a pipeline called GGMAP (Greedy Graph MAPping). Its novelty is a procedure to map reads on branching paths of the graph, for which we designed a heuristic algorithm called BGREAT (de Bruijn Graph REAd mapping Tool). For the sake of efficiency, BGREAT rewrites a read sequence as a succession of unitig sequences. GGMAP can map millions of reads per CPU hour on a de Bruijn graph built from a large set of human genomic reads. Surprisingly, results show that up to 22% more reads can be mapped on the graph but not on the contig set. CONCLUSIONS: Although mapping reads on a de Bruijn graph is a complex task, our proposal offers a practical solution combining efficiency with an improved mapping capacity compared to assembly-based mapping, even for complex eukaryotic data.
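
Why a read can map on the graph but not on the contig set can be seen on a toy example. The Python sketch below is not GGMAP or BGREAT: it handles exact matches on the forward strand only, the graph is a plain set of k-mers, and the contigs are hand-written unitigs. All names and sequences are assumptions for illustration.

```python
from typing import Iterable, List, Set


def kmers(seq: str, k: int) -> List[str]:
    """All k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(sequences: Iterable[str], k: int) -> Set[str]:
    """Node set of a node-centric de Bruijn graph: two k-mers are linked
    whenever they overlap by k - 1 bases, so edges are implicit."""
    nodes: Set[str] = set()
    for seq in sequences:
        nodes.update(kmers(seq, k))
    return nodes


def maps_on_graph(read: str, nodes: Set[str], k: int) -> bool:
    """Exact mapping: consecutive k-mers of a read always overlap by k - 1,
    so the read spells a path as soon as all of its k-mers are nodes."""
    read_kmers = kmers(read, k)
    return bool(read_kmers) and all(kmer in nodes for kmer in read_kmers)


def maps_on_contigs(read: str, contigs: Iterable[str]) -> bool:
    """Exact mapping on contigs: the read must be a substring of one contig."""
    return any(read in contig for contig in contigs)


# Two sequences sharing the prefix AACGT create a branching node (CGT) at k = 3.
nodes = build_graph(["AACGTGG", "AACGTCC"], k=3)
# Hypothetical unitigs: contigs stop at the branching node.
contigs = ["AACGT", "GTGG", "GTCC"]

read = "ACGTGG"  # crosses the branching node
print(maps_on_graph(read, nodes, 3), maps_on_contigs(read, contigs))
```

The read crosses the branching node, so it follows a valid path of the graph while spanning two contigs, which is the situation the abstract quantifies for real data.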


Subjects
Escherichia coli/genetics, Human Genome, Genomics/methods, Algorithms, Contig Mapping, High-Throughput Nucleotide Sequencing, Humans, DNA Sequence Analysis
14.
Nucleic Acids Res ; 42(5): 2820-32, 2014 Mar.
Article in English | MEDLINE | ID: mdl-24357408

ABSTRACT

Recent sequencing technologies that allow massive parallel production of short reads are the method of choice for transcriptome analysis. Particularly, digital gene expression (DGE) technologies produce a large dynamic range of expression data by generating short tag signatures for each cell transcript. These tags can be mapped back to a reference genome to identify new transcribed regions that can be further covered by RNA-sequencing (RNA-Seq) reads. Here, we applied an integrated bioinformatics approach that combines DGE tags, RNA-Seq, tiling array expression data and species-comparison to explore new transcriptional regions and their specific biological features, particularly tissue expression or conservation. We analysed tags from a large DGE data set (designated as 'TranscriRef'). We then annotated 750,000 tags that were uniquely mapped to the human genome according to Ensembl. We retained transcripts originating from both DNA strands and categorized tags corresponding to protein-coding genes, antisense, intronic- or intergenic-transcribed regions and computed their overlap with annotated non-coding transcripts. Using this bioinformatics approach, we identified ∼34,000 novel transcribed regions located outside the boundaries of known protein-coding genes. As demonstrated using sequencing data from human pluripotent stem cells for biological validation, the method could be easily applied for the selection of tissue-specific candidate transcripts. DigitagCT is available at http://cractools.gforge.inria.fr/softwares/digitagct.


Subjects
Gene Expression Profiling/methods, Human Genome, Untranslated RNA/analysis, RNA Sequence Analysis/methods, Cell Line, Humans, Molecular Sequence Annotation, Poly A/analysis, Software, Genetic Transcription
15.
BMC Bioinformatics ; 16: 111, 2015 Apr 02.
Article in English | MEDLINE | ID: mdl-25885358

ABSTRACT

BACKGROUND: Comparing and aligning genomes is a key step in analyzing closely related genomes. Despite the development of many genome aligners in the last 15 years, the problem is not yet fully resolved, even when aligning closely related bacterial genomes of the same species. In addition, no procedures are available to assess the quality of genome alignments or to compare genome aligners. RESULTS: We designed an original method for pairwise genome alignment, named YOC, which employs a highly sensitive similarity detection method together with a recent collinear chaining strategy that allows overlaps. YOC improves the reliability of collinear genome alignments, while preserving or even improving sensitivity. We also propose an original qualitative evaluation criterion for measuring the relevance of genome alignments. We used this criterion to compare and benchmark YOC with five recent genome aligners on large bacterial genome datasets, and showed it is suitable for identifying the specificities and the potential flaws of their underlying strategies. CONCLUSIONS: The YOC prototype is available at https://github.com/ruricaru/YOC. It has several advantages over existing genome aligners: (1) it is based on a simplified two-phase alignment strategy, (2) it is easy to parameterize, (3) it produces reliable genome alignments, which are easier to analyze and to use.


Subjects
User-Computer Interface, Algorithms, Comparative Genomic Hybridization, Bacterial Genome, Internet, Lactococcus lactis/genetics, Sequence Alignment
16.
Bioinformatics ; 30(24): 3506-14, 2014 Dec 15.
Article in English | MEDLINE | ID: mdl-25165095

ABSTRACT

MOTIVATION: PacBio single molecule real-time sequencing is a third-generation sequencing technique producing long reads, with comparatively lower throughput and higher error rate. Errors include numerous indels and complicate downstream analyses such as mapping or de novo assembly. A hybrid strategy that takes advantage of the high accuracy of second-generation short reads has been proposed for correcting long reads. Mapping of short reads on long reads provides sufficient coverage to eliminate up to 99% of errors, but at the expense of prohibitive running times and considerable amounts of disk and memory space. RESULTS: We present LoRDEC, a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads, and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph. In comparison, LoRDEC is at least six times faster and requires at least 93% less memory or disk space than available tools, while achieving comparable accuracy. Availability and implementation: LoRDEC is written in C++, tested on Linux platforms and freely available at http://atgc.lirmm.fr/lordec.
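
The idea of seeking a corrective sequence by traversing paths in the graph can be sketched in Python. This is not LoRDEC's implementation: the graph is a plain set of k-mers rather than a succinct structure, the search is an exhaustive bounded depth-first traversal, and the best path is picked by plain edit distance. The function names, the `max_len` bound, and the toy reads are assumptions for illustration.

```python
from typing import List, Set


def successors(kmer: str, nodes: Set[str]) -> List[str]:
    """Graph neighbours obtained by shifting one base to the right."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in nodes]


def bridge(nodes: Set[str], source: str, target: str, max_len: int) -> List[str]:
    """Sequences spelled by paths from source to target, at most max_len bases."""
    found, stack = [], [(source, source)]
    while stack:
        node, spelled = stack.pop()
        if node == target:
            found.append(spelled)
            continue
        if len(spelled) >= max_len:  # the bound also stops cycles
            continue
        for nxt in successors(node, nodes):
            stack.append((nxt, spelled + nxt[-1]))
    return found


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


# Graph built from one accurate short read (k = 3).
short_read = "ACGTACGGA"
nodes = {short_read[i:i + 3] for i in range(len(short_read) - 2)}

# Erroneous region of a long read, delimited by two k-mers present in the graph.
weak_region = "CGTTCGG"
candidates = bridge(nodes, source="CGT", target="CGG", max_len=8)
corrected = min(candidates, key=lambda path: edit_distance(path, weak_region))
print(corrected)
```

Bounding the path length keeps the traversal finite even though this toy graph contains a cycle, and choosing the candidate closest to the original region avoids replacing it with an unrelated path.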


Subjects
High-Throughput Nucleotide Sequencing/methods, Algorithms, Animals, Escherichia coli/genetics, Genomics/methods, Parrots/genetics, Software, Yeasts/genetics
17.
BMC Genomics ; 15: 1103, 2014 Dec 13.
Article in English | MEDLINE | ID: mdl-25494611

ABSTRACT

BACKGROUND: Cost effective next generation sequencing technologies now enable the production of genomic datasets for many novel planktonic eukaryotes, representing an understudied reservoir of genetic diversity. O. tauri is the smallest free-living photosynthetic eukaryote known to date, a coccoid green alga that was first isolated in 1995 in a lagoon by the Mediterranean sea. Its simple features, ease of culture and the sequencing of its 13 Mb haploid nuclear genome have promoted this microalga as a new model organism for cell biology. Here, we investigated the quality of genome assemblies of Illumina GAIIx 75 bp paired-end reads from Ostreococcus tauri, thereby also improving the existing assembly and showing the genome to be stably maintained in culture. RESULTS: The 3 assemblers used, ABySS, CLCBio and Velvet, produced 95% complete genomes in 1402 to 2080 scaffolds with a very low rate of misassembly. Reciprocally, these assemblies improved the original genome assembly by filling in 930 gaps. Combined with additional analysis of raw reads and PCR sequencing effort, 1194 gaps have been solved in total adding up to 460 kb of sequence. Mapping of RNAseq Illumina data on this updated genome led to a twofold reduction in the proportion of multi-exon protein coding genes, representing 19% of the total 7699 protein coding genes. The comparison of the DNA extracted in 2001 and 2009 revealed the fixation of 8 single nucleotide substitutions and 2 deletions during the approximately 6000 generations in the lab. The deletions either knocked out or truncated two predicted transmembrane proteins, including a glutamate-receptor like gene. CONCLUSION: High coverage (>80 fold) paired-end Illumina sequencing enables a high quality 95% complete genome assembly of a compact ~13 Mb haploid eukaryote. This genome sequence has remained stable for 6000 generations of lab culture.


Subjects
Chlorophyta/genetics, Plant Genome, Genomics, Computational Biology, Molecular Evolution, Genetic Variation, High-Throughput Nucleotide Sequencing, Molecular Sequence Annotation, Molecular Sequence Data
18.
Genome Res ; 21(9): 1438-49, 2011 Sep.
Article in English | MEDLINE | ID: mdl-21750104

ABSTRACT

In metazoans, thousands of DNA replication origins (Oris) are activated at each cell cycle. Their genomic organization and their genetic nature remain elusive. Here, we characterized Oris by nascent strand (NS) purification and a genome-wide analysis in Drosophila and mouse cells. We show that in both species most CpG islands (CGI) contain Oris, although methylation is nearly absent in Drosophila, indicating that this epigenetic mark is not crucial for defining the activated origin. Initiation of DNA synthesis starts at the borders of CGI, resulting in a striking bimodal distribution of NS, suggestive of a dual initiation event. Oris contain a unique nucleotide skew around NS peaks, characterized by G/T and C/A overrepresentation at the 5' and 3' of Ori sites, respectively. Repeated GC-rich elements were detected, which are good predictors of Oris, suggesting that common sequence features are part of metazoan Oris. In the heterochromatic chromosome 4 of Drosophila, Oris correlated with HP1 binding sites. At the chromosome level, regions rich in Oris are early replicating, whereas Ori-poor regions are late replicating. The genome-wide analysis was coupled with a DNA combing analysis to unravel the organization of Oris. The results indicate that Oris are in a large excess, but their activation does not occur at random. They are organized in groups of site-specific but flexible origins that define replicons, where a single origin is activated in each replicon. This organization provides both site specificity and Ori firing flexibility in each replicon, allowing possible adaptation to environmental cues and cell fates.
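
The nucleotide skew mentioned above (G/T-rich on the 5' side of a site and C/A-rich on the 3' side) can be quantified with a simple statistic. The Python sketch below is an assumption made for illustration: the abstract does not give the formula used in the study, and the function names, the flank size, and the toy sequence are hypothetical.

```python
from typing import Tuple


def gt_skew(window: str) -> float:
    """(G + T - C - A) / length: positive if G/T-rich, negative if C/A-rich."""
    if not window:
        raise ValueError("empty window")
    w = window.upper()
    gt = w.count("G") + w.count("T")
    ca = w.count("C") + w.count("A")
    return (gt - ca) / len(w)


def flank_skews(sequence: str, peak: int, flank: int) -> Tuple[float, float]:
    """Skew of the flank upstream (5') and downstream (3') of a peak position."""
    upstream = sequence[max(0, peak - flank):peak]
    downstream = sequence[peak:peak + flank]
    return gt_skew(upstream), gt_skew(downstream)


# Made-up sequence: a G/T-only flank followed by a C/A-only flank.
toy = "GTGTGGTT" + "CACACCAA"
five_prime, three_prime = flank_skews(toy, peak=8, flank=8)
print(five_prime, three_prime)
```

A sign change of this statistic across a candidate site is the kind of pattern the abstract describes around nascent strand peaks.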


Subjects
DNA Replication/genetics , Genomics , Replication Origin/genetics , Animals , Base Sequence , Binding Sites/genetics , Cell Line , Non-Histone Chromosomal Proteins/metabolism , Chromosome Mapping , Conserved Sequence/genetics , CpG Islands , Drosophila/genetics , Heterochromatin/genetics , Mice , Genetic Promoter Regions , Genetic Transcription
19.
Nucleic Acids Res ; 39(15): e101, 2011 Aug.
Article in English | MEDLINE | ID: mdl-21646341

ABSTRACT

Genome comparison is now a crucial step for genome annotation and identification of regulatory motifs. Genome comparison aims, for instance, at finding genomic regions either specific to or in one-to-one correspondence between individuals/strains/species. It serves, e.g., to pre-annotate a new genome by automatically transferring annotations from a known one. However, the efficiency, flexibility and objectives of current methods do not suit the whole spectrum of applications, genome sizes and organizations. Innovative approaches are still needed. Hence, we propose an alternative way of comparing multiple genomes based on segmentation by similarity. In this framework, rather than being formulated as a complex optimization problem, genome comparison is seen as a segmentation question for which a single optimal solution can be found in almost linear time. We apply our method to analyse three strains of a virulent pathogenic bacterium, Ehrlichia ruminantium, and identify 92 new genes. We also find that a substantial number of genes thought to be strain specific have potential orthologs in the other strains. Our solution is implemented in an efficient program, qod, equipped with a user-friendly interface, and enables the automatic transfer of annotations between compared genomes or contigs (Video in Supplementary Data). Because it disregards, to some extent, the relative order of genomic blocks, qod can handle unfinished genomes, which, given the difficulty of completing sequencing, may become an interesting characteristic for the future. Availability: http://www.atgc-montpellier.fr/qod.


Subjects
Genomics/methods , Software , Algorithms , Ehrlichia ruminantium/classification , Ehrlichia ruminantium/genetics , Bacterial Genes , Bacterial Genome , Species Specificity
20.
IEEE/ACM Trans Comput Biol Bioinform ; 20(5): 2889-2897, 2023.
Article in English | MEDLINE | ID: mdl-37204943

ABSTRACT

Finding the correct position of new sequences within an established phylogenetic tree is an increasingly relevant problem in evolutionary bioinformatics and metagenomics. Recently, alignment-free approaches for this task have been proposed. One such approach is based on the concept of phylogenetically-informative k-mers, or phylo-k-mers for short. In practice, phylo-k-mers are inferred from a set of related reference sequences and are equipped with scores expressing the probability of their appearance in different locations within the input reference phylogeny. Computing phylo-k-mers, however, represents a computational bottleneck to their applicability in real-world problems such as the phylogenetic analysis of metabarcoding reads and the detection of novel recombinant viruses. Here we consider the problem of phylo-k-mer computation: how can we efficiently find all k-mers whose probability lies above a given threshold for a given tree node? We describe and analyze algorithms for this problem, relying on branch-and-bound and divide-and-conquer techniques. We exploit the redundancy of adjacent windows of the alignment to save on computation. Besides computational complexity analyses, we provide an empirical evaluation of the relative performance of their implementations on simulated and real-world data. The divide-and-conquer algorithms are found to surpass the branch-and-bound approach, especially when many phylo-k-mers are found.
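
The branch-and-bound idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a tree node is summarized, for one window of k alignment columns, by an independent probability distribution over A, C, G and T per column, and that a k-mer's score is the product of its per-column probabilities. The function name `phylo_kmers_branch_and_bound`, the input layout and the example values are hypothetical.

```python
from typing import Dict, List, Tuple

ALPHABET = "ACGT"


def phylo_kmers_branch_and_bound(
    window: List[Dict[str, float]], threshold: float
) -> List[Tuple[str, float]]:
    """Return all k-mers whose score is at or above the threshold.

    window[i] maps each base to its probability at column i (assumed
    independent columns; a simplification made for this sketch).
    """
    k = len(window)
    # best_suffix[i] = highest score any completion of columns i..k-1 can reach.
    best_suffix = [1.0] * (k + 1)
    for i in range(k - 1, -1, -1):
        best_suffix[i] = max(window[i][b] for b in ALPHABET) * best_suffix[i + 1]

    results: List[Tuple[str, float]] = []

    def extend(prefix: str, score: float, i: int) -> None:
        if i == k:
            results.append((prefix, score))
            return
        for base in ALPHABET:
            new_score = score * window[i][base]
            # Bound: prune if even the best possible suffix stays below threshold.
            if new_score * best_suffix[i + 1] < threshold:
                continue
            extend(prefix + base, new_score, i + 1)

    extend("", 1.0, 0)
    return results


# Hypothetical two-column window: the first column strongly favors A.
example_window = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
print([kmer for kmer, _ in phylo_kmers_branch_and_bound(example_window, 0.15)])
# prints ['AA', 'AC', 'AG', 'AT']
```

In this sketch the pruning test discards a whole subtree of extensions as soon as a prefix cannot reach the threshold, which is what lets the search avoid enumerating all 4^k candidates when few k-mers qualify.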
