Results 1 - 11 of 11
1.
Nat Methods ; 14(11): 1063-1071, 2017 Nov.
Article in English | MEDLINE | ID: mdl-28967888

ABSTRACT

Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ∼700 newly sequenced microorganisms and ∼600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.


Subjects
Metagenomics, Software, Algorithms, Benchmarking, DNA Sequence Analysis
2.
Nucleic Acids Res ; 45(8): e57, 2017 05 05.
Article in English | MEDLINE | ID: mdl-28053114

ABSTRACT

Whole transcriptome sequencing (RNA-seq) has become a standard for cataloguing and monitoring RNA populations. One of the main bottlenecks, however, is to correctly identify the different classes of RNAs among the plethora of reconstructed transcripts, particularly those that will be translated (mRNAs) from the class of long non-coding RNAs (lncRNAs). Here, we present FEELnc (FlExible Extraction of LncRNAs), an alignment-free program that accurately annotates lncRNAs based on a Random Forest model trained with general features such as multi k-mer frequencies and relaxed open reading frames. Benchmarking versus five state-of-the-art tools shows that FEELnc achieves similar or better classification performance on GENCODE and NONCODE data sets. The program also provides specific modules that enable the user to fine-tune classification accuracy, to formalize the annotation of lncRNA classes and to identify lncRNAs even in the absence of a training set of non-coding RNAs. We used FEELnc on a real data set comprising 20 canine RNA-seq samples produced by the European LUPA consortium to substantially expand the canine genome annotation to include 10 374 novel lncRNAs and 58 640 mRNA transcripts. FEELnc moves beyond conventional coding potential classifiers by providing a standardized and complete solution for annotating lncRNAs and is freely available at https://github.com/tderrien/FEELnc.
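The "multi k-mer frequencies" mentioned in the abstract can be illustrated with a minimal sketch. This is not FEELnc's actual feature computation: the function names, the choice of k values and the normalisation are assumptions made for illustration, and the Random Forest step is not shown.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Relative frequency of every possible k-mer over A, C, G, T (hypothetical helper)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    # Fixed ordering so that every transcript yields a vector of the same layout.
    return [counts["".join(p)] / total if total else 0.0
            for p in product("ACGT", repeat=k)]


def multi_kmer_features(seq, ks=(1, 2, 3)):
    """Concatenate the frequency vectors for several k (the ks default is an assumption)."""
    features = []
    for k in ks:
        features.extend(kmer_frequencies(seq, k))
    return features
```

A vector like this, one per transcript, is the kind of alignment-free input a classifier could be trained on; k-mers containing ambiguous bases such as N simply contribute to no feature.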


Subjects
Genome, Molecular Sequence Annotation/methods, Long Noncoding RNA/genetics, Software, Transcriptome, Animals, Benchmarking, Decision Trees, Dogs, Gene Expression Regulation, Humans, Mice, Molecular Sequence Annotation/statistics & numerical data, Open Reading Frames, Long Noncoding RNA/classification, Long Noncoding RNA/metabolism, Messenger RNA/classification, Messenger RNA/genetics, Messenger RNA/metabolism, RNA Sequence Analysis
3.
Nucleic Acids Res ; 43(2): e11, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25404127

ABSTRACT

Detecting single nucleotide polymorphisms (SNPs) between genomes is becoming a routine task with next-generation sequencing. Generally, SNP detection methods use a reference genome. As non-model organisms are increasingly investigated, the need for reference-free methods has been amplified. Most of the existing reference-free methods have fundamental limitations: they can only call SNPs between exactly two datasets, and/or they require a prohibitive amount of computational resources. The method we propose, discoSnp, detects both heterozygous and homozygous isolated SNPs from any number of read datasets, without a reference genome, and with very low memory and time footprints (billions of reads can be analyzed with a standard desktop computer). To facilitate downstream genotyping analyses, discoSnp ranks predictions and outputs quality and coverage per allele. Compared to finding isolated SNPs using a state-of-the-art assembly and mapping approach, discoSnp requires significantly fewer computational resources and shows similar precision/recall values, and its highly ranked predictions are less likely to be false positives. An experimental validation was conducted on an arthropod species (the tick Ixodes ricinus) on which de novo sequencing was performed. Among the predicted SNPs that were tested, 96% were successfully genotyped and truly exhibited polymorphism.


Subjects
Genotyping Techniques/methods, Single Nucleotide Polymorphism, Algorithms, Animals, Human Chromosome Pair 1, Escherichia coli/genetics, Genomics/methods, Humans, Ixodes/genetics, Mice, Inbred C57BL Mice, Saccharomyces cerevisiae/genetics
4.
BMC Bioinformatics ; 16: 288, 2015 Sep 14.
Article in English | MEDLINE | ID: mdl-26370285

ABSTRACT

BACKGROUND: Data volumes generated by next-generation sequencing (NGS) technologies are now a major concern for both data storage and transmission. This has triggered the need for more efficient methods than general-purpose compression tools, such as the widely used gzip method. RESULTS: We present a novel reference-free method meant to compress data issued from high-throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring k-mer and a list of bifurcations. The same probabilistic de Bruijn graph is used to perform a lossy transformation of the quality scores, which yields higher compression rates without losing pertinent information for downstream analyses. CONCLUSIONS: LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq or metagenomics). In all cases, LEON showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole genome sequencing dataset, LEON divided the original file size by more than 20. LEON is open-source software, distributed under the GNU Affero GPL license, available for download at http://gatb.inria.fr/software/leon/.
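The idea of encoding a read as an anchoring k-mer plus a list of bifurcations can be sketched as follows. This is a simplified illustration, not LEON's implementation. It rests on several assumptions: an exact Python set stands in for the Bloom filter, every k-mer of every read is kept (so each read is guaranteed to be a path), the anchor is always the first k-mer of the read, and reverse complements and quality scores are ignored. All function names are hypothetical.

```python
def build_kmer_set(reads, k):
    """Exact k-mer set standing in for the probabilistic de Bruijn graph (assumption)."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def successors(kmer, graph):
    """Bases that extend `kmer` to another k-mer present in the graph."""
    return [b for b in "ACGT" if kmer[1:] + b in graph]


def encode_read(read, k, graph):
    """Store the anchor, the read length, and a base only where the path branches."""
    anchor = read[:k]
    bifurcations = []
    kmer = anchor
    for base in read[k:]:
        if len(successors(kmer, graph)) != 1:
            bifurcations.append(base)  # ambiguous step: record the branch taken
        kmer = kmer[1:] + base
    return anchor, len(read), bifurcations


def decode_read(anchor, length, bifurcations, graph):
    """Walk the same graph, consuming a stored base at each branching node."""
    read = list(anchor)
    kmer = anchor
    choices = iter(bifurcations)
    while len(read) < length:
        nxt = successors(kmer, graph)
        base = nxt[0] if len(nxt) == 1 else next(choices)
        read.append(base)
        kmer = kmer[1:] + base
    return "".join(read)
```

With the reads `ACGTACGGA` and `ACGTACCTT` and k = 4, the second read encodes to the anchor `ACGT` plus a single stored base, because the path branches only once. Decoding must use exactly the same graph as encoding, which is why the graph itself has to be stored alongside the encoded reads.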


Subjects
Algorithms, Caenorhabditis elegans Proteins/genetics, Caenorhabditis elegans/genetics, Computer Graphics, Data Compression/methods, High-Throughput Nucleotide Sequencing/methods, Software, Animals, Computational Biology/methods, Computer Simulation, Metagenomics, Probability
5.
Bioinformatics ; 30(24): 3451-7, 2014 Dec 15.
Article in English | MEDLINE | ID: mdl-25123898

ABSTRACT

MOTIVATION: Insertions play an important role in genome evolution. However, such variants are difficult to detect from short-read sequencing data, especially when they exceed the paired-end insert size. Many approaches have been proposed to call short insertion variants based on paired-end mapping. However, there remains a lack of practical methods to detect and assemble long variants. RESULTS: We propose here an original method, called MindTheGap, for the integrated detection and assembly of insertion variants from re-sequencing data. Importantly, it is designed to call insertions of any size, whether they are novel or duplicated, homozygous or heterozygous in the donor genome. MindTheGap uses an efficient k-mer-based method to detect insertion sites in a reference genome, and subsequently assembles them from the donor reads. MindTheGap showed high recall and precision on simulated datasets of various genome complexities. When applied to real Caenorhabditis elegans and human NA12878 datasets, MindTheGap detected and correctly assembled insertions >1 kb, using at most 14 GB of memory.


Subjects
Insertional Mutagenesis, DNA Sequence Analysis/methods, Animals, Caenorhabditis elegans/genetics, Genome, Humans, Software
6.
Bioinformatics ; 30(20): 2959-61, 2014 Oct 15.
Article in English | MEDLINE | ID: mdl-24990603

ABSTRACT

MOTIVATION: Efficient and fast next-generation sequencing (NGS) algorithms are essential to analyze the terabytes of data generated by the NGS machines. A serious bottleneck can be the design of such algorithms, as they require sophisticated data structures and advanced hardware implementation. RESULTS: We propose an open-source library dedicated to genome assembly and analysis to speed up the development of efficient software. The library is based on a recent optimized de Bruijn graph implementation allowing complex genomes to be processed on desktop computers using fast algorithms with low memory footprints. AVAILABILITY AND IMPLEMENTATION: The GATB library is written in C++ and is available at http://gatb.inria.fr under the A-GPL license. CONTACT: lavenier@irisa.fr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Biostatistics/methods, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Software, Algorithms, Computer Graphics, Human Genome/genetics, Humans
7.
Bioinformatics ; 29(5): 652-3, 2013 Mar 01.
Article in English | MEDLINE | ID: mdl-23325618

ABSTRACT

SUMMARY: Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state-of-the-art k-mer counting methods require that a large data structure resides in memory. Such a structure typically grows with the number of distinct k-mers to count. We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed user-defined amount of memory and disk space. This approach realizes a memory, time and disk trade-off. The multi-set of all k-mers present in the reads is partitioned, and partitions are saved to disk. Then, each partition is separately loaded in memory in a temporary hash table. The k-mer counts are returned by traversing each hash table. Low-abundance k-mers are optionally filtered. DSK is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 h. DSK can replace a popular k-mer counting program (Jellyfish) on small-memory servers. AVAILABILITY: http://minia.genouest.org/dsk
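The partition-then-count scheme described in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not DSK itself: the function names, the CRC32 hash used to assign k-mers to partitions, and the plain-text partition files are all hypothetical choices. The sketch also writes every partition in a single pass and ignores reverse complements.

```python
import os
import tempfile
from collections import Counter
from zlib import crc32


def iter_kmers(read, k):
    """Yield every substring of length k of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers_partitioned(reads, k, n_partitions=4, min_abundance=1):
    """Yield (k-mer, count) pairs while holding one partition in memory at a time."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"part_{p}.txt") for p in range(n_partitions)]
        files = [open(path, "w") for path in paths]
        try:
            # Step 1: stream the k-mers to disk; equal k-mers share a partition.
            for read in reads:
                for kmer in iter_kmers(read, k):
                    files[crc32(kmer.encode()) % n_partitions].write(kmer + "\n")
        finally:
            for f in files:
                f.close()
        # Step 2: load each partition into a temporary hash table and traverse it.
        for path in paths:
            table = Counter()
            with open(path) as f:
                for line in f:
                    table[line.strip()] += 1
            for kmer, count in table.items():
                if count >= min_abundance:  # optional low-abundance filter
                    yield kmer, count
```

Because identical k-mers always hash to the same partition, each per-partition table yields final counts, and peak memory is bounded by the largest partition rather than by the whole k-mer set. Raising `n_partitions` trades more files for smaller tables.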


Subjects
DNA Sequence Analysis/methods, RNA Sequence Analysis/methods, Software, Algorithms, Human Genome, Humans
8.
Microorganisms ; 10(4), 2022 Mar 25.
Article in English | MEDLINE | ID: mdl-35456762

ABSTRACT

Metagenomics analysis is now routinely used for clinical diagnosis in several diseases, and we need confidence in interpreting metagenomics analysis of microbiota. Particularly from the standpoint of clinical microbiology, we consider that it would be a major milestone to further advance microbiota studies with an innovative and significant approach consisting of processing steps and quality assessment for interpreting metagenomics data used for diagnosis. Here, we propose a methodology for taxon identification and abundance assessment of microbial shotgun sequencing data that is well suited to a clinical setup. Quality-control processing steps have been developed in order (i) to avoid low-quality reads and sequences, (ii) to optimize abundance thresholds and profiles, (iii) to combine classifiers and reference databases for the best classification of species and abundance profiles for both prokaryotic and eukaryotic sequences, and (iv) to introduce an external positive control. We find that the best strategy is to use a pipeline composed of a combination of different but complementary classifiers such as Kraken2/Bracken and Kaiju. Such improved quality assessment will have a major impact on the robustness of biological and clinical conclusions drawn from metagenomic studies.

9.
Bioinformatics ; 26(20): 2534-40, 2010 Oct 15.
Article in English | MEDLINE | ID: mdl-20739310

ABSTRACT

MOTIVATION: The rapid development of next-generation sequencing technologies able to produce huge amounts of sequence data is leading to a wide range of new applications. This triggers the need for fast and accurate alignment software. Common techniques often restrict indels in the alignment to improve speed, whereas more flexible aligners are too slow for large-scale applications. Moreover, many current aligners are becoming inefficient as generated reads grow ever larger. Our goal with our new aligner GASSST (Global Alignment Short Sequence Search Tool) is thus twofold: achieving high performance with no restrictions on the number of indels, with a design that is still effective on long reads. RESULTS: We propose a new efficient filtering step that discards most alignments coming from the seed phase before they are checked by the costly dynamic programming algorithm. We use a carefully designed series of filters of increasing complexity and efficiency to quickly eliminate most candidate alignments in a wide range of configurations. The main filter uses a precomputed table containing the alignment score of short four-base words aligned against each other. This table is reused several times by a new algorithm designed to approximate the score of the full dynamic programming algorithm. We compare the performance of GASSST against BWA, BFAST, SSAHA2 and PASS. We found that GASSST achieves high sensitivity in a wide range of configurations and faster overall execution time than other state-of-the-art aligners. AVAILABILITY: GASSST is distributed under the CeCILL software license at http://www.irisa.fr/symbiose/projects/gassst/ CONTACT: guillaume.rizk@irisa.fr; dominique.lavenier@irisa.fr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
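The precomputed table of scores between four-base words can be illustrated with a small sketch. The abstract does not give the scoring scheme or say how the table is combined, so everything here is an assumption made for illustration: unit-cost edit distance is used as the score, and the table is applied to aligned, non-overlapping four-base blocks. It should not be read as GASSST's actual filter.

```python
from itertools import product

WORD = 4
WORDS = ["".join(p) for p in product("ACGT", repeat=WORD)]


def edit_distance(a, b):
    """Unit-cost edit distance by the standard dynamic programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


# Precompute the score of every pair of four-base words (256 x 256 entries),
# using symmetry so that each unordered pair is computed only once.
TABLE = {}
for i, a in enumerate(WORDS):
    for b in WORDS[i:]:
        TABLE[a, b] = TABLE[b, a] = edit_distance(a, b)


def tiled_estimate(read, ref):
    """Sum table lookups over aligned 4-base blocks (hypothetical tiling scheme)."""
    if len(read) != len(ref) or len(read) % WORD:
        raise ValueError("sketch expects equal lengths that are multiples of 4")
    return sum(TABLE[read[i:i + WORD], ref[i:i + WORD]]
               for i in range(0, len(read), WORD))
```

Each estimate costs one table lookup per four bases instead of a full dynamic programming matrix. Under this particular tiling the estimate never falls below the true edit distance, since aligning block by block is itself a valid alignment. It is therefore an upper bound, and a filter meant to discard candidates safely would need a bound in the other direction.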


Subjects
Algorithms, Sequence Alignment/methods, Software, Base Sequence, DNA Sequence Analysis/methods
10.
BMC Genomics ; 11: 281, 2010 May 05.
Article in English | MEDLINE | ID: mdl-20444247

ABSTRACT

BACKGROUND: Post-transcriptional regulation in eukaryotes can be operated through microRNA (miRNA)-mediated gene silencing. MiRNAs are small (18-25 nucleotides) non-coding RNAs that play a crucial role in the regulation of gene expression in eukaryotes. In insects, miRNAs have been shown to be involved in multiple mechanisms such as embryonic development, tissue differentiation, metamorphosis or circadian rhythm. Insect miRNAs have been identified in different species belonging to five orders: Coleoptera, Diptera, Hymenoptera, Lepidoptera and Orthoptera. RESULTS: We developed high-throughput Solexa sequencing and bioinformatic analyses of the genome of the pea aphid Acyrthosiphon pisum in order to identify the first miRNAs from a hemipteran insect. By combining these methods we identified 149 miRNAs including 55 conserved and 94 new miRNAs. Moreover, we investigated the regulation of these miRNAs in different alternative morphs of the pea aphid by analysing the expression of miRNAs across the switch of reproduction mode. Pea aphid microRNA sequences have been posted to miRBase: http://microrna.sanger.ac.uk/sequences/. CONCLUSIONS: Our study has identified candidates as putative regulators involved in reproductive polyphenism in aphids and opens new avenues for further functional analyses.


Subjects
Aphids/genetics, Gene Expression Profiling, MicroRNAs/analysis, Animals, Base Sequence, MicroRNAs/genetics
11.
Algorithms Mol Biol ; 8(1): 22, 2013 Sep 16.
Article in English | MEDLINE | ID: mdl-24040893

ABSTRACT

BACKGROUND: The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, e.g. de novo assemblers, rely on in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require a large amount of memory (≥30 GB). RESULTS: We propose a new encoding of the de Bruijn graph, which occupies an order of magnitude less space than current representations. The encoding is based on a Bloom filter, with an additional structure to remove critical false positives. CONCLUSIONS: Minia, an assembler implementing this structure, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.
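The encoding described in the abstract, a Bloom filter plus an additional structure that removes critical false positives, can be sketched as follows. This is a minimal illustration under stated assumptions, not Minia's implementation. The class and function names, the BLAKE2b-based hashing and the default filter size are all hypothetical, and reverse complements are ignored. The exact k-mer set is also held in memory during construction, which a memory-efficient implementation would need to avoid.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over strings (illustrative, not tuned)."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=seed.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


def neighbors(kmer):
    """The 8 candidate k-mers overlapping `kmer` by k-1 bases on either side."""
    for base in "ACGT":
        yield kmer[1:] + base
        yield base + kmer[:-1]


def build_graph(kmers, n_bits=1024, n_hashes=3):
    """Return the Bloom filter and the set of critical false positives."""
    bloom = BloomFilter(n_bits, n_hashes)
    for kmer in kmers:
        bloom.add(kmer)
    # Critical false positives: neighbours of true k-mers that the filter
    # wrongly reports as present. Only these can mislead a traversal.
    critical_fp = {nb for kmer in kmers for nb in neighbors(kmer)
                   if nb in bloom and nb not in kmers}
    return bloom, critical_fp


def graph_neighbors(kmer, bloom, critical_fp):
    """Neighbours of a true k-mer, with critical false positives removed."""
    return {nb for nb in neighbors(kmer) if nb in bloom and nb not in critical_fp}
```

A traversal that starts from a true k-mer and only ever calls `graph_neighbors` sees exactly the true graph, even with a deliberately undersized filter. Other false positives may remain in the Bloom filter, but they are never adjacent to a true node and so are never reached.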
