1.
Mol Biol Evol ; 40(3)2023 03 04.
Article in English | MEDLINE | ID: mdl-36790822

ABSTRACT

Genomic regions under positive selection harbor variation linked, for example, to adaptation. Most tools for detecting positively selected variants have computational resource requirements that render them impractical on population genomic datasets with hundreds of thousands of individuals or more. We have developed and implemented an efficient haplotype-based approach able to scan large datasets and accurately detect positive selection. We achieve this by combining a pattern matching approach based on the positional Burrows-Wheeler transform with model-based inference which only requires the evaluation of closed-form expressions. We evaluate our approach with simulations, and find it to be both sensitive and specific. The computational resource requirements quantified using UK Biobank data indicate that our implementation is scalable to population genomic datasets with millions of individuals. Our approach may serve as an algorithmic blueprint for the era of "big data" genomics: a combinatorial core coupled with statistical inference in closed form.


Subjects
Population Genetics, Metagenomics, Genomics, Genome, Haplotypes
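
The positional Burrows-Wheeler transform named in the abstract above can be illustrated with a minimal sketch. It assumes biallelic haplotypes encoded as lists of 0/1 and shows only the standard prefix-ordering step of the transform; the function name `pbwt_prefix_order` is hypothetical, and this is not the authors' selection-scan implementation.

```python
def pbwt_prefix_order(haplotypes):
    """Return, for each site k, the haplotype indices sorted by their
    reversed prefix ending at site k (the PBWT positional ordering).

    Assumption: `haplotypes` is a list of equal-length lists of 0/1 alleles.
    """
    n_sites = len(haplotypes[0]) if haplotypes else 0
    order = list(range(len(haplotypes)))  # initial order: input order
    orders = []
    for k in range(n_sites):
        # Stable partition by the allele at site k: haplotypes carrying 0
        # come first, then those carrying 1, each keeping the previous order.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
        orders.append(order)
    return orders


haps = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]
final_order = pbwt_prefix_order(haps)[-1]
print(final_order)
```

Each site costs one linear pass over the haplotypes, and haplotypes that share a long match ending at a site end up adjacent in that site's ordering. This is the property that pattern matching on the transform relies on.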
2.
Bioinformatics ; 39(5)2023 05 04.
Article in English | MEDLINE | ID: mdl-37195454

ABSTRACT

MOTIVATION: High-throughput sequencing technologies generate massive amounts of biological sequence datasets as costs fall. One of the current algorithmic challenges for exploiting these data on a global scale consists in providing efficient query engines on these petabyte-scale datasets. Most methods indexing those datasets rely on indexing words of fixed length k, called k-mers. Many applications, such as metagenomics, require the abundance of indexed k-mers as well as their simple presence or absence, but no method scales up to petabyte-scale datasets. This deficiency is primarily because storing abundance requires explicit storage of the k-mers in order to associate them with their counts. Using counting Approximate Membership Queries (cAMQ) data structures, such as counting Bloom filters, provides a way to index large amounts of k-mers with their abundance, but at the expense of a significant false positive rate. RESULTS: We propose a novel algorithm, called fimpera, that improves the performance of any cAMQ. Applied to counting Bloom filters, our proposed algorithm reduces the false positive rate by two orders of magnitude and it improves the precision of the reported abundances. Alternatively, fimpera allows for the reduction of the size of a counting Bloom filter by two orders of magnitude while maintaining the same precision. fimpera does not introduce any memory overhead and may even reduce the query time. AVAILABILITY AND IMPLEMENTATION: https://github.com/lrobidou/fimpera.


Subjects
Algorithms, Software, DNA Sequence Analysis/methods, Metagenomics, High-Throughput Nucleotide Sequencing/methods
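
To make the cAMQ notion in the abstract above concrete, here is a minimal sketch of a plain counting Bloom filter. It is not fimpera itself; the class name, the use of SHA-256 as the hash family, and the parameter values are all illustrative assumptions.

```python
import hashlib


class CountingBloomFilter:
    """Toy counting Bloom filter: an array of counters, several hashes per item."""

    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item):
        # Derive `num_hashes` positions by salting SHA-256 with a seed
        # (an assumption made for simplicity, not a tuned hash family).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def count(self, item):
        # The minimum counter is an upper bound on the true abundance:
        # collisions can inflate it, but it never underestimates.
        return min(self.counters[pos] for pos in self._positions(item))


cbf = CountingBloomFilter(size=1024, num_hashes=3)
for kmer in ["ACGT", "ACGT", "ACGT", "CGTA"]:
    cbf.add(kmer)
print(cbf.count("ACGT"), cbf.count("CGTA"), cbf.count("TTTT"))
```

A query for an item that was never inserted can still return a non-zero count when all of its counters were incremented by other items. That is the false positive rate the abstract refers to.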
3.
Bioinformatics ; 38(24): 5443-5445, 2022 12 13.
Article in English | MEDLINE | ID: mdl-36315078

ABSTRACT

SUMMARY: Genome-wide association studies elucidate links between genotypes and phenotypes. Recent studies highlight the value of conducting such experiments using k-mers as the base signal instead of single-nucleotide polymorphisms. We propose a tool, kmdiff, that performs differential k-mer analyses on large sequencing cohorts in an order of magnitude less time and memory than previously possible. AVAILABILITY AND IMPLEMENTATION: https://github.com/tlemane/kmdiff. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms, Software, DNA Sequence Analysis, Genome-Wide Association Study, Genotype
4.
Bioinformatics ; 38(18): 4423-4425, 2022 09 15.
Article in English | MEDLINE | ID: mdl-35904548

ABSTRACT

SUMMARY: Bioinformatics applications increasingly rely on ad hoc disk storage of k-mer sets, e.g. for de Bruijn graphs or alignment indexes. Here, we introduce the K-mer File Format as a general lossless framework for storing and manipulating k-mer sets, realizing space savings of 3-5× compared to other formats, and bringing interoperability across tools. AVAILABILITY AND IMPLEMENTATION: Format specification, C++/Rust API, tools: https://github.com/Kmer-File-Format/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms, Software, DNA Sequence Analysis, Compact Disks
5.
Bioinformatics ; 36(17): 4568-4575, 2020 11 01.
Article in English | MEDLINE | ID: mdl-32437523

ABSTRACT

MOTIVATION: Studies on structural variants (SVs) are expanding rapidly. As a result, and thanks to third generation sequencing technologies, the number of discovered SVs is increasing, especially in the human genome. At the same time, for several applications such as clinical diagnoses, it is important to genotype newly sequenced individuals on well-defined and characterized SVs. Whereas several SV genotypers have been developed for short read data, there is a lack of dedicated tools to assess whether known SVs are present or not in a new long read sequenced sample, such as those produced by Pacific Biosciences or Oxford Nanopore Technologies. RESULTS: We present a novel method to genotype known SVs from long read sequencing data. The method is based on the generation of a set of representative allele sequences that represent the two alleles of each structural variant. Long reads are aligned to these allele sequences. Alignments are then analyzed and filtered to keep only informative ones, in order to quantify and estimate the presence of each SV allele and the allele frequencies. We provide an implementation of the method, SVJedi, to genotype SVs with long reads. The tool has been applied to both simulated and real human datasets and achieves high genotyping accuracy. We show that SVJedi performs better than other existing long read genotyping tools and we also demonstrate that SV genotyping is considerably improved with SVJedi compared to other approaches, namely SV discovery and short read SV genotyping approaches. AVAILABILITY AND IMPLEMENTATION: https://github.com/llecompte/SVJedi.git. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Human Genome, Software, Genomic Structural Variation, Genotype, High-Throughput Nucleotide Sequencing, Humans, DNA Sequence Analysis
6.
Bioinformatics ; 36(5): 1374-1381, 2020 03 01.
Article in English | MEDLINE | ID: mdl-30785192

ABSTRACT

MOTIVATION: Short-read accuracy is important for downstream analyses such as genome assembly and hybrid long-read correction. Despite much work on short-read correction, present-day correctors either do not scale well on large datasets or consider reads as mere sequences of k-mers, without taking into account their full-length sequence information. RESULTS: We propose a new method to correct short reads using de Bruijn graphs and implement it as a tool called Bcool. As a first step, Bcool constructs a compacted de Bruijn graph from the reads. This graph is filtered on the basis of k-mer abundance and then of unitig abundance, thereby removing most sequencing errors. The cleaned graph is then used as a reference on which the reads are mapped to correct them. We show that this approach yields more accurate reads than k-mer-spectrum correctors while being scalable to human-size genomic datasets and beyond. AVAILABILITY AND IMPLEMENTATION: The implementation is open source, available at http://github.com/Malfoy/BCOOL under the Affero GPL license and as a Bioconda package. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms, High-Throughput Nucleotide Sequencing, Human Genome, Humans, DNA Sequence Analysis, Software
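
The k-mer abundance filtering step described in the abstract above can be sketched as follows. This is a simplified illustration, not Bcool's code: the function name `solid_kmers` and the threshold are hypothetical, reverse complements are ignored, and no graph compaction or unitig filtering is performed.

```python
from collections import Counter


def solid_kmers(reads, k, min_abundance):
    """Return the set of k-mers seen at least `min_abundance` times.

    Rare k-mers are more likely to come from sequencing errors, so
    dropping them removes the corresponding nodes from a de Bruijn graph.
    """
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n >= min_abundance}


reads = ["ACGTA", "ACGTA", "ACGAA"]  # the third read carries a likely error
kept = solid_kmers(reads, k=3, min_abundance=2)
print(sorted(kept))
```

In this toy input, the k-mers `CGA` and `GAA` occur only once and are discarded, while `ACG`, `CGT` and `GTA` survive the filter.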
7.
Bioinformatics ; 36(4): 1275-1276, 2020 02 15.
Article in English | MEDLINE | ID: mdl-31504187

ABSTRACT

MOTIVATION: De novo comparative metagenomics is one of the most straightforward ways to analyze large sets of metagenomic data. The latest methods use the fraction of shared k-mers to estimate genomic similarity between read sets. However, those methods, while extremely efficient, are still limited by computational needs for practical usage outside of large computing facilities. RESULTS: We present SimkaMin, a quick comparative metagenomics tool with low disk and memory footprints, thanks to an efficient data subsampling scheme used to estimate Bray-Curtis and Jaccard dissimilarities. One billion metagenomic reads can be analyzed in <3 min, with tiny memory (1.09 GB) and disk (≈0.3 GB) requirements and without altering the quality of the downstream comparative analyses, making SimkaMin a tool perfectly tailored for very large-scale metagenomic projects. AVAILABILITY AND IMPLEMENTATION: https://github.com/GATB/simka. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Metagenomics, Software, Algorithms, Genomics, Metagenome, DNA Sequence Analysis
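
The two dissimilarities named in the abstract above have simple textbook definitions, sketched here on exact k-mer counts. The function names are hypothetical, and the subsampling scheme that makes SimkaMin fast is not reproduced.

```python
from collections import Counter


def jaccard_dissimilarity(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the sets of distinct k-mers."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    if not union:
        return 0.0  # convention chosen here for two empty samples
    return 1.0 - len(keys_a & keys_b) / len(union)


def bray_curtis_dissimilarity(a, b):
    """1 - 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b)) over k-mer abundances."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # convention chosen here for two empty samples
    shared = sum(min(a[kmer], b[kmer]) for kmer in set(a) & set(b))
    return 1.0 - 2.0 * shared / total


sample_a = Counter({"AAA": 2, "AAC": 1})
sample_b = Counter({"AAA": 1, "ACG": 1})
print(jaccard_dissimilarity(sample_a, sample_b))
print(bray_curtis_dissimilarity(sample_a, sample_b))
```

Jaccard uses only presence or absence of each k-mer, whereas Bray-Curtis also accounts for abundances, so the two can rank pairs of samples differently.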
8.
Nucleic Acids Res ; 47(1): e2, 2019 01 10.
Article in English | MEDLINE | ID: mdl-30260405

ABSTRACT

Long-read sequencing currently provides sequences of several thousand base pairs. It is therefore possible to obtain complete transcripts, offering an unprecedented vision of the cellular transcriptome. However, the literature lacks tools for de novo clustering of such data, in particular for Oxford Nanopore Technologies reads, because of the inherent high error rate compared to short reads. Our goal is to process reads from whole transcriptome sequencing data accurately and without a reference genome in order to reliably group reads coming from the same gene. This de novo approach is therefore particularly suitable for non-model species, but can also serve as a useful pre-processing step to improve read mapping. Our contribution is both a new algorithm adapted to clustering reads by gene and a practical, freely accessible tool that scales to the complete processing of eukaryotic transcriptomes. We sequenced a mouse RNA sample using the MinION device. This dataset is used to compare our solution to other algorithms used in the context of biological clustering. We demonstrate that it is the best approach for transcriptomic long reads. When a reference is available to enable mapping, we show that it stands as an alternative method that predicts complementary clusters.


Subjects
Gene Expression Profiling/methods, Genomics, High-Throughput Nucleotide Sequencing/methods, Transcriptome/genetics, Animals, Genome/genetics, Mice, RNA/genetics, DNA Sequence Analysis
9.
Mol Cancer ; 19(1): 36, 2020 02 25.
Article in English | MEDLINE | ID: mdl-32098627

ABSTRACT

BACKGROUND: The literature reports that mature microRNA (miRNA) can be methylated at adenosine, guanosine and cytosine. However, the molecular mechanisms involved in cytosine methylation of miRNAs have not yet been fully elucidated. Here we investigated the biological role and underlying mechanism of cytosine methylation in miRNAs in glioblastoma multiforme (GBM). METHODS: RNA immunoprecipitation with the anti-5methylcytosine (5mC) antibody followed by Array, ELISA, dot blot, incorporation of a radio-labelled methyl group in miRNA, and miRNA bisulfite sequencing were performed to detect the cytosine methylation in mature miRNA. Cross-linking immunoprecipitation qPCR, transfection with methylated/unmethylated mimic miRNA, luciferase promoter reporter plasmid, Biotin-tagged 3'UTR/mRNA or miRNA experiments and in vivo assays were used to investigate the role of methylated miRNAs. Finally, the prognostic value of methylated miRNAs was analyzed in a cohort of GBM patients. RESULTS: Our study reveals that a significant fraction of miRNAs contains 5mC. Cellular experiments show that methylation of miRNAs at cytosine residues by DNMT3A/AGO4 inhibits the formation of miRNA/mRNA duplexes, leading to the loss of their repressive function towards gene expression. In vivo experiments show, for example, that cytosine methylation abolishes the tumor suppressor function of miRNA-181a-5p. Our study also reveals that cytosine methylation of miRNA-181a-5p is associated with a poor prognosis in GBM patients. CONCLUSION: Together, our results indicate that the DNMT3A/AGO4-mediated cytosine methylation of miRNA negatively.


Subjects
Tumor Biomarkers/genetics, Cytosine/chemistry, DNA Methylation, Glioblastoma/pathology, MicroRNAs/genetics, Animals, Apoptosis, Argonaute Proteins/genetics, Argonaute Proteins/metabolism, Cell Proliferation, DNA (Cytosine-5-)-Methyltransferases/genetics, DNA (Cytosine-5-)-Methyltransferases/metabolism, DNA Methyltransferase 3A, Eukaryotic Initiation Factors/genetics, Eukaryotic Initiation Factors/metabolism, Neoplastic Gene Expression Regulation, Glioblastoma/genetics, Glioblastoma/metabolism, Humans, Mice, Nude Mice, Prognosis, Genetic Promoter Regions, Survival Rate, Cultured Tumor Cells, Xenograft Model Antitumor Assays
10.
Nat Methods ; 14(11): 1063-1071, 2017 Nov.
Article in English | MEDLINE | ID: mdl-28967888

ABSTRACT

Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ∼700 newly sequenced microorganisms and ∼600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.


Subjects
Metagenomics, Software, Algorithms, Benchmarking, DNA Sequence Analysis
11.
BMC Genomics ; 20(1): 56, 2019 Jan 17.
Article in English | MEDLINE | ID: mdl-30654742

ABSTRACT

BACKGROUND: Accurate structural annotation of genomes is still a challenge, despite the progress made over the past decade. The prediction of gene structure remains difficult, especially for eukaryotic species, and is often erroneous and incomplete. We used a proteogenomics strategy, taking advantage of the combination of proteomics datasets and bioinformatics tools, to identify novel protein-coding genes and splice isoforms, assign correct start sites, and validate predicted exons and genes. RESULTS: Our proteogenomics workflow, Peptimapper, was applied to the genome annotation of Ectocarpus sp., a key reference genome for both the brown algal lineage and stramenopiles. We generated proteomics data from various life cycle stages of Ectocarpus sp. strains and sub-cellular fractions using a shotgun approach. First, we directly generated peptide sequence tags (PSTs) from the proteomics data. Second, we mapped PSTs onto the translated genomic sequence. Closely located hits (i.e., PST locations on the genome) were then clustered to detect potential coding regions based on parameters optimized for the organism. Third, we evaluated each cluster and compared it to gene predictions from existing conventional genome annotation approaches. Finally, we integrated cluster locations into GFF files for use in a genome viewer. We identified two potential novel genes, a ribosomal protein L22 and an aryl sulfotransferase, and corrected the gene structure of a dihydrolipoamide acetyltransferase. We experimentally validated the results by RT-PCR and using transcriptomics data. CONCLUSIONS: Peptimapper is a complementary tool for the expert annotation of genomes. It is suitable for any organism and is distributed through a Docker image available on two public bioinformatics Docker repositories: Docker Hub and BioShaDock. This workflow is also accessible through the Galaxy framework and for use by non-computer scientists at https://galaxy.protim.eu.
Data are available via ProteomeXchange under identifier PXD010618.


Subjects
Eukaryota/genetics, Genome, Molecular Sequence Annotation, Proteogenomics/methods, Software, Workflow, Amino Acid Sequence, Codon/genetics, Mass Spectrometry, Peptides/chemistry, Peptides/metabolism, Reproducibility of Results
12.
Nucleic Acids Res ; 43(2): e11, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25404127

ABSTRACT

Detecting single nucleotide polymorphisms (SNPs) between genomes is becoming a routine task with next-generation sequencing. Generally, SNP detection methods use a reference genome. As non-model organisms are increasingly investigated, the need for reference-free methods has been amplified. Most of the existing reference-free methods have fundamental limitations: they can only call SNPs between exactly two datasets, and/or they require a prohibitive amount of computational resources. The method we propose, discoSnp, detects both heterozygous and homozygous isolated SNPs from any number of read datasets, without a reference genome, and with very low memory and time footprints (billions of reads can be analyzed with a standard desktop computer). To facilitate downstream genotyping analyses, discoSnp ranks predictions and outputs quality and coverage per allele. Compared to finding isolated SNPs using a state-of-the-art assembly and mapping approach, discoSnp requires significantly less computational resources, shows similar precision/recall values, and highly ranked predictions are less likely to be false positives. An experimental validation was conducted on an arthropod species (the tick Ixodes ricinus) on which de novo sequencing was performed. Among the predicted SNPs that were tested, 96% were successfully genotyped and truly exhibited polymorphism.


Subjects
Genotyping Techniques/methods, Single Nucleotide Polymorphism, Algorithms, Animals, Human Chromosome Pair 1, Escherichia coli/genetics, Genomics/methods, Humans, Ixodes/genetics, Mice, Inbred C57BL Mice, Saccharomyces cerevisiae/genetics
13.
BMC Bioinformatics ; 17(1): 237, 2016 Jun 16.
Article in English | MEDLINE | ID: mdl-27306641

ABSTRACT

BACKGROUND: Next Generation Sequencing (NGS) has dramatically enhanced our ability to sequence genomes, but not to assemble them. In practice, many published genome sequences remain in the state of a large set of contigs. Each contig describes the sequence found along some path of the assembly graph; however, the set of contigs does not record all the sequence information contained in that graph. Although many subsequent analyses can be performed with the set of contigs, one may ask whether mapping reads on the contigs is as informative as mapping them on the paths of the assembly graph. Currently, practical tools to perform mapping on such graphs are lacking. RESULTS: Here, we propose a formal definition of mapping on a de Bruijn graph, analyse the problem complexity, which turns out to be NP-complete, and provide a practical solution. We propose a pipeline called GGMAP (Greedy Graph MAPping). Its novelty is a procedure to map reads on branching paths of the graph, for which we designed a heuristic algorithm called BGREAT (de Bruijn Graph REAd mapping Tool). For the sake of efficiency, BGREAT rewrites a read sequence as a succession of unitig sequences. GGMAP can map millions of reads per CPU hour on a de Bruijn graph built from a large set of human genomic reads. Surprisingly, results show that up to 22% more reads can be mapped on the graph but not on the contig set. CONCLUSIONS: Although mapping reads on a de Bruijn graph is a complex task, our proposal offers a practical solution combining efficiency with an improved mapping capacity compared to assembly-based mapping even for complex eukaryotic data.


Subjects
Escherichia coli/genetics, Human Genome, Genomics/methods, Algorithms, Contig Mapping, High-Throughput Nucleotide Sequencing, Humans, DNA Sequence Analysis
14.
BMC Genomics ; 17: 121, 2016 Feb 18.
Article in English | MEDLINE | ID: mdl-26892170

ABSTRACT

BACKGROUND: Progress in genetics and breeding in pea still suffers from the limited availability of molecular resources. SNP markers that can be identified through affordable sequencing processes, without the need for prior genome reduction or a reference genome to assemble sequencing data would allow the discovery and genetic mapping of thousands of molecular markers. Such an approach could significantly speed up genetic studies and marker assisted breeding for non-model species. RESULTS: A total of 419,024 SNPs were discovered using HiSeq whole genome sequencing of four pea lines, followed by direct identification of SNP markers without assembly using the discoSnp tool. Subsequent filtering led to the identification of 131,850 highly designable SNPs, polymorphic between at least two of the four pea lines. A subset of 64,754 SNPs was called and genotyped by short read sequencing on a subpopulation of 48 RILs from the cross 'Baccara' x 'PI180693'. This data was used to construct a WGGBS-derived pea genetic map comprising 64,263 markers. This map is collinear with previous pea consensus maps and therefore with the Medicago truncatula genome. Sequencing of four additional pea lines showed that 33 % to 64 % of the mapped SNPs, depending on the pairs of lines considered, are polymorphic and can therefore be useful in other crosses. The subsequent genotyping of a subset of 1000 SNPs, chosen for their mapping positions using a KASP™ assay, showed that almost all generated SNPs are highly designable and that most (95 %) deliver highly qualitative genotyping results. Using rather low sequencing coverages in SNP discovery and in SNP inferring did not hinder the identification of hundreds of thousands of high quality SNPs. CONCLUSIONS: The development and optimization of appropriate tools in SNP discovery and genetic mapping have allowed us to make available a massive new genomic resource in pea. 
It will be useful for both fine mapping within chosen QTL confidence intervals and marker assisted breeding for important traits in pea improvement.


Subjects
Chromosome Mapping, Plant Genome, Pisum sativum/genetics, Single Nucleotide Polymorphism, Plant DNA/genetics, Genetic Markers, Genotype, Genotyping Techniques, DNA Sequence Analysis
16.
BMC Genomics ; 16: 296, 2015 Apr 15.
Article in English | MEDLINE | ID: mdl-25886522

ABSTRACT

BACKGROUND: Propionibacterium freudenreichii (PF) is an actinobacterium used in cheese technology and for its probiotic properties. PF is also extremely adaptable to several ecological niches and can grow on a variety of carbon and nitrogen sources. The aim of this work was to discover the genetic basis for strain-dependent traits related to its ability to use specific carbon sources. High-throughput sequencing technologies were ideal for this purpose as they have the potential to decipher genomic diversity at a moderate cost. RESULTS: 21 strains of PF were sequenced and the genomes were assembled de novo. Scaffolds were ordered by comparison with the complete reference genome CIRM-BIA1, obtained previously using traditional Sanger sequencing. Automatic functional annotation and manual curation were performed. Each gene was attributed to either the core genome or an accessory genome. The ability of the 21 strains to degrade 50 different sugars was evaluated. Thirty-three sugars were degraded by none of the sequenced strains whereas eight sugars were degraded by all of them. The corresponding genes were present in the core genome. Lactose, melibiose and xylitol were only used by some strains. In this case, the presence/absence of genes responsible for carbon uptake and degradation correlated well with the phenotypes, with the exception of xylitol. Furthermore, the simultaneous presence of these genes was in line with the metabolic pathways described previously in other species. We also considered the genetic origin (transduction, rearrangement) of the corresponding genomic islands. Ribose and gluconate were degraded to a greater or lesser extent (quantitative phenotype) by some strains. For these sugars, the phenotypes could not be explained by the presence/absence of a gene but correlated with the premature appearance of a stop codon interrupting protein synthesis and preventing the catabolism of corresponding carbon sources.
CONCLUSION: These results illustrate (i) the power of correlation studies to discover the genetic basis of binary strain-dependent traits, and (ii) the plasticity of PF chromosomes, probably resulting from horizontal transfers, duplications, transpositions and an accumulation of mutations. Knowledge of the genetic basis of nitrogen and sugar degradation opens up new strategies for the screening of PF strain collections to enable optimum cheese starter, probiotic and white biotechnology applications.


Subjects
Carbohydrate Metabolism/genetics, Bacterial Genome, Genomic Islands/genetics, Propionibacterium/genetics, Cheese/microbiology, Bacterial DNA/analysis, Bacterial DNA/isolation & purification, Bacterial DNA/metabolism, High-Throughput Nucleotide Sequencing, Metabolic Networks and Pathways/genetics, Mutation, Nitrates/metabolism, Phenotype, Phylogeny, Propionibacterium/classification, DNA Sequence Analysis, Species Specificity
17.
Bioinformatics ; 30(20): 2959-61, 2014 Oct 15.
Article in English | MEDLINE | ID: mdl-24990603

ABSTRACT

MOTIVATION: Efficient and fast next-generation sequencing (NGS) algorithms are essential to analyze the terabytes of data generated by the NGS machines. A serious bottleneck can be the design of such algorithms, as they require sophisticated data structures and advanced hardware implementation. RESULTS: We propose an open-source library dedicated to genome assembly and analysis to speed up the process of developing efficient software. The library is based on a recent optimized de Bruijn graph implementation allowing complex genomes to be processed on desktop computers using fast algorithms with low memory footprints. AVAILABILITY AND IMPLEMENTATION: The GATB library is written in C++ and is available at the following Web site http://gatb.inria.fr under the A-GPL license. CONTACT: lavenier@irisa.fr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Biostatistics/methods, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Software, Algorithms, Computer Graphics, Human Genome/genetics, Humans
18.
Nat Comput Sci ; 4(2): 104-109, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38413777

ABSTRACT

Public sequencing databases contain vast amounts of biological information, yet they are largely underutilized as it is challenging to efficiently search them for any sequence(s) of interest. We present kmindex, an approach that can index thousands of metagenomes and perform sequence searches in a fraction of a second. The index construction is an order of magnitude faster than previous methods, while search times are two orders of magnitude faster. With negligible false positive rates below 0.01%, kmindex outperforms the precision of existing approaches by four orders of magnitude. Here we demonstrate the scalability of kmindex by successfully indexing 1,393 marine seawater metagenome samples from the Tara Oceans project. Additionally, we introduce the publicly accessible web server Ocean Read Atlas, which enables real-time queries on the Tara Oceans dataset.


Subjects
Genomics, Seawater, Oceans and Seas, Metagenome/genetics, Nucleic Acid Databases
19.
BMC Bioinformatics ; 13: 48, 2012 Mar 23.
Article in English | MEDLINE | ID: mdl-22443449

ABSTRACT

BACKGROUND: The analysis of next-generation sequencing data from large genomes is a timely research topic. Sequencers are producing billions of short sequence fragments from newly sequenced organisms. Computational methods for reconstructing whole genomes/transcriptomes (de novo assemblers) are typically employed to process such data. However, these methods require large memory resources and computation time. Many basic biological questions could be answered by targeting specific information in the reads, thus avoiding complete assembly. RESULTS: We present Mapsembler, an iterative micro and targeted assembler which processes large datasets of reads on commodity hardware. Mapsembler checks for the presence of given regions of interest that can be constructed from reads and builds a short assembly around each of them, either as a plain sequence or as a graph, showing contextual structure. We introduce new algorithms to retrieve approximate occurrences of a sequence from reads and construct an extension graph. Among other results presented in this paper, Mapsembler made it possible to retrieve previously described human breast cancer candidate fusion genes, and to detect new ones not previously known. CONCLUSIONS: Mapsembler is the first software that enables de novo discovery around a region of interest of repeats, SNPs, exon skipping, gene fusion, as well as other structural events, directly from raw sequencing reads. As indexing is localized, the memory footprint of Mapsembler is negligible. Mapsembler is released under the CeCILL license and can be freely downloaded from http://alcovna.genouest.org/mapsembler/.


Subjects
Algorithms, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, DNA Sequence Analysis/methods, Software, Breast Neoplasms/genetics, Computers, Female, Genome, Humans, Oncogene Fusion/genetics, Single Nucleotide Polymorphism, Transcriptome
20.
BMC Bioinformatics ; 13 Suppl 19: S10, 2012.
Article in English | MEDLINE | ID: mdl-23282463

ABSTRACT

BACKGROUND: Nowadays, metagenomic sample analyses are mainly achieved by comparing them with a priori knowledge stored in data banks. While powerful, such approaches do not allow the exploitation of unknown and/or "unculturable" species, for instance estimated at 99% for Bacteria. METHODS: This work introduces Compareads, a de novo comparative metagenomic approach that returns the reads that are similar between two possibly metagenomic datasets generated by High Throughput Sequencers. One original feature of this work is its ability to deal with huge datasets. The second main contribution presented in this paper is the design of a probabilistic data structure based on Bloom filters, which makes it possible to index millions of reads with a limited memory footprint and a controlled error rate. RESULTS: We show that Compareads makes it possible to retrieve biological information while scaling to huge datasets. Its time and memory features make Compareads usable on read sets each composed of more than 100 million Illumina reads in a few hours and consuming 4 GB of memory, and thus usable on today's personal computers. CONCLUSION: Using a new data structure, Compareads is a practical solution for comparing huge metagenomic samples de novo. Compareads is released under the CeCILL license and can be freely downloaded from http://alcovna.genouest.org/compareads/.


Subjects
Information Storage and Retrieval/methods, Metagenomics/methods, Algorithms