Results 1 - 20 of 37
1.
Nat Methods ; 19(4): 429-440, 2022 Apr.
Article in English | MEDLINE | ID: mdl-35396482

ABSTRACT

Evaluating metagenomic software is key for optimizing metagenome interpretation and is the focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains were still challenging for assembly and genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses.


Subjects
Metagenome , Metagenomics , Archaea/genetics , Metagenomics/methods , Reproducibility of Results , DNA Sequence Analysis , Software
2.
Bioinformatics ; 39(39 Suppl 1): i270-i278, 2023 Jun 30.
Article in English | MEDLINE | ID: mdl-37387169

ABSTRACT

MOTIVATION: Structural variation (SV) is a class of genetic diversity whose importance is increasingly revealed by genome resequencing, especially with long-read technologies. One crucial problem when analyzing and comparing SVs in several individuals is their accurate genotyping, that is, determining whether a described SV is present or absent in one sequenced individual, and if present, in how many copies. There are only a few methods dedicated to SV genotyping with long-read data, and all either suffer from a bias toward the reference allele by not representing all alleles equally, or have difficulties genotyping close or overlapping SVs due to a linear representation of the alleles. RESULTS: We present SVJedi-graph, a novel method for SV genotyping that relies on a variation graph to represent all alleles of a set of SVs in a single data structure. The long reads are mapped on the variation graph and the resulting alignments that cover allele-specific edges in the graph are used to estimate the most likely genotype for each SV. Running SVJedi-graph on simulated sets of close and overlapping deletions showed that this graph model prevents the bias toward the reference alleles and maintains high genotyping accuracy whatever the SV proximity, unlike other state-of-the-art genotypers. On the human gold standard HG002 dataset, SVJedi-graph achieved the best performance, genotyping 99.5% of the high confidence SV callset with an accuracy of 95% in less than 30 min. AVAILABILITY AND IMPLEMENTATION: SVJedi-graph is distributed under an AGPL license and available on GitHub at https://github.com/SandraLouise/SVJedi-graph and as a BioConda package.


Subjects
Genotype , Humans , Alleles , DNA Sequence Analysis
3.
BMC Bioinformatics ; 24(1): 284, 2023 Jul 14.
Article in English | MEDLINE | ID: mdl-37452278

ABSTRACT

BACKGROUND: Local assembly with short and long reads has proven to be very useful in many applications: reconstruction of the sequence of a locus of interest, gap-filling in draft assemblies, as well as alternative allele reconstruction of large Structural Variants. Whereas linked-read technologies have a great potential to assemble specific loci as they provide long-range information while maintaining the power and accuracy of short-read sequencing, there is a lack of local assembly tools for linked-read data. RESULTS: We present MTG-Link, a novel local assembly tool dedicated to linked-reads. The originality of the method lies in its read subsampling step, which takes advantage of the barcode information contained in linked-reads mapped in flanking regions. We validated our approach on several datasets from different linked-read technologies. We show that MTG-Link successfully assembles large sequences of up to dozens of kb. We also demonstrate that the read subsampling step of MTG-Link considerably improves the local assembly of specific loci compared to other existing short-read local assembly tools. Furthermore, MTG-Link was able to fully characterize large insertion variants and deletion breakpoints in a human genome and to reconstruct dark regions in clinically relevant human genes. It also improved the contiguity of a 1.3 Mb locus of biological interest in several individual genomes of the mimetic butterfly Heliconius numata. CONCLUSIONS: MTG-Link is an efficient local assembly tool designed for different linked-read sequencing technologies. MTG-Link source code is available at https://github.com/anne-gcd/MTG-Link and as a Bioconda package.


Subjects
High-Throughput Nucleotide Sequencing , Software , Humans , DNA Sequence Analysis/methods , High-Throughput Nucleotide Sequencing/methods , Human Genome
4.
Bioinformatics ; 36(17): 4568-4575, 2020 Nov 01.
Article in English | MEDLINE | ID: mdl-32437523

ABSTRACT

MOTIVATION: Studies on structural variants (SVs) are expanding rapidly. As a result, and thanks to third generation sequencing technologies, the number of discovered SVs is increasing, especially in the human genome. At the same time, for several applications such as clinical diagnoses, it is important to genotype newly sequenced individuals on well-defined and characterized SVs. Whereas several SV genotypers have been developed for short read data, there is a lack of dedicated tools to assess whether known SVs are present or not in a new long read sequenced sample, such as those produced by Pacific Biosciences or Oxford Nanopore Technologies. RESULTS: We present a novel method to genotype known SVs from long read sequencing data. The method is based on the generation of a set of representative allele sequences that represent the two alleles of each structural variant. Long reads are aligned to these allele sequences. Alignments are then analyzed and filtered to keep only informative ones, to quantify and estimate the presence of each SV allele and the allele frequencies. We provide an implementation of the method, SVJedi, to genotype SVs with long reads. The tool has been applied to both simulated and real human datasets and achieves high genotyping accuracy. We show that SVJedi performs better than other existing long read genotyping tools and we also demonstrate that SV genotyping is considerably improved with SVJedi compared to other approaches, namely SV discovery and short read SV genotyping approaches. AVAILABILITY AND IMPLEMENTATION: https://github.com/llecompte/SVJedi.git. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Human Genome , Software , Genomic Structural Variation , Genotype , High-Throughput Nucleotide Sequencing , Humans , DNA Sequence Analysis
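
The genotyping step described in the SVJedi abstract above (counting informative alignments per allele, then estimating the genotype) can be illustrated with a minimal sketch. It assumes a simple binomial model with a fixed per-alignment error rate; the function names, the `err` parameter, and the model itself are illustrative assumptions and are not taken from the SVJedi implementation.

```python
from math import comb, log


def genotype_log_likelihoods(ref_count: int, alt_count: int, err: float = 0.05) -> dict:
    """Log-likelihood of each diploid genotype given allele-supporting alignment counts.

    Assumption (hypothetical model): each informative alignment supports the
    alternative allele with probability err (0/0), 0.5 (0/1) or 1 - err (1/1).
    """
    n = ref_count + alt_count
    p_alt = {"0/0": err, "0/1": 0.5, "1/1": 1.0 - err}
    return {
        gt: log(comb(n, alt_count)) + alt_count * log(p) + ref_count * log(1.0 - p)
        for gt, p in p_alt.items()
    }


def call_genotype(ref_count: int, alt_count: int, err: float = 0.05) -> str:
    """Return the genotype with the highest log-likelihood."""
    ll = genotype_log_likelihoods(ref_count, alt_count, err)
    return max(ll, key=ll.get)
```

With balanced support for both alleles the heterozygous genotype wins; with nearly all alignments on one allele the corresponding homozygous genotype wins.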
5.
Bioinformatics ; 36(4): 1275-1276, 2020 Feb 15.
Article in English | MEDLINE | ID: mdl-31504187

ABSTRACT

MOTIVATION: De novo comparative metagenomics is one of the most straightforward ways to analyze large sets of metagenomic data. The latest methods use the fraction of shared k-mers to estimate genomic similarity between read sets. However, those methods, while extremely efficient, are still limited by computational needs for practical usage outside of large computing facilities. RESULTS: We present SimkaMin, a quick comparative metagenomics tool with low disk and memory footprints, thanks to an efficient data subsampling scheme used to estimate Bray-Curtis and Jaccard dissimilarities. One billion metagenomic reads can be analyzed in <3 min, with tiny memory (1.09 GB) and disk (≈0.3 GB) requirements and without altering the quality of the downstream comparative analyses, making SimkaMin a tool well suited to very large-scale metagenomic projects. AVAILABILITY AND IMPLEMENTATION: https://github.com/GATB/simka. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Metagenomics , Software , Algorithms , Genomics , Metagenome , DNA Sequence Analysis
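
To make the idea in the SimkaMin abstract above concrete, here is a small sketch of estimating Bray-Curtis and Jaccard dissimilarities from a subsample of k-mers. The hash-threshold subsampling used here is an assumption chosen for illustration (it keeps the same k-mers in every sample); the abstract does not specify the actual scheme, and all function names are hypothetical.

```python
import hashlib
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def subsample(counts, fraction):
    """Keep only k-mers whose 64-bit hash falls in the lowest `fraction` of the hash space."""
    limit = int(fraction * 2**64)
    kept = {}
    for kmer, count in counts.items():
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        if int.from_bytes(digest, "big") < limit:
            kept[kmer] = count
    return kept


def jaccard_dissimilarity(a, b):
    """1 - |shared distinct k-mers| / |all distinct k-mers|."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return 1.0 - len(a.keys() & b.keys()) / len(union)


def bray_curtis_dissimilarity(a, b):
    """1 - 2 * sum(min counts over shared k-mers) / (total counts of both samples)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / total
```

Because the hash of a k-mer does not depend on the sample, the same k-mers are retained everywhere, so dissimilarities computed on `subsample(counts, fraction)` approximate those computed on the full counts.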
6.
BMC Biol ; 18(1): 89, 2020 Jul 24.
Article in English | MEDLINE | ID: mdl-32703219

ABSTRACT

BACKGROUND: Polydnaviruses (PDVs) are mutualistic endogenous viruses inoculated by some lineages of parasitoid wasps into their hosts, where they facilitate successful wasp development. PDVs include the ichnoviruses and bracoviruses that originate from independent viral acquisitions in ichneumonid and braconid wasps respectively. PDV genomes are fully incorporated into the wasp genomes and consist of (1) genes involved in viral particle production, which derive from the viral ancestor and are not encapsidated, and (2) proviral segments harboring virulence genes, which are packaged into the viral particle. To help elucidate the mechanisms that have facilitated viral domestication in ichneumonid wasps, we analyzed the structure of the viral insertions by sequencing the whole genome of two ichnovirus-carrying wasp species, Hyposoter didymator and Campoletis sonorensis. RESULTS: Assemblies with long scaffold sizes allowed us to unravel the organization of the endogenous ichnovirus and revealed considerable dispersion of the viral loci within the wasp genomes. Proviral segments contained species-specific sets of genes and occupied distinct genomic locations in the two ichneumonid wasps. In contrast, viral machinery genes were organized in clusters showing highly conserved gene content and order, with some loci located in collinear wasp genomic regions. This genomic architecture clearly differs from the organization of PDVs in braconid wasps, in which proviral segments are clustered and viral machinery elements are more dispersed. CONCLUSIONS: The contrasting structures of the two types of ichnovirus genomic elements are consistent with their different functions: proviral segments are vehicles for virulence proteins expected to adapt according to different host defense systems, whereas the genes involved in virus particle production in the wasp are likely more stable and may reflect ancestral viral architecture. The distinct genomic architectures seen in ichnoviruses versus bracoviruses reveal different evolutionary trajectories that have led to virus domestication in the two wasp lineages.


Subjects
Molecular Evolution , Viral Genome , Host Microbial Interactions , Polydnaviridae/genetics , Wasps/virology , Animals , Species Specificity , Whole Genome Sequencing
7.
BMC Genomics ; 21(1): 762, 2020 Nov 04.
Article in English | MEDLINE | ID: mdl-33148192

ABSTRACT

BACKGROUND: Since 2009, numerous tools have been developed to detect structural variants using short read technologies. Insertions >50 bp are one of the hardest types to discover and are drastically underrepresented in gold standard variant callsets. The advent of long read technologies has completely changed the situation. In 2019, two independent cross-technology studies published the most complete variant callsets with sequence-resolved insertions in human individuals. Among the reported insertions, only 17 to 28% could be discovered with short-read based tools. RESULTS: In this work, we performed an in-depth analysis of these unprecedented insertion callsets in order to investigate the causes of such failures. We first established a precise classification of insertion variants according to four layers of characterization: the nature and size of the inserted sequence, the genomic context of the insertion site and the breakpoint junction complexity. Because these levels are intertwined, we then used simulations to characterize the impact of each complexity factor on the recall of several structural variant callers. We showed that most reported insertions exhibited characteristics that may interfere with their discovery: 63% were tandem repeat expansions, 38% contained homology larger than 10 bp within their breakpoint junctions and 70% were located in simple repeats. Consequently, the recall of short-read based variant callers was significantly lower for such insertions (6% for tandem repeats vs 56% for mobile element insertions). Simulations showed that the most influential factor was the insertion type rather than the genomic context, with various difficulties being handled differently among the tested structural variant callers, and they highlighted the lack of sequence resolution for most insertion calls.
CONCLUSIONS: Our results explain the low recall by pointing out several difficulty factors among the observed insertion features and provide avenues for improving SV caller algorithms and their combinations.


Subjects
Genome , Genomics , Algorithms , Base Sequence , Humans , Sequence Analysis , DNA Sequence Analysis
8.
Nat Methods ; 14(11): 1063-1071, 2017 Nov.
Article in English | MEDLINE | ID: mdl-28967888

ABSTRACT

Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ∼700 newly sequenced microorganisms and ∼600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.


Subjects
Metagenomics , Software , Algorithms , Benchmarking , DNA Sequence Analysis
9.
Mol Ecol ; 29(7): 1328-1343, 2020 Apr.
Article in English | MEDLINE | ID: mdl-32145112

ABSTRACT

Hybrid zones, whereby divergent lineages come into contact and eventually hybridize, can provide insights on the mechanisms involved in population differentiation and reproductive isolation, and ultimately speciation. Suture zones offer the opportunity to compare these processes across multiple species. In this paper we use reduced-complexity genomic data to compare the genetic and phenotypic structure and hybridization patterns of two mimetic butterfly species, Ithomia salapia and Oleria onega (Nymphalidae: Ithomiini), each consisting of a pair of lineages differentiated for their wing colour pattern and that come into contact in the Andean foothills of Peru. Despite similarities in their life history, we highlight major differences, both at the genomic and phenotypic level, between the two species. These differences include the presence of hybrids, variations in wing phenotype, and genomic patterns of introgression and differentiation. In I. salapia, the two lineages appear to hybridize only rarely, whereas in O. onega the hybrids are not only more common, but also genetically and phenotypically more variable. We also detected loci statistically associated with wing colour pattern variation, but in both species these loci were not over-represented among the candidate barrier loci, suggesting that traits other than wing colour pattern may be important for reproductive isolation. Our results contrast with the genomic patterns observed between hybridizing lineages in the mimetic Heliconius butterflies, and call for a broader investigation into the genomics of speciation in Ithomiini - the largest radiation of mimetic butterflies.


Subjects
Butterflies/genetics , Population Genetics , Genetic Hybridization , Animals , Butterflies/classification , Genetic Speciation , Insect Genome , Genotype , Peru , Phenotype , Single Nucleotide Polymorphism , Reproductive Isolation , Animal Wings/anatomy & histology
10.
Mol Ecol ; 2018 Jul 16.
Article in English | MEDLINE | ID: mdl-30010213

ABSTRACT

Identifying the genomic bases of adaptation to novel environments is a long-term objective in evolutionary biology. Because genetic differentiation is expected to increase between locally adapted populations at the genes targeted by selection, scanning the genome for elevated levels of differentiation is a first step towards deciphering the genomic architecture underlying adaptive divergence. The pea aphid Acyrthosiphon pisum is a model of choice to address this question, as it forms a large complex of plant-specialized races and cryptic species, resulting from recent adaptive radiation. Here, we characterized genomewide polymorphisms in three pea aphid races specialized on alfalfa, clover and pea crops, respectively, which we sequenced in pools (poolseq). Using a model-based approach that explicitly accounts for selection, we identified 392 genomic hotspots of differentiation spanning 47.3 Mb and 2,484 genes (respectively, 9.12% of the genome size and 8.10% of its genes). Most of these highly differentiated regions were located on the autosomes, and overall differentiation was weaker on the X chromosome. Within these hotspots, high levels of absolute divergence between races suggest that these regions experienced less gene flow than the rest of the genome, most likely by contributing to reproductive isolation. Moreover, population-specific analyses showed evidence of selection in every host race, depending on the hotspot considered. These hotspots were significantly enriched for candidate gene categories that control host-plant selection and use. These genes encode 48 salivary proteins, 14 gustatory receptors, 10 odorant receptors, five P450 cytochromes and one chemosensory protein, which represent promising candidates for the genetic basis of host-plant specialization and ecological isolation in the pea aphid complex. Altogether, our findings open new research directions towards functional studies, for validating the role of these genes on adaptive phenotypes.

11.
Nucleic Acids Res ; 43(2): e11, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25404127

ABSTRACT

Detecting single nucleotide polymorphisms (SNPs) between genomes is becoming a routine task with next-generation sequencing. Generally, SNP detection methods use a reference genome. As non-model organisms are increasingly investigated, the need for reference-free methods has been amplified. Most of the existing reference-free methods have fundamental limitations: they can only call SNPs between exactly two datasets, and/or they require a prohibitive amount of computational resources. The method we propose, discoSnp, detects both heterozygous and homozygous isolated SNPs from any number of read datasets, without a reference genome, and with very low memory and time footprints (billions of reads can be analyzed with a standard desktop computer). To facilitate downstream genotyping analyses, discoSnp ranks predictions and outputs quality and coverage per allele. Compared to finding isolated SNPs using a state-of-the-art assembly and mapping approach, discoSnp requires significantly less computational resources, shows similar precision/recall values, and highly ranked predictions are less likely to be false positives. An experimental validation was conducted on an arthropod species (the tick Ixodes ricinus) on which de novo sequencing was performed. Among the predicted SNPs that were tested, 96% were successfully genotyped and truly exhibited polymorphism.


Subjects
Genotyping Techniques/methods , Single Nucleotide Polymorphism , Algorithms , Animals , Human Chromosomes Pair 1 , Escherichia coli/genetics , Genomics/methods , Humans , Ixodes/genetics , Mice , Inbred C57BL Mice , Saccharomyces cerevisiae/genetics
12.
BMC Bioinformatics ; 16: 288, 2015 Sep 14.
Article in English | MEDLINE | ID: mdl-26370285

ABSTRACT

BACKGROUND: Data volumes generated by next-generation sequencing (NGS) technologies are now a major concern for both data storage and transmission. This has triggered the need for more efficient methods than general purpose compression tools, such as the widely used gzip method. RESULTS: We present a novel reference-free method meant to compress data issued from high throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn Graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring k-mer and a list of bifurcations. The same probabilistic de Bruijn Graph is used to perform a lossy transformation of the quality scores, which yields higher compression rates without losing pertinent information for downstream analyses. CONCLUSIONS: LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq or metagenomics). In all cases, LEON showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole genome sequencing dataset, LEON divided the original file size by more than 20. LEON is open-source software, distributed under the GNU Affero GPL license, available for download at http://gatb.inria.fr/software/leon/.


Subjects
Algorithms , Caenorhabditis elegans Proteins/genetics , Caenorhabditis elegans/genetics , Computer Graphics , Data Compression/methods , High-Throughput Nucleotide Sequencing/methods , Software , Animals , Computational Biology/methods , Computer Simulation , Metagenomics , Probability
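
The read encoding described in the LEON abstract above (an anchoring k-mer plus a list of bifurcations along a de Bruijn graph path) can be sketched as follows. This is only an illustration under stated assumptions: an exact Python `set` stands in for the Bloom filter, the anchor is simply the first k-mer of the read, and the function names are hypothetical rather than LEON's own.

```python
def build_kmer_set(reads, k):
    """Exact k-mer set standing in for the probabilistic de Bruijn graph (assumption)."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def successors(kmer, kmers):
    """All k-mers of the graph that extend `kmer` by one nucleotide."""
    return [kmer[1:] + n for n in "ACGT" if kmer[1:] + n in kmers]


def encode(read, kmers, k):
    """Return (anchor k-mer, read length, nucleotides chosen at bifurcations)."""
    anchor, choices = read[:k], []
    for i in range(k, len(read)):
        # A nucleotide is stored only where the path is ambiguous.
        if len(successors(read[i - k:i], kmers)) != 1:
            choices.append(read[i])
    return anchor, len(read), choices


def decode(anchor, length, choices, kmers):
    """Rebuild the read by walking the graph from the anchor."""
    read, pending, k = anchor, iter(choices), len(anchor)
    while len(read) < length:
        nxt = successors(read[-k:], kmers)
        read += nxt[0][-1] if len(nxt) == 1 else next(pending)
    return read
```

Wherever the graph offers a single successor nothing is stored, which is why reads sharing long unambiguous paths compress well. A Bloom filter may add false-positive successors, but the round trip should still hold as long as the same filter is used for encoding and decoding.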
13.
Bioinformatics ; 30(24): 3451-7, 2014 Dec 15.
Article in English | MEDLINE | ID: mdl-25123898

ABSTRACT

MOTIVATION: Insertions play an important role in genome evolution. However, such variants are difficult to detect from short-read sequencing data, especially when they exceed the paired-end insert size. Many approaches have been proposed to call short insertion variants based on paired-end mapping. However, there remains a lack of practical methods to detect and assemble long variants. RESULTS: We propose here an original method, called MindTheGap, for the integrated detection and assembly of insertion variants from re-sequencing data. Importantly, it is designed to call insertions of any size, whether they are novel or duplicated, homozygous or heterozygous in the donor genome. MindTheGap uses an efficient k-mer-based method to detect insertion sites in a reference genome, and subsequently assemble them from the donor reads. MindTheGap showed high recall and precision on simulated datasets of various genome complexities. When applied to real Caenorhabditis elegans and human NA12878 datasets, MindTheGap detected and correctly assembled insertions >1 kb, using at most 14 GB of memory.


Subjects
Insertional Mutagenesis , DNA Sequence Analysis/methods , Animals , Caenorhabditis elegans/genetics , Genome , Humans , Software
14.
Bioinformatics ; 30(20): 2959-61, 2014 Oct 15.
Article in English | MEDLINE | ID: mdl-24990603

ABSTRACT

MOTIVATION: Efficient and fast next-generation sequencing (NGS) algorithms are essential to analyze the terabytes of data generated by the NGS machines. A serious bottleneck can be the design of such algorithms, as they require sophisticated data structures and advanced hardware implementation. RESULTS: We propose an open-source library dedicated to genome assembly and analysis to speed up the process of developing efficient software. The library is based on a recent optimized de Bruijn graph implementation allowing complex genomes to be processed on desktop computers using fast algorithms with low memory footprints. AVAILABILITY AND IMPLEMENTATION: The GATB library is written in C++ and is available at the following website, http://gatb.inria.fr, under the A-GPL license. CONTACT: lavenier@irisa.fr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Biostatistics/methods , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Software , Algorithms , Computer Graphics , Human Genome/genetics , Humans
15.
Genetica ; 143(3): 305-16, 2015 Jun.
Article in English | MEDLINE | ID: mdl-25694156

ABSTRACT

The moth Spodoptera frugiperda is a well-known pest of crops throughout the Americas, which consists of two strains adapted to different host-plants: the first feeds preferentially on corn, cotton and sorghum whereas the second is more associated with rice and several pasture grasses. Though morphologically indistinguishable, they exhibit differences in their mating behavior and pheromone compositions, and show developmental variability according to the host-plant. Although these differences suggest that the two strains are different species, this issue is still highly controversial because hybrids naturally occur in the wild, not to mention the discrepancies among published results concerning mating success between the two strains. In order to clarify the status of the two host-plant strains of S. frugiperda, we analyze features that possibly reflect the level of post-zygotic isolation: (1) first generation (F1) hybrid lethality and sterility; (2) patterns of meiotic segregation of hybrids in reciprocal second generation (F2), as compared to the meiosis of the two parental strains. We found a significant reduction of mating success in F1 in one direction of the cross and a high level of microsatellite markers showing transmission ratio distortion in the F2 progeny. Our results support the existence of post-zygotic reproductive isolation between the two laboratory strains and are in accordance with the marked level of genetic differentiation that was recovered between individuals of the two strains collected from the field. Altogether these results provide additional evidence in favor of a sibling species status for the two strains.


Subjects
Genetic Crosses , Host Specificity , Spodoptera/classification , Animals , Female , Fertility/genetics , Genetic Markers , Genotyping Techniques , Genetic Hybridization , Male , Microsatellite Repeats , Oryza , Spodoptera/genetics , Zea mays
16.
Genome Biol Evol ; 16(3), 2024 Mar 02.
Article in English | MEDLINE | ID: mdl-38491969

ABSTRACT

We present the first chromosome-level genome assembly and annotation of the pearly heath Coenonympha arcania, generated with a PacBio HiFi sequencing approach and complemented with Hi-C data. We additionally compare synteny, gene, and repeat content between C. arcania and other Lepidopteran genomes. This reference genome will enable future population genomics studies with Coenonympha butterflies, a species-rich genus that encompasses some of the most highly endangered butterfly taxa in Europe.


Subjects
Butterflies , Animals , Butterflies/genetics , Genome , Chromosomes/genetics , Synteny , Europe , Molecular Sequence Annotation
17.
Mol Ecol Resour ; 23(4): 872-885, 2023 May.
Article in English | MEDLINE | ID: mdl-36533297

ABSTRACT

The ithomiine butterflies (Nymphalidae: Danainae) represent the largest known radiation of Müllerian mimetic butterflies. They dominate by number the mimetic butterfly communities, which include species such as the iconic neotropical Heliconius genus. Recent studies on the ecology and genetics of speciation in Ithomiini have suggested that sexual pheromones, colour pattern and perhaps hostplant could drive reproductive isolation. However, no reference genome was available for Ithomiini, which has hindered further exploration on the genetic architecture of these candidate traits, and more generally on the genomic patterns of divergence. Here, we generated high-quality, chromosome-scale genome assemblies for two Melinaea species, M. marsaeus and M. menophilus, and a draft genome of the species Ithomia salapia. We obtained genomes with a size ranging from 396 to 503 Mb across the three species and scaffold N50 of 40.5 and 23.2 Mb for the two chromosome-scale assemblies. Using collinearity analyses we identified massive rearrangements between the two closely related Melinaea species. An annotation of transposable elements and gene content was performed, as well as a specialist annotation to target chemosensory genes, which is crucial for host plant detection and mate recognition in mimetic species. A comparative genomic approach revealed independent gene expansions in ithomiines and particularly in gustatory receptor genes. These first three genomes of ithomiine mimetic butterflies constitute a valuable addition and a welcome comparison to existing biological models such as Heliconius, and will enable further understanding of the mechanisms of adaptation in butterflies.


Subjects
Butterflies , Animals , Butterflies/genetics , Physiological Adaptation , Phenotype , Genomics , Chromosomes/genetics
18.
BMC Bioinformatics ; 13 Suppl 19: S10, 2012.
Article in English | MEDLINE | ID: mdl-23282463

ABSTRACT

BACKGROUND: Nowadays, metagenomic sample analyses are mainly achieved by comparing them with a priori knowledge stored in data banks. While powerful, such approaches cannot exploit unknown and/or "unculturable" species, for instance estimated at 99% for Bacteria. METHODS: This work introduces Compareads, a de novo comparative metagenomic approach that returns the reads that are similar between two possibly metagenomic datasets generated by High Throughput Sequencers. One original feature of this work is its ability to deal with huge datasets. The second main contribution presented in this paper is the design of a probabilistic data structure based on Bloom filters that can index millions of reads with a limited memory footprint and a controlled error rate. RESULTS: We show that Compareads retrieves biological information while being able to scale to huge datasets. Its time and memory features make Compareads usable on read sets each composed of more than 100 million Illumina reads in a few hours and consuming 4 GB of memory, and thus usable on today's personal computers. CONCLUSION: Using a new data structure, Compareads is a practical solution for comparing de novo huge metagenomic samples. Compareads is released under the CeCILL license and can be freely downloaded from http://alcovna.genouest.org/compareads/.


Subjects
Information Storage and Retrieval/methods , Metagenomics/methods , Algorithms
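
As an illustration of the Bloom-filter indexing mentioned in the Compareads abstract above, the sketch below indexes the k-mers of one read set and reports the reads of another set that share enough k-mers with it. The filter size, the number of hash functions, the `min_shared` threshold, and all names are assumptions made for this example; the abstract does not give Compareads' actual similarity criterion or data layout.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, a tunable rate of false positives."""

    def __init__(self, n_bits: int, n_hashes: int):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item: str):
        # One salted hash per hash function.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "big")
            ).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def kmers(read, k):
    return (read[i:i + k] for i in range(len(read) - k + 1))


def index_reads(reads, k, n_bits=1 << 20, n_hashes=3):
    """Insert every k-mer of `reads` into a Bloom filter."""
    bloom = BloomFilter(n_bits, n_hashes)
    for read in reads:
        for kmer in kmers(read, k):
            bloom.add(kmer)
    return bloom


def similar_reads(queries, bloom, k, min_shared=2):
    """Queries with at least `min_shared` k-mers found in the filter (hypothetical criterion)."""
    return [q for q in queries if sum(kmer in bloom for kmer in kmers(q, k)) >= min_shared]
```

Memory use is fixed by `n_bits` regardless of how many reads are indexed; the price is a false-positive rate that grows as the filter fills, which is the "controlled error rate" trade-off the abstract refers to.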
19.
Elife ; 11, 2022 Aug 03.
Article in English | MEDLINE | ID: mdl-35920817

ABSTRACT

Biogeographical studies have traditionally focused on readily visible organisms, but recent technological advances are enabling analyses of the large-scale distribution of microscopic organisms, whose biogeographical patterns have long been debated. Here we assessed the global structure of plankton geography and its relation to the biological, chemical, and physical context of the ocean (the 'seascape') by analyzing metagenomes of plankton communities sampled across oceans during the Tara Oceans expedition, in light of environmental data and ocean current transport. Using a consistent approach across organismal sizes that provides unprecedented resolution to measure changes in genomic composition between communities, we report a pan-ocean, size-dependent plankton biogeography overlying regional heterogeneity. We found robust evidence for a basin-scale impact of transport by ocean currents on plankton biogeography, and on a characteristic timescale of community dynamics going beyond simple seasonality or life history transitions of plankton.


Oceans are brimming with life invisible to our eyes, a myriad of species of bacteria, viruses and other microscopic organisms essential for the health of the planet. These 'marine plankton' are unable to swim against currents and should therefore be constantly on the move, yet previous studies have suggested that distinct species of plankton may in fact inhabit different oceanic regions. However, proving this theory has been challenging; collecting plankton is logistically difficult, and it is often impossible to distinguish between species simply by examining them under a microscope. However, within the last decade, a research schooner called Tara has travelled the globe to gather thousands of plankton samples. At the same time, advances in genomics have made it possible to identify species based only on fragments of their DNA sequence. To understand the hidden geography of plankton communities in Earth's oceans, Richter et al. pored over DNA from the Tara Oceans expedition. This revealed that, despite being unable to resist the flow of water, various planktonic species which live close to the surface manage to occupy distinct, stable provinces shaped by currents. Different sizes of plankton are distributed in different sized provinces, with the smallest organisms tending to inhabit the smallest areas. Comparing DNA similarities and speeds of currents at the ocean surface revealed how these might stretch and mix plankton communities. Plankton play a critical role in the health of the ocean and the chemical cycles of planet Earth. These results could allow deeper investigation by marine modellers, ecologists, and evolutionary biologists. Meanwhile, work is already underway to investigate how climate change might impact this hidden geography.


Subjects
Ecosystem , Plankton , Genomics , Geography , Oceans and Seas , Plankton/genetics
20.
BMC Bioinformatics ; 12: 457, 2011 Nov 24.
Article in English | MEDLINE | ID: mdl-22115330

ABSTRACT

BACKGROUND: Substitution matrices are key parameters for the alignment of two protein sequences, and consequently for most comparative genomics studies. The composition of biological sequences can vary substantially between species and groups of species, and classical matrices such as those in the BLOSUM series fail to accurately estimate alignment scores and statistical significance with sequences sharing marked compositional biases. RESULTS: We present a general and simple methodology to build matrices that are especially fitted to the compositional bias of proteins. Our approach is inspired by the one used to build the BLOSUM matrices and is based on learning substitution and amino acid frequencies on real sequences with the corresponding compositional bias. We applied it to the large scale comparison of Mollicute AT-rich genomes. The new matrix, MOLLI60, was used to predict pairwise orthology relationships, as well as homolog families among 24 Mollicute genomes. We show that this new matrix better discriminates between true and false orthologs and improves the clustering of homologous proteins, with respect to the use of the classical matrix BLOSUM62. CONCLUSIONS: We show in this paper that well-fitted matrices can improve the predictions of orthologous and homologous relationships among proteins with a similar compositional bias. With the ever-increasing number of sequenced genomes, our approach could prove valuable in numerous comparative studies focusing on atypical genomes.


Subjects
Genomics/methods , Software , Tenericutes/classification , Bacterial Proteins/genetics , Amino Acid Sequence Homology , Tenericutes/genetics
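
The BLOSUM-style construction referred to in the abstract above (learning substitution and amino acid frequencies from real aligned sequences) boils down to a log-odds score per residue pair. The sketch below computes such scores in half-bit units from ungapped alignment blocks. It is a simplified illustration that omits the sequence clustering step of the original BLOSUM procedure, and the function name and input layout are assumptions rather than the authors' implementation.

```python
from collections import Counter
from itertools import combinations
from math import log2


def log_odds_scores(blocks):
    """Half-bit log-odds scores from ungapped alignment blocks.

    `blocks` is a list of blocks; each block is a list of equal-length sequences.
    Pairs never observed in the blocks are absent from the result.
    """
    pair_counts = Counter()
    for block in blocks:
        for column in zip(*block):
            for a, b in combinations(column, 2):
                pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    q = {pair: count / total for pair, count in pair_counts.items()}

    # Background frequency of each residue, derived from the observed pairs.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores
```

Training on blocks drawn from compositionally biased proteomes changes both the observed pair frequencies and the background frequencies, which is how a matrix fitted to such genomes can differ from BLOSUM62.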