Results 1 - 20 of 26
1.
Bioinformatics ; 36(22-23): 5344-5350, 2021 Apr 01.
Article in English | MEDLINE | ID: mdl-33346833

ABSTRACT

MOTIVATION: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via 'seeds': simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence. RESULTS: Here, we study a simple sparse-seeding method: using seeds at positions of certain 'words' (e.g. ac, at, gc or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed 'minimizer' sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it. AVAILABILITY AND IMPLEMENTATION: Software to design and test minimally overlapping words is freely available at https://gitlab.com/mcfrith/noverlap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
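The word-based sparse seeding described in this abstract can be illustrated with a small sketch. It is not the noverlap software; the function names and the toy sequence are assumptions made for illustration. It only shows how restricting seeds to positions that start with one of the words ac, at, gc or gt thins out the candidate positions.

```python
def word_positions(seq, words):
    """Return the start positions in seq where one of the chosen words occurs."""
    k = len(next(iter(words)))  # all words are assumed to have the same length
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in words]


def sparse_seeds(seq, words, seed_len):
    """Return (position, seed) pairs, taking seeds only at word positions."""
    return [
        (i, seq[i:i + seed_len])
        for i in word_positions(seq, words)
        if i + seed_len <= len(seq)
    ]


# Word set taken from the abstract's example; the sequence is a toy input.
WORDS = {"ac", "at", "gc", "gt"}
toy_seq = "ttacggtcatg"
print(word_positions(toy_seq, WORDS))
print(sparse_seeds(toy_seq, WORDS, 4))
```

Because every word in this set starts with a or g and ends with c or t, two occurrences can never start at adjacent positions, which is one simple way to see the "anti-clumping" the abstract mentions.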

2.
Bioinformatics ; 35(19): 3547-3552, 2019 Oct 01.
Article in English | MEDLINE | ID: mdl-30994912

ABSTRACT

MOTIVATION: Although modern high-throughput biomolecular technologies produce various types of data, biosequence data remain at the core of bioinformatic analyses. However, computational techniques for dealing with these data have evolved dramatically. RESULTS: In this bird's-eye review, we survey the evolution of the main algorithmic techniques for comparing and searching biological sequences. We highlight key algorithmic ideas that emerged in response to several interconnected factors: shifts in the biological analytical paradigm, the advent of new sequencing technologies and a substantial increase in the size of the available data. We discuss the expansion of alignment-free techniques, which are replacing alignment-based algorithms in large-scale analyses. We further emphasize the recently emerged and growing applications of sketching methods, which support comparison of massive datasets such as metagenomics samples. Finally, we focus on the transition to population genomics and outline the associated algorithmic challenges.


Subject(s)
Algorithms , Metagenomics , Computational Biology , High-Throughput Nucleotide Sequencing , Sequence Analysis , Surveys and Questionnaires
3.
Bioinformatics ; 32(1): 136-9, 2016 Jan 01.
Article in English | MEDLINE | ID: mdl-26353839

ABSTRACT

MOTIVATION: Read simulators combined with alignment evaluation tools provide the most straightforward way to evaluate and compare mappers. Simulation of reads is accompanied by information about their positions in the source genome. This information is then used to evaluate alignments produced by the mapper. Finally, reports containing statistics of successful read alignments are created. In the absence of standards for encoding read origins, every evaluation tool has to be made explicitly compatible with the simulator used to generate reads. RESULTS: To overcome this obstacle, we have created a generic format, Read Naming Format (Rnf), for assigning read names with encoded information about original positions. Furthermore, we have developed an associated software package, RnfTools, containing two principal components. MIShmash applies one of several popular read simulating tools (among DwgSim, Art, Mason, CuReSim, etc.) and transforms the generated reads into Rnf format. LAVEnder then evaluates a given read mapper using simulated reads in Rnf format. Special attention is paid to mapping qualities, which serve for parametrization of ROC curves, and to evaluation of the effect of read sample contamination. AVAILABILITY AND IMPLEMENTATION: RnfTools: http://karel-brinda.github.io/rnftools Spec. of Rnf: http://karel-brinda.github.io/rnf-spec CONTACT: karel.brinda@univ-mlv.fr.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Software , Computer Simulation , Genome , Humans
4.
Bioinformatics ; 31(22): 3584-92, 2015 Nov 15.
Article in English | MEDLINE | ID: mdl-26209798

ABSTRACT

MOTIVATION: Metagenomics is a powerful approach to studying the genetic content of environmental samples, and it has been strongly promoted by next-generation sequencing technologies. To cope with the massive data involved in modern metagenomic projects, recent tools rely on the analysis of k-mers shared between the read to be classified and sampled reference genomes. RESULTS: Within this general framework, we show that spaced seeds significantly improve classification accuracy compared with traditional contiguous k-mers. We support this thesis through a series of different computational experiments, including simulations of large-scale metagenomic projects. AVAILABILITY AND IMPLEMENTATION, SUPPLEMENTARY INFORMATION: Scripts and programs used in this study, as well as supplementary material, are available from http://github.com/gregorykucherov/spaced-seeds-for-metagenomics. CONTACT: gregory.kucherov@univ-mlv.fr.
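To make the contrast between contiguous k-mers and spaced seeds concrete, here is a minimal sketch. It is not taken from the study's scripts: the mask notation ('1' for a sampled position, '0' for a skipped one) and the function names are assumptions for illustration.

```python
def spaced_kmers(seq, mask):
    """For each window of len(mask), keep only the letters under '1' in mask."""
    keep = [i for i, c in enumerate(mask) if c == "1"]
    span = len(mask)
    return [
        "".join(seq[start + i] for i in keep)
        for start in range(len(seq) - span + 1)
    ]


def shared_fraction(read, reference, mask):
    """Fraction of the read's (spaced) k-mers that also occur in the reference."""
    ref_set = set(spaced_kmers(reference, mask))
    read_kmers = spaced_kmers(read, mask)
    if not read_kmers:
        return 0.0
    return sum(k in ref_set for k in read_kmers) / len(read_kmers)


# A contiguous 3-mer is the special case mask "111"; "1101" is a spaced seed
# of the same weight (three sampled positions) spread over a span of four.
print(spaced_kmers("ACGTAC", "111"))
print(spaced_kmers("ACGTAC", "1101"))
```

A classifier built on this idea would score a read against each reference by a quantity such as `shared_fraction`; a mismatch falling under a '0' position does not destroy the match, which is the intuition behind the sensitivity gain.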


Subject(s)
Algorithms , Metagenomics/classification , Bacillus/genetics , Genetic Databases , Bacterial Genome , Mycobacterium/genetics , Probability , Sequence Alignment , Nonparametric Statistics
5.
bioRxiv ; 2023 Apr 18.
Article in English | MEDLINE | ID: mdl-37131636

ABSTRACT

Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections makes it effectively impossible to search these data using tools such as BLAST and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs, and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate it can align genes, plasmids, or entire sequencing experiments against all sequenced bacteria until 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.

6.
J Comput Biol ; 29(2): 140-154, 2022 Feb.
Article in English | MEDLINE | ID: mdl-35049334

ABSTRACT

k-mer counts are important features used by many bioinformatics pipelines. Existing k-mer counting methods focus on optimizing either time or memory usage, and output very large count tables that explicitly represent k-mers together with their counts. Storing k-mers is not needed if the set of k-mers is known, making it possible to keep only the counters and their association with k-mers. Solutions avoiding explicit representation of k-mers include Minimal Perfect Hash Functions (MPHFs) and Count-Min sketches. We introduce the Set-Min sketch, a sketching technique for representing associative maps inspired by Count-Min, and apply it to the problem of representing k-mer count tables. Set-Min is provably more accurate than both Count-Min and Max-Min, an improved variant of Count-Min for static datasets that we define here. We show that the Set-Min sketch provides a very low error rate, in terms of both the probability and the size of errors, at the expense of a very moderate memory increase. On the other hand, Set-Min sketches are shown to take up to an order of magnitude less space than MPHF-based solutions, for fully assembled genomes and large k. The space efficiency of Set-Min in this case takes advantage of the power-law distribution of k-mer counts in genomic datasets.
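For readers unfamiliar with the baseline this abstract builds on, the following is a minimal sketch of a classic Count-Min sketch applied to k-mer counts. It is not Set-Min or Max-Min, whose details are not given here; the class name, the parameters and the use of salted BLAKE2 hashes are assumptions chosen for a self-contained example.

```python
import hashlib
from collections import Counter


class CountMin:
    """Classic Count-Min sketch: depth rows of width counters each."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One salted hash per row gives (approximately) independent hash functions.
        for r in range(self.depth):
            digest = hashlib.blake2b(key.encode(), salt=bytes([r])).digest()
            yield r, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for r, c in self._cells(key):
            self.rows[r][c] += count

    def query(self, key):
        # Collisions can only inflate a counter, so the minimum never underestimates.
        return min(self.rows[r][c] for r, c in self._cells(key))


def count_kmers(seq, k):
    """Exact k-mer counts, used here only to fill and check the sketch."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


truth = count_kmers("ACGTACGTACGTTTGACGT", 4)
sketch = CountMin(width=64, depth=4)
for kmer, n in truth.items():
    sketch.add(kmer, n)
print(all(sketch.query(kmer) >= n for kmer, n in truth.items()))
```

The k-mers themselves are never stored, only counters, which is the property the abstract highlights; the price is a one-sided error that grows as the table gets narrower.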


Subject(s)
Computational Biology/methods , Genomics/statistics & numerical data , Software , Algorithms , Animals , Computer Graphics , Genetic Databases/statistics & numerical data , Human Genome , Humans , Statistical Models , Molecular Sequence Annotation/statistics & numerical data
7.
Algorithms Mol Biol ; 17(1): 5, 2022 Mar 21.
Article in English | MEDLINE | ID: mdl-35317833

ABSTRACT

MOTIVATION: k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Many of these tools output k-mer count tables containing both k-mers and counts, easily reaching tens of GB. Furthermore, such tables do not support efficient random-access queries in general. RESULTS: In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), with space proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom filter to a CSF we obtain a Bloom-enhanced CSF (BCSF), effectively overcoming this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations that break the empirical entropy lower bound, for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E. coli and C. elegans) and unassembled reads, as well as on k-mer document frequency tables for 29 E. coli genomes. In the case of exact counts, our representation takes about half the space of the empirical entropy, for large enough k.
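The "empirical zero-order entropy of the counts" that serves as the reference point in this abstract can be computed directly. The sketch below uses the standard definition H0 = sum over distinct values c of (n_c / n) * log2(n / n_c); the function name and the toy count lists are assumptions for illustration.

```python
import math
from collections import Counter


def zero_order_entropy(values):
    """Empirical zero-order entropy, in bits per value, of a list of values."""
    n = len(values)
    freq = Counter(values)
    return sum(f / n * math.log2(n / f) for f in freq.values())


# Toy lists of k-mer counts: the more skewed the distribution, the lower H0.
uniform_counts = [1, 2, 1, 2]
skewed_counts = [1] * 7 + [2]
print(zero_order_entropy(uniform_counts))
print(zero_order_entropy(skewed_counts))
```

Multiplying H0 by the number of k-mers gives the number of bits a representation needs if it encodes each count independently of its k-mer, which is why a very skewed count distribution (most k-mers occurring once) is the favourable case.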

8.
Genome Biol ; 22(1): 96, 2021 Apr 06.
Article in English | MEDLINE | ID: mdl-33823902

ABSTRACT

de Bruijn graphs play an essential role in bioinformatics, yet they lack a universal scalable representation. Here, we introduce simplitigs as a compact, efficient, and scalable representation, and ProphAsm, a fast algorithm for their computation. For the example of assemblies of model organisms and two bacterial pan-genomes, we compare simplitigs to unitigs, the best existing representation, and demonstrate that simplitigs provide a substantial improvement in the cumulative sequence length and their number. When combined with the commonly used Burrows-Wheeler Transform index, simplitigs reduce memory, and index loading and query times, as demonstrated with large-scale examples of GenBank bacterial pan-genomes.
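The idea of covering a k-mer set with a few longer strings can be sketched with a simple greedy procedure. This is an illustrative simplification, not ProphAsm itself: it ignores reverse complements, assumes k >= 2 and the alphabet ACGT, and all function names are hypothetical.

```python
def kmers_of(seq, k):
    """Set of k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _extend(s, k, remaining, forward):
    """Greedily grow s by one letter at a time while an unused k-mer fits."""
    while True:
        for c in "ACGT":
            cand = s[-(k - 1):] + c if forward else c + s[:k - 1]
            if cand in remaining:
                remaining.remove(cand)
                s = s + c if forward else c + s
                break
        else:
            return s  # no letter extends s with an unused k-mer


def greedy_simplitigs(kmers):
    """Cover every k-mer exactly once with greedily built strings."""
    remaining = set(kmers)
    result = []
    while remaining:
        s = min(remaining)  # deterministic choice of a starting k-mer
        remaining.remove(s)
        k = len(s)
        s = _extend(s, k, remaining, forward=True)
        s = _extend(s, k, remaining, forward=False)
        result.append(s)
    return result


toy_kmers = kmers_of("ACGTTGCATGACG", 4)
toy_simplitigs = greedy_simplitigs(toy_kmers)
print(toy_simplitigs)
```

With N k-mers packed into S strings the total length is N + (k - 1) * S, so fewer and longer strings directly reduce the cumulative sequence length, the quantity the abstract compares against unitigs.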


Subject(s)
Algorithms , Computational Biology/methods , DNA Sequence Analysis/methods , Software , Genomics/methods
9.
J Bacteriol ; 192(19): 5143-50, 2010 Oct.
Article in English | MEDLINE | ID: mdl-20693331

ABSTRACT

Nonribosomal peptides (NRPs) are molecules produced by microorganisms that have a broad spectrum of biological activities and pharmaceutical applications (e.g., antibiotic, immunomodulating, and antitumor activities). One particularity of the NRPs is the biodiversity of their monomers, extending far beyond the 20 proteogenic amino acid residues. Norine, a comprehensive database of NRPs, allowed us to review for the first time the main characteristics of the NRPs and especially their monomer biodiversity. Our analysis highlighted a significant similarity relationship between NRPs synthesized by bacteria and those isolated from metazoa, especially from sponges, supporting the hypothesis that some NRPs isolated from sponges are actually synthesized by symbiotic bacteria rather than by the sponges themselves. A comparison of peptide monomeric compositions as a function of biological activity showed that some monomers are specific to a class of activities. An analysis of the monomer compositions of peptide products predicted from genomic information (metagenomics and high-throughput genome sequencing) or of new peptides detected by mass spectrometry analysis applied to a culture supernatant can provide indications of the origin of a peptide and/or its biological activity.


Subject(s)
Peptides/chemistry , Factual Databases , Theoretical Models , Peptide Synthases/metabolism , Peptides/metabolism
10.
Trends Genet ; 23(11): 543-6, 2007 Nov.
Article in English | MEDLINE | ID: mdl-17964682

ABSTRACT

By conventional wisdom, a feature that occurs too often or too rarely in a genome can indicate a functional element. To infer functionality from frequency, it is crucial to precisely characterize occurrences in randomly evolving DNA. We find that the frequency of oligonucleotides in a genomic sequence follows primarily a Pareto-lognormal distribution, which encapsulates lognormal and power-law features found across all known genomes. Such a distribution could be the result of completely random evolution by a copying process. Our characterization of the entire frequency distribution of genomic words opens a way to a more accurate reasoning about their over- and underrepresentation in genomic sequences.


Subject(s)
Genomics , Animals , Molecular Evolution , Gene Duplication , Genome , Humans , Markov Chains , Oligonucleotides/metabolism
11.
Nucleic Acids Res ; 36(Database issue): D326-31, 2008 Jan.
Article in English | MEDLINE | ID: mdl-17913739

ABSTRACT

Norine is the first database entirely dedicated to nonribosomal peptides (NRPs). In bacteria and fungi, in addition to traditional ribosomal protein biosynthesis, an alternative ribosome-independent pathway called NRP synthesis allows peptide production. It is performed by huge protein complexes called nonribosomal peptide synthetases (NRPSs). The molecules synthesized by NRPSs contain a high proportion of nonproteogenic amino acids. The primary structure of these peptides is not always linear but often more complex and may contain cycles and branchings. In recent years, NRPs have attracted a lot of attention because of their biological activities and pharmacological properties (antibiotic, immunosuppressor, antitumor, etc.). However, few computational resources and tools dedicated to those peptides have been available so far. Norine is focused on NRPs and contains more than 700 entries. The database is freely accessible at http://bioinfo.lifl.fr/norine/. It provides a complete computational tool for the systematic study of NRPs in numerous species and, as such, should enable a better understanding of these metabolic products and the underlying biological mechanisms, and ultimately contribute to the redesign of natural products in order to obtain new bioactive compounds for drug discovery.


Subject(s)
Protein Databases , Nucleic Acid-Independent Peptide Biosynthesis , Peptides/chemistry , Internet , Peptide Synthases/metabolism , User-Computer Interface
12.
Nat Microbiol ; 5(3): 455-464, 2020 Mar.
Article in English | MEDLINE | ID: mdl-32042129

ABSTRACT

Surveillance of drug-resistant bacteria is essential for healthcare providers to deliver effective empirical antibiotic therapy. However, traditional molecular epidemiology does not typically occur on a timescale that could affect patient treatment and outcomes. Here, we present a method called 'genomic neighbour typing' for inferring the phenotype of a bacterial sample by identifying its closest relatives in a database of genomes with metadata. We show that this technique can infer antibiotic susceptibility and resistance for both Streptococcus pneumoniae and Neisseria gonorrhoeae. We implemented this with rapid k-mer matching, which, when used on Oxford Nanopore MinION data, can run in real time. This resulted in the determination of resistance within 10 min (91% sensitivity and 100% specificity for S. pneumoniae and 81% sensitivity and 100% specificity for N. gonorrhoeae from isolates with a representative database) of starting sequencing, and within 4 h of sample collection (75% sensitivity and 100% specificity for S. pneumoniae) for clinical metagenomic sputum samples. This flexible approach has wide application for pathogen surveillance and may be used to greatly accelerate appropriate empirical antibiotic treatment.
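The core idea of neighbour typing, assigning a sample the phenotype of its closest database relative by k-mer matching, can be shown with a toy sketch. Everything here is hypothetical: the isolate names, sequences, phenotype labels and the simple shared-k-mer score are illustrative assumptions, not the published method or its data.

```python
def kmers(seq, k):
    """Set of k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def nearest_neighbour(reads, database, k):
    """Score each database genome by shared k-mers; return the best match."""
    index = {name: kmers(entry["seq"], k) for name, entry in database.items()}
    scores = {name: 0 for name in database}
    for read in reads:  # reads can be processed one by one as they arrive
        read_kmers = kmers(read, k)
        for name, genome_kmers in index.items():
            scores[name] += len(read_kmers & genome_kmers)
    best = max(scores, key=scores.get)
    return best, database[best]["phenotype"], scores


# Toy database with made-up sequences and labels.
toy_db = {
    "isolate_A": {"seq": "ACGTACGTTAGCCGATAG", "phenotype": "susceptible"},
    "isolate_B": {"seq": "ACGTTCGATAGGCTTACA", "phenotype": "resistant"},
}
toy_reads = ["TCGATAGGC", "GCTTACA"]
best, phenotype, scores = nearest_neighbour(toy_reads, toy_db, k=5)
print(best, phenotype)
```

Because the scores are simple running sums over reads, they can be updated while sequencing is still in progress, which is what makes a real-time call plausible in the setting the abstract describes.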


Subject(s)
Anti-Bacterial Agents/pharmacology , Bacterial Typing Techniques/methods , Multiple Bacterial Drug Resistance/drug effects , Multiple Bacterial Drug Resistance/genetics , Genomics , Factual Databases , Humans , Microbial Sensitivity Tests/methods , Molecular Epidemiology , Neisseria gonorrhoeae/drug effects , Neisseria gonorrhoeae/genetics , Neisseria gonorrhoeae/isolation & purification , Phenotype , Sensitivity and Specificity , Streptococcus pneumoniae/drug effects , Streptococcus pneumoniae/genetics , Streptococcus pneumoniae/isolation & purification
13.
BMC Struct Biol ; 9: 15, 2009 Mar 18.
Article in English | MEDLINE | ID: mdl-19296847

ABSTRACT

BACKGROUND: Nonribosomal peptides (NRPs), bioactive secondary metabolites produced by many microorganisms, show a broad range of important biological activities (e.g. antibiotics, immunosuppressants, antitumor agents). NRPs are mainly composed of amino acids but their primary structure is not always linear and can contain cycles or branchings. Furthermore, there are several hundred different monomers that can be incorporated into NRPs. The NORINE database, the first resource entirely dedicated to NRPs, currently stores more than 700 NRPs annotated with their monomeric peptide structure encoded by undirected labeled graphs. This opens a way to a systematic analysis of structural patterns occurring in NRPs. Such studies can investigate the functional role of some monomeric chains, or analyse NRPs that have been computationally predicted from the synthetase protein sequence. A basic operation in such analyses is the search for a given structural pattern in the database. RESULTS: We developed an efficient method that allows for a quick search for a structural pattern in the NORINE database. The method identifies all peptides containing a pattern substructure of a given size. This amounts to solving a variant of the maximum common subgraph problem on pattern and peptide graphs, which is done by computing cliques in an appropriate compatibility graph. CONCLUSION: The method has been incorporated into the NORINE database, available at http://bioinfo.lifl.fr/norine. Less than one second is needed to search for a pattern in the entire database.


Subject(s)
Protein Databases , Nucleic Acid-Independent Peptide Biosynthesis , Peptides/chemistry , Internet , Protein Conformation , User-Computer Interface
14.
BMC Bioinformatics ; 9: 73, 2008 Jan 31.
Article in English | MEDLINE | ID: mdl-18237374

ABSTRACT

BACKGROUND: Many programs have been developed to identify transcription factor binding sites. However, most of them are not able to infer two-word motifs with variable spacer lengths. This case is encountered for RNA polymerase Sigma (sigma) Factor Binding Sites (SFBSs) usually composed of two boxes, called -35 and -10 in reference to the transcription initiation point. Our goal is to design an algorithm detecting SFBS by using combinational and statistical constraints deduced from biological observations. RESULTS: We describe a new approach to identify SFBSs by comparing two related bacterial genomes. The method, named SIGffRid (SIGma Factor binding sites Finder using R'MES to select Input Data), performs a simultaneous analysis of pairs of promoter regions of orthologous genes. SIGffRid uses a prior identification of over-represented patterns in whole genomes as selection criteria for potential -35 and -10 boxes. These patterns are then grouped using pairs of short seeds (of which one is possibly gapped), allowing a variable-length spacer between them. Next, the motifs are extended guided by statistical considerations, a feature that ensures a selection of motifs with statistically relevant properties. We applied our method to the pair of related bacterial genomes of Streptomyces coelicolor and Streptomyces avermitilis. Cross-check with the well-defined SFBSs of the SigR regulon in S. coelicolor is detailed, validating the algorithm. SFBSs for HrdB and BldN were also found; and the results suggested some new targets for these sigma factors. In addition, consensus motifs for BldD and new SFBSs binding sites were defined, overlapping previously proposed consensuses. Relevant tests were carried out also on bacteria with moderate GC content (i.e. Escherichia coli/Salmonella typhimurium and Bacillus subtilis/Bacillus licheniformis pairs). Motifs of house-keeping sigma factors were found as well as other SFBSs such as that of SigW in Bacillus strains. 
CONCLUSION: We demonstrate that our approach, combining statistical and biological criteria, successfully predicts SFBSs. The versatility of the method allows the recognition of other kinds of two-box regulatory sites.


Subject(s)
Algorithms , Chromosome Mapping/methods , Bacterial Genome/genetics , Automated Pattern Recognition/methods , DNA Sequence Analysis/methods , Sigma Factor/genetics , Software , Binding Sites , Protein Binding
15.
BMC Bioinformatics ; 9: 534, 2008 Dec 16.
Article in English | MEDLINE | ID: mdl-19087280

ABSTRACT

BACKGROUND: Similarity inference, one of the main bioinformatics tasks, faces exponential growth of biological data. A classical approach used to cope with this data flow involves heuristics with large seed indexes. In order to speed up this technique, the index can be enhanced by storing additional information to limit the number of random memory accesses. However, this improvement leads to a larger index that may become a bottleneck. In the case of protein similarity search, we propose to decrease the index size by reducing the amino acid alphabet. RESULTS: The paper presents two main contributions. First, we show that an optimal neighborhood indexing combining an alphabet reduction and a longer neighborhood reduces the memory involved in the process by 35%, without sacrificing either the quality of results or the computational time. Second, our approach led us to develop a new kind of substitution score matrices and their associated e-value parameters. In contrast to usual matrices, these matrices are rectangular since they compare amino acid groups from different alphabets. We describe the method used for computing those matrices and we provide some typical examples that can be used in such comparisons. Supplementary data can be found on the website http://bioinfo.lifl.fr/reblosum. CONCLUSION: We propose a practical index size reduction of the neighborhood data that does not negatively affect the performance of large-scale search in protein sequences. Such an index can be used in any study involving large protein data. Moreover, rectangular substitution score matrices and their associated statistical parameters can have applications in any study involving an alphabet reduction.


Subject(s)
Abstracting and Indexing/methods , Algorithms , Computational Biology/methods , Sequence Alignment/methods , Protein Sequence Analysis/methods , Protein Databases , Information Storage and Retrieval , Proteins/chemistry
16.
BMC Genomics ; 8: 409, 2007 Nov 09.
Article in English | MEDLINE | ID: mdl-17996080

ABSTRACT

BACKGROUND: Transposable elements constitute a significant fraction of plant genomes. The PIF/Harbinger superfamily includes DNA transposons (class II elements) carrying terminal inverted repeats and producing a 3 bp target site duplication upon insertion. The presence of an ORF coding for the DDE/DDD transposase, required for transposition, is characteristic of the autonomous PIF/Harbinger-like elements. Based on the above features, PIF/Harbinger-like elements were identified in several plant genomes and divided into several evolutionary lineages. Availability of a significant portion of Medicago truncatula genomic sequence allowed for mining PIF/Harbinger-like elements, starting from a single previously described element MtMaster. RESULTS: Twenty-two putative autonomous elements (i.e. carrying an ORF coding for TPase and complete terminal inverted repeats) and 67 non-autonomous PIF/Harbinger-like elements were found in the genome of M. truncatula. They were divided into five families, MtPH-A5, MtPH-A6, MtPH-D, MtPH-E, and MtPH-M, corresponding to three previously identified and two new lineages. The largest families, MtPH-A6 and MtPH-M, were further divided into four and three subfamilies, respectively. Non-autonomous elements were usually direct deletion derivatives of the putative autonomous element; however, other types of rearrangements, including inversions and nested insertions, were also observed. An interesting structural characteristic, the presence of 60 bp tandem repeats, was observed in a group of elements of subfamily MtPH-A6-4. Some families could be related to miniature inverted repeat elements (MITEs). The presence of empty loci (RESites), paralogous to those flanking the identified transposable elements, both autonomous and non-autonomous, as well as the presence of transposon insertion related size polymorphisms, confirmed that some of the mined elements were capable of transposition.
CONCLUSION: The population of PIF/Harbinger-like elements in the genome of M. truncatula is diverse. A detailed intra-family comparison of the elements' structure proved that they proliferated in the genome generally following the model of abortive gap repair. However, the presence of tandem repeats facilitated more pronounced rearrangements of the element internal regions. The insertion polymorphism of the MtPH elements and related MITE families in different populations of M. truncatula, if further confirmed experimentally, could be used as a source of molecular markers complementary to other marker systems.


Subject(s)
DNA Transposable Elements/genetics , Genetic Variation , Plant Genome/genetics , Medicago truncatula/genetics , Chromosome Inversion , Molecular Evolution , Expressed Sequence Tags , Medicago truncatula/enzymology , Minisatellite Repeats , Multigene Family , Insertional Mutagenesis/genetics , Open Reading Frames/genetics , Phylogeny , Genetic Polymorphism , Sequence Alignment , Terminal Repeat Sequences/genetics , Transposases/genetics
17.
Nucleic Acids Res ; 33(Web Server issue): W540-3, 2005 Jul 01.
Article in English | MEDLINE | ID: mdl-15980530

ABSTRACT

YASS is a DNA local alignment tool based on an efficient and sensitive filtering algorithm. It applies transition-constrained seeds to specify the most probable conserved motifs between homologous sequences, combined with a flexible hit criterion used to identify groups of seeds that are likely to exhibit significant alignments. A web interface (http://www.loria.fr/projects/YASS/) is available to upload input sequences in fasta format, query the program and visualize the results obtained in several forms (dot-plot, tabular output and others). A standalone version is available for download from the web page.
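A transition-constrained seed can be illustrated with a short sketch. This is not YASS code: the three-symbol pattern notation used here ('#' for a required match, '@' for a match or a transition A<->G / C<->T, '-' for a position that is ignored) and the function name are assumptions made for this example.

```python
# Transitions are purine<->purine (A, G) and pyrimidine<->pyrimidine (C, T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def seed_matches(window_a, window_b, pattern):
    """True if two equal-length windows satisfy the seed pattern."""
    for x, y, symbol in zip(window_a, window_b, pattern):
        if symbol == "#" and x != y:
            return False  # exact match required
        if symbol == "@" and x != y and (x, y) not in TRANSITIONS:
            return False  # only a match or a transition is tolerated
        # '-' positions are ignored entirely
    return True


pattern = "#@-#"
print(seed_matches("ACGT", "ATGT", pattern))  # C/T is a transition under '@'
print(seed_matches("ACGT", "AGGT", pattern))  # C/G is a transversion under '@'
```

Allowing transitions at selected positions reflects the observation that they are more frequent than transversions between homologous DNA sequences, so such a seed tolerates the likeliest mismatches without becoming as unselective as a plain don't-care position.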


Subject(s)
Sequence Alignment/methods , DNA Sequence Analysis/methods , Nucleic Acid Sequence Homology , Software , Algorithms , Computer Graphics , Genomics/methods , Internet , User-Computer Interface
18.
J Bioinform Comput Biol ; 4(2): 553-69, 2006 Apr.
Article in English | MEDLINE | ID: mdl-16819802

ABSTRACT

We propose a general approach to computing seed sensitivity that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem (a set of target alignments, an associated probability distribution, and a seed model), which are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds, for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be efficiently designed using our approach, and can then be used in similarity search, producing better results than ordinary spaced seeds.


Subject(s)
Algorithms , Artificial Intelligence , Automated Pattern Recognition/methods , Sequence Alignment/methods , DNA Sequence Analysis/methods , Base Sequence , Molecular Sequence Data , Reproducibility of Results , Sensitivity and Specificity
19.
Nucleic Acids Res ; 31(13): 3672-8, 2003 Jul 01.
Article in English | MEDLINE | ID: mdl-12824391

ABSTRACT

The presence of repeated sequences is a fundamental feature of genomes. Tandemly repeated DNA appears in both eukaryotic and prokaryotic genomes; it is associated with various regulatory mechanisms and plays an important role in genomic fingerprinting. In this paper, we describe mreps, a powerful software tool for fast identification of tandemly repeated structures in DNA sequences. mreps is able to identify all types of tandem repeats within a single run on a whole genomic sequence. It has a resolution parameter that allows the program to identify 'fuzzy' repeats. We introduce the main algorithmic solutions behind mreps, describe its usage, give some execution time benchmarks and present several case studies to illustrate its capabilities. The mreps web interface is accessible through http://www.loria.fr/mreps/.
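As a point of reference for what a tandem repeat finder reports, here is a naive sketch that detects exact tandem repeats only. It is not the mreps algorithm and has no notion of 'fuzzy' repeats; the function name, the parameters and the (start, length, period) output format are assumptions.

```python
def tandem_repeats(s, max_period, min_copies=2):
    """Find maximal exact tandem repeats as (start, total_length, period).

    For each period p, a run of positions j with s[j] == s[j + p] of length
    `run` corresponds to a periodic stretch of run + p letters.
    """
    found = []
    for p in range(1, max_period + 1):
        i = 0
        while i + p < len(s):
            j = i
            while j + p < len(s) and s[j] == s[j + p]:
                j += 1
            run = j - i
            if run + p >= min_copies * p:
                found.append((i, run + p, p))
            i = max(j, i + 1)  # jump past the run, or advance by one
    return found


toy = "GGACACACTT"
for start, length, period in tandem_repeats(toy, max_period=2):
    print(start, period, toy[start:start + length])
```

This simple scan takes time proportional to the sequence length for each period tried and also reports non-primitive periods (a run of a single letter is found again with period 2), two of the shortcomings a dedicated tool has to address.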


Subject(s)
DNA Sequence Analysis/methods , Software , Tandem Repeat Sequences , Algorithms , DNA/chemistry , Bacterial DNA/chemistry , Bacterial Genes , Human Genome , Genomics/methods , Humans , Internet , Neisseria meningitidis/genetics , Genetic Polymorphism , Saccharomyces cerevisiae/genetics
20.
Article in English | MEDLINE | ID: mdl-17044164

ABSTRACT

We study a method of seed-based lossless filtration for approximate string matching and related bioinformatics applications. The method is based on a simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt and Kärkkäinen. We present algorithms to compute several important parameters of seed families, study their combinatorial properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed technique to the problem of oligonucleotide selection for an EST sequence database.
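Lossless filtration has a crisp definition that a brute-force check makes explicit: a seed family is lossless for parameters (m, k) if every gapless alignment of length m with at most k mismatches is hit by at least one seed. The sketch below is an exhaustive test under that definition, not one of the paper's algorithms; the '#'/'-' seed notation and the function names are assumptions.

```python
from itertools import combinations


def seed_hits(seed, mismatches, m):
    """True if the seed fits somewhere without touching a mismatch position."""
    care = [i for i, c in enumerate(seed) if c == "#"]
    return any(
        all(start + i not in mismatches for i in care)
        for start in range(m - len(seed) + 1)
    )


def is_lossless(family, m, k):
    """Check every placement of exactly k mismatches among m positions.

    Exactly k suffices: removing a mismatch can never destroy a hit.
    """
    for combo in combinations(range(m), k):
        mismatches = set(combo)
        if not any(seed_hits(seed, mismatches, m) for seed in family):
            return False
    return True


# A contiguous seed of weight 3 with k = 2: by the pigeonhole principle it is
# lossless once m - k matches cannot all sit in 3 runs shorter than 3.
print(is_lossless(["###"], 9, 2))
print(is_lossless(["###"], 8, 2))
```

The number of mismatch placements grows as m choose k, so exhaustive checking is only practical for small parameters; it is nevertheless a convenient way to validate a candidate family before relying on it.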


Subject(s)
Algorithms , Expressed Sequence Tags , Sequence Alignment/methods , DNA Sequence Analysis/methods , Base Sequence , Molecular Sequence Data , Nucleic Acid Sequence Homology