Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 20
Filtrar
Mais filtros










Base de dados
Intervalo de ano de publicação
1.
Bioinform Adv ; 3(1): vbac085, 2023.
Artigo em Inglês | MEDLINE | ID: mdl-36698762

RESUMO

Motivation: Oxford Nanopore Technologies (ONT) sequencing has become very popular over the past few years and offers a cost-effective solution for many genomic and transcriptomic projects. One distinctive feature of the technology is that the protocol includes the ligation of adapters to both ends of each fragment. Those adapters should then be removed before downstream analyses, either during the basecalling step or by explicit trimming. This basic task may be tricky when the definition of the adapter sequence is not well documented. Results: We have developed a new method to scan a set of ONT reads to see if it contains adapters, without any prior knowledge on the sequence of the potential adapters, and then trim out those adapters. The algorithm is based on approximate k-mers and is able to discover adapter sequences based on their frequency alone. The method was successfully tested on a variety of ONT datasets with different flowcells, sequencing kits and basecallers. Availability and implementation: The resulting software, named Porechop_ABI, is open-source and is available at https://github.com/bonsai-team/Porechop_ABI. Supplementary information: Supplementary data are available at Bioinformatics advances online.

2.
Bioinformatics ; 36(22-23): 5344-5350, 2021 Apr 01.
Artigo em Inglês | MEDLINE | ID: mdl-33346833

RESUMO

MOTIVATION: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via 'seeds': simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence. RESULTS: Here, we study a simple sparse-seeding method: using seeds at positions of certain 'words' (e.g. ac, at, gc or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed 'minimizer' sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it. AVAILABILITY AND IMPLEMENTATION: Software to design and test minimally overlapping words is freely available at https://gitlab.com/mcfrith/noverlap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

3.
J Comput Biol ; 27(2): 223-233, 2020 Feb.
Artigo em Inglês | MEDLINE | ID: mdl-31800307

RESUMO

Alignment-free classification of sequences has enabled high-throughput processing of sequencing data in many bioinformatics pipelines. Much work has been done to speed up the indexing of k-mers through hash-table and other data structures. These efforts have led to very fast indexes, but because they are k-mer based, they often lack sensitivity due to sequencing errors or polymorphisms. Spaced seeds are a special type of pattern that accounts for errors or mutations. They allow to improve the sensitivity and they are now routinely used instead of k-mers in many applications. The major drawback of spaced seeds is that they cannot be efficiently hashed and thus their usage increases substantially the computational time. In this article we address the problem of efficient spaced seed hashing. We propose an iterative algorithm that combines multiple spaced seed hashes by exploiting the similarity of adjacent hash values to efficiently compute the next hash. We report a series of experiments on HTS reads hashing, with several spaced seeds. Our algorithm can compute the hashing values of spaced seeds with a speedup in range of [3.5 × -7 × ], outperforming previous methods. Software and data sets are available at Iterative Spaced Seed Hashing.

4.
BMC Bioinformatics ; 19(1): 223, 2018 06 11.
Artigo em Inglês | MEDLINE | ID: mdl-29890948

RESUMO

BACKGROUND: Discovering over-represented approximate motifs in DNA sequences is an essential part of bioinformatics. This topic has been studied extensively because of the increasing number of potential applications. However, it remains a difficult challenge, especially with the huge quantity of data generated by high throughput sequencing technologies. To overcome this problem, existing tools use greedy algorithms and probabilistic approaches to find motifs in reasonable time. Nevertheless these approaches lack sensitivity and have difficulties coping with rare and subtle motifs. RESULTS: We developed DiNAMO (for DNA MOtif), a new software based on an exhaustive and efficient algorithm for IUPAC motif discovery. We evaluated DiNAMO on synthetic and real datasets with two different applications, namely ChIP-seq peaks and Systematic Sequencing Error analysis. DiNAMO proves to compare favorably with other existing methods and is robust to noise. CONCLUSIONS: We shown that DiNAMO software can serve as a tool to search for degenerate motifs in an exact manner using IUPAC models. DiNAMO can be used in scanning mode with sliding windows or in fixed position mode, which makes it suitable for numerous potential applications. AVAILABILITY: https://github.com/bonsai-team/DiNAMO .


Assuntos
Biologia Computacional/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Motivos de Nucleotídeos , Análise de Sequência de DNA/métodos , Software , Algoritmos , Imunoprecipitação da Cromatina , Humanos
5.
Algorithms Mol Biol ; 12: 1, 2017.
Artigo em Inglês | MEDLINE | ID: mdl-28289437

RESUMO

BACKGROUND: Spaced seeds, also named gapped q-grams, gapped k-mers, spaced q-grams, have been proven to be more sensitive than contiguous seeds (contiguous q-grams, contiguous k-mers) in nucleic and amino-acid sequences analysis. Initially proposed to detect sequence similarities and to anchor sequence alignments, spaced seeds have more recently been applied in several alignment-free related methods. Unfortunately, spaced seeds need to be initially designed. This task is known to be time-consuming due to the number of spaced seed candidates. Moreover, it can be altered by a set of arbitrary chosen parameters from the probabilistic alignment models used. In this general context, Dominant seeds have been introduced by Mak and Benson (Bioinformatics 25:302-308, 2009) on the Bernoulli model, in order to reduce the number of spaced seed candidates that are further processed in a parameter-free calculation of the sensitivity. RESULTS: We expand the scope of work of Mak and Benson on single and multiple seeds by considering the Hit Integration model of Chung and Park (BMC Bioinform 11:31, 2010), demonstrate that the same dominance definition can be applied, and that a parameter-free study can be performed without any significant additional cost. We also consider two new discrete models, namely the Heaviside and the Dirac models, where lossless seeds can be integrated. From a theoretical standpoint, we establish a generic framework on all the proposed models, by applying a counting semi-ring to quickly compute large polynomial coefficients needed by the dominance filter. From a practical standpoint, we confirm that dominant seeds reduce the set of, either single seeds to thoroughly analyse, or multiple seeds to store. Moreover, in http://bioinfo.cristal.univ-lille.fr/yass/iedera_dominance, we provide a full list of spaced seeds computed on the four aforementioned models, with one (continuous) parameter left free for each model, and with several (discrete) alignment lengths.

6.
Nucleic Acids Res ; 44(D1): D1113-8, 2016 Jan 04.
Artigo em Inglês | MEDLINE | ID: mdl-26527733

RESUMO

Since its creation in 2006, Norine remains the unique knowledgebase dedicated to non-ribosomal peptides (NRPs). These secondary metabolites, produced by bacteria and fungi, harbor diverse interesting biological activities (such as antibiotic, antitumor, siderophore or surfactant) directly related to the diversity of their structures. The Norine team goal is to collect the NRPs and provide tools to analyze them efficiently. We have developed a user-friendly interface and dedicated tools to provide a complete bioinformatics platform. The knowledgebase gathers abundant and valuable annotations on more than 1100 NRPs. To increase the quantity of described NRPs and improve the quality of associated annotations, we are now opening Norine to crowdsourcing. We believe that contributors from the scientific community are the best experts to annotate the NRPs they work on. We have developed MyNorine to facilitate the submission of new NRPs or modifications of stored ones. This article presents MyNorine and other novelties of Norine interface released since the first publication. Norine is freely accessible from the following URL: http://bioinfo.lifl.fr/NRP.


Assuntos
Bases de Dados de Compostos Químicos , Peptídeos/química , Peptídeos/farmacologia , Internet , Bases de Conhecimento , Anotação de Sequência Molecular , Peptídeos/metabolismo
7.
J Cheminform ; 7: 62, 2015.
Artigo em Inglês | MEDLINE | ID: mdl-26715946

RESUMO

BACKGROUND: The monomeric composition of polymers is powerful for structure comparison and synthetic biology, among others. Many databases give access to the atomic structure of compounds but the monomeric structure of polymers is often lacking. We have designed a smart algorithm, implemented in the tool Smiles2Monomers (s2m), to infer efficiently and accurately the monomeric structure of a polymer from its chemical structure. RESULTS: Our strategy is divided into two steps: first, monomers are mapped on the atomic structure by an efficient subgraph-isomorphism algorithm ; second, the best tiling is computed so that non-overlapping monomers cover all the structure of the target polymer. The mapping is based on a Markovian index built by a dynamic programming algorithm. The index enables s2m to search quickly all the given monomers on a target polymer. After, a greedy algorithm combines the mapped monomers into a consistent monomeric structure. Finally, a local branch and cut algorithm refines the structure. We tested this method on two manually annotated databases of polymers and reconstructed the structures de novo with a sensitivity over 90 %. The average computation time per polymer is 2 s. CONCLUSION: s2m automatically creates de novo monomeric annotations for polymers, efficiently in terms of time computation and sensitivity. s2m allowed us to detect annotation errors in the tested databases and to easily find the accurate structures. So, s2m could be integrated into the curation process of databases of small compounds to verify the current entries and accelerate the annotation of new polymers. The full method can be downloaded or accessed via a website for peptide-like polymers at http://bioinfo.lifl.fr/norine/smiles2monomers.jsp.Graphical abstract:.

8.
Methods Mol Biol ; 1269: 279-91, 2015.
Artigo em Inglês | MEDLINE | ID: mdl-25577385

RESUMO

Metatranscriptomic data contributes another piece of the puzzle to understanding the phylogenetic structure and function of a community of organisms. High-quality total RNA is a bountiful mixture of ribosomal, transfer, messenger and other noncoding RNAs, where each family of RNA is vital to answering questions concerning the hidden microbial world. Software tools designed for deciphering metatranscriptomic data fall under two main categories: the first is to reassemble millions of short nucleotide fragments produced by high-throughput sequencing technologies into the original full-length transcriptomes for all organisms within a sample, and the second is to taxonomically classify the organisms and determine their individual functional roles within a community. Species identification is mainly established using the ribosomal RNA genes, whereas the behavior and functionality of a community is revealed by the messenger RNA of the expressed genes. Numerous chemical and computational methods exist to separate families of RNA prior to conducting further downstream analyses, primarily suitable for isolating mRNA or rRNA from a total RNA sample. In this chapter, we demonstrate a computational technique for filtering rRNA from total RNA using the software SortMeRNA. Additionally, we propose a post-processing pipeline using the latest software tools to conduct further studies on the filtered data, including the reconstruction of mRNA transcripts for functional analyses and phylogenetic classification of a community using the ribosomal RNA.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala/métodos , Metagenômica/métodos , RNA Ribossômico 16S/genética , Software
9.
J Comput Biol ; 21(12): 947-63, 2014 Dec.
Artigo em Inglês | MEDLINE | ID: mdl-25393923

RESUMO

Spaced seeds have been recently shown to not only detect more alignments, but also to give a more accurate measure of phylogenetic distances, and to provide a lower misclassification rate when used with Support Vector Machines (SVMs). We confirm by independent experiments these two results, and propose in this article to use a coverage criterion to measure the seed efficiency in both cases in order to design better seed patterns. We show first how this coverage criterion can be directly measured by a full automaton-based approach. We then illustrate how this criterion performs when compared with two other criteria frequently used, namely the single-hit and multiple-hit criteria, through correlation coefficients with the correct classification/the true distance. At the end, for alignment-free distances, we propose an extension by adopting the coverage criterion, show how it performs, and indicate how it can be efficiently computed.


Assuntos
Reconhecimento Automatizado de Padrão , Alinhamento de Sequência , Máquina de Vetores de Suporte , Algoritmos , Humanos , Análise de Sequência de Proteína , Homologia de Sequência de Aminoácidos , Software
10.
Nucleic Acids Res ; 42(7): e59, 2014 Apr.
Artigo em Inglês | MEDLINE | ID: mdl-24493737

RESUMO

Sequence similarity search is a fundamental way of analyzing nucleotide sequences. Despite decades of research, this is not a solved problem because there exist many similarities that are not found by current methods. Search methods are typically based on a seed-and-extend approach, which has many variants (e.g. spaced seeds, transition seeds), and it remains unclear how to optimize this approach. This study designs and tests seeding methods for inter-mammal and inter-insect genome comparison. By considering substitution patterns of real genomes, we design sets of multiple complementary transition seeds, which have better performance (sensitivity per run time) than previous seeding strategies. Often the best seed patterns have more transition positions than those used previously. We also point out that recent computer memory sizes (e.g. 60 GB) make it feasible to use multiple (e.g. eight) seeds for whole mammal genomes. Interestingly, the most sensitive settings achieve diminishing returns for human-dog and melanogaster-pseudoobscura comparisons, but not for human-mouse, which suggests that we still miss many human-mouse alignments. Our optimized heuristics find ∼20,000 new human-mouse alignments that are missing from the standard UCSC alignments. We tabulate seed patterns and parameters that work well so they can be used in future research.


Assuntos
Genoma Humano , Genômica/métodos , Alinhamento de Sequência/métodos , Análise de Sequência de DNA/métodos , Animais , Cães , Genoma , Humanos , Camundongos
11.
Bioinformatics ; 28(24): 3211-7, 2012 Dec 15.
Artigo em Inglês | MEDLINE | ID: mdl-23071270

RESUMO

MOTIVATION: The application of next-generation sequencing (NGS) technologies to RNAs directly extracted from a community of organisms yields a mixture of fragments characterizing both coding and non-coding types of RNAs. The task to distinguish among these and to further categorize the families of messenger RNAs and ribosomal RNAs (rRNAs) is an important step for examining gene expression patterns of an interactive environment and the phylogenetic classification of the constituting species. RESULTS: We present SortMeRNA, a new software designed to rapidly filter rRNA fragments from metatranscriptomic data. It is capable of handling large sets of reads and sorting out all fragments matching to the rRNA database with high sensitivity and low running time.


Assuntos
Perfilação da Expressão Gênica , RNA Ribossômico/química , Software , Algoritmos , Ecossistema , Filogenia , RNA Mensageiro/química , RNA Ribossômico/biossíntese , RNA Ribossômico/classificação , RNA Ribossômico 16S/química , Análise de Sequência de RNA
12.
Artigo em Inglês | MEDLINE | ID: mdl-20936175

RESUMO

The advent of high-throughput sequencing technologies constituted a major advance in genomic studies, offering new prospects in a wide range of applications.We propose a rigorous and flexible algorithmic solution to mapping SOLiD color-space reads to a reference genome. The solution relies on an advanced method of seed design that uses a faithful probabilistic model of read matches and, on the other hand, a novel seeding principle especially adapted to read mapping. Our method can handle both lossy and lossless frameworks and is able to distinguish, at the level of seed design, between SNPs and reading errors. We illustrate our approach by several seed designs and demonstrate their efficiency.

13.
Algorithms Mol Biol ; 5(1): 6, 2010 Jan 04.
Artigo em Inglês | MEDLINE | ID: mdl-20047662

RESUMO

BACKGROUND: Frameshift mutations in protein-coding DNA sequences produce a drastic change in the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin. Moreover, when a large number of substitutions are additionally involved in the divergence, the homology detection becomes difficult even at the DNA level. RESULTS: We developed a novel method to infer distant homology relations of two proteins, that accounts for frameshift and point mutations that may have affected the coding sequences. We design a dynamic programming alignment algorithm over memory-efficient graph representations of the complete set of putative DNA sequences of each protein, with the goal of determining the two putative DNA sequences which have the best scoring alignment under a powerful scoring system designed to reflect the most probable evolutionary process. Our implementation is freely available at [http://bioinfo.lifl.fr/path/]. CONCLUSIONS: Our approach allows to uncover evolutionary information that is not captured by traditional alignment methods, which is confirmed by biologically significant examples.

14.
Artigo em Inglês | MEDLINE | ID: mdl-19644175

RESUMO

We apply the concept of subset seeds to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets. We then perform a comparative analysis of seeds built over those alphabets and compare them with the standard Blastp seeding method, as well as with the family of vector seeds. While the formalism of subset seeds is less expressive (but less costly to implement) than the cumulative principle used in Blastp and vector seeds, our seeds show a similar or even better performance than Blastp on Bernoulli models of proteins compatible with the common BLOSUM62 matrix. Finally, we perform a large-scale benchmarking of our seeds against several main databases of protein alignments. Here again, the results show a comparable or better performance of our seeds versus Blastp.


Assuntos
Biologia Computacional/métodos , Proteínas/química , Proteínas/genética , Algoritmos , Sequência de Aminoácidos , Aminoácidos , Análise por Conglomerados , Modelos Biológicos , Curva ROC , Alinhamento de Sequência , Terminologia como Assunto
15.
BMC Bioinformatics ; 9: 534, 2008 Dec 16.
Artigo em Inglês | MEDLINE | ID: mdl-19087280

RESUMO

BACKGROUND: Similarity inference, one of the main bioinformatics tasks, has to face an exponential growth of the biological data. A classical approach used to cope with this data flow involves heuristics with large seed indexes. In order to speed up this technique, the index can be enhanced by storing additional information to limit the number of random memory accesses. However, this improvement leads to a larger index that may become a bottleneck. In the case of protein similarity search, we propose to decrease the index size by reducing the amino acid alphabet. RESULTS: The paper presents two main contributions. First, we show that an optimal neighborhood indexing combining an alphabet reduction and a longer neighborhood leads to a reduction of 35% of memory involved into the process, without sacrificing the quality of results nor the computational time. Second, our approach led us to develop a new kind of substitution score matrices and their associated e-value parameters. In contrast to usual matrices, these matrices are rectangular since they compare amino acid groups from different alphabets. We describe the method used for computing those matrices and we provide some typical examples that can be used in such comparisons. Supplementary data can be found on the website http://bioinfo.lifl.fr/reblosum. CONCLUSION: We propose a practical index size reduction of the neighborhood data, that does not negatively affect the performance of large-scale search in protein sequences. Such an index can be used in any study involving large protein data. Moreover, rectangular substitution score matrices and their associated statistical parameters can have applications in any study involving an alphabet reduction.


Assuntos
Indexação e Redação de Resumos/métodos , Algoritmos , Biologia Computacional/métodos , Alinhamento de Sequência/métodos , Análise de Sequência de Proteína/métodos , Bases de Dados de Proteínas , Armazenamento e Recuperação da Informação , Proteínas/química
16.
Trends Genet ; 23(11): 543-6, 2007 Nov.
Artigo em Inglês | MEDLINE | ID: mdl-17964682

RESUMO

By conventional wisdom, a feature that occurs too often or too rarely in a genome can indicate a functional element. To infer functionality from frequency, it is crucial to precisely characterize occurrences in randomly evolving DNA. We find that the frequency of oligonucleotides in a genomic sequence follows primarily a Pareto-lognormal distribution, which encapsulates lognormal and power-law features found across all known genomes. Such a distribution could be the result of completely random evolution by a copying process. Our characterization of the entire frequency distribution of genomic words opens a way to a more accurate reasoning about their over- and underrepresentation in genomic sequences.


Assuntos
Genômica , Animais , Evolução Molecular , Duplicação Gênica , Genoma , Humanos , Cadeias de Markov , Oligonucleotídeos/metabolismo
17.
J Bioinform Comput Biol ; 4(2): 553-69, 2006 Apr.
Artigo em Inglês | MEDLINE | ID: mdl-16819802

RESUMO

We propose a general approach to compute the seed sensitivity, that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem--a set of target alignments, an associated probability distribution, and a seed model--that are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be efficiently designed using our approach, and can then be used in similarity search producing better results than ordinary spaced seeds.


Assuntos
Algoritmos , Inteligência Artificial , Reconhecimento Automatizado de Padrão/métodos , Alinhamento de Sequência/métodos , Análise de Sequência de DNA/métodos , Sequência de Bases , Dados de Sequência Molecular , Reprodutibilidade dos Testes , Sensibilidade e Especificidade
18.
Nucleic Acids Res ; 33(Web Server issue): W540-3, 2005 Jul 01.
Artigo em Inglês | MEDLINE | ID: mdl-15980530

RESUMO

YASS is a DNA local alignment tool based on an efficient and sensitive filtering algorithm. It applies transition-constrained seeds to specify the most probable conserved motifs between homologous sequences, combined with a flexible hit criterion used to identify groups of seeds that are likely to exhibit significant alignments. A web interface (http://www.loria.fr/projects/YASS/) is available to upload input sequences in fasta format, query the program and visualize the results obtained in several forms (dot-plot, tabular output and others). A standalone version is available for download from the web page.


Assuntos
Alinhamento de Sequência/métodos , Análise de Sequência de DNA/métodos , Homologia de Sequência do Ácido Nucleico , Software , Algoritmos , Gráficos por Computador , Genômica/métodos , Internet , Interface Usuário-Computador
19.
Artigo em Inglês | MEDLINE | ID: mdl-17044164

RESUMO

We study a method of seed-based lossless filtration for approximate string matching and related bioinformatics applications. The method is based on a simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt and Kärkkäinen. We present algorithms to compute several important parameters of seed families, study their combinatorial properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed technique to the problem of oligonucleotide selection for an EST sequence database.


Assuntos
Algoritmos , Etiquetas de Sequências Expressas , Alinhamento de Sequência/métodos , Análise de Sequência de DNA/métodos , Sequência de Bases , Dados de Sequência Molecular , Homologia de Sequência do Ácido Nucleico
20.
BMC Bioinformatics ; 5: 149, 2004 Oct 14.
Artigo em Inglês | MEDLINE | ID: mdl-15485572

RESUMO

BACKGROUND: The hit criterion is a key component of heuristic local alignment algorithms. It specifies a class of patterns assumed to witness a potential similarity, and this choice is decisive for the selectivity and sensitivity of the whole method. RESULTS: In this paper, we propose two ways to improve the hit criterion. First, we define the group criterion combining the advantages of the single-seed and double-seed approaches used in existing algorithms. Second, we introduce transition-constrained seeds that extend spaced seeds by the possibility of distinguishing transition and transversion mismatches. We provide analytical data as well as experimental results, obtained with the YASS software, supporting both improvements. CONCLUSIONS: Proposed algorithmic ideas allow to obtain a significant gain in sensitivity of similarity search without increase in execution time. The method has been implemented in YASS software available at http://www.loria.fr/projects/YASS/.


Assuntos
DNA/genética , Alinhamento de Sequência/métodos , Alinhamento de Sequência/normas , Algoritmos , Animais , Cromossomos Humanos X/genética , DNA Bacteriano/genética , DNA Fúngico/genética , Drosophila/genética , Humanos , Cadeias de Markov , Modelos Estatísticos , Neisseria meningitidis/genética , Saccharomyces cerevisiae/genética
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA
...