Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 21
Filtrar
1.
BMC Bioinformatics ; 22(1): 51, 2021 Feb 06.
Artículo en Inglés | MEDLINE | ID: mdl-33549041

RESUMEN

BACKGROUND: An inverted repeat is a DNA sequence followed downstream by its reverse complement, potentially with a gap in the centre. Inverted repeats are found in both prokaryotic and eukaryotic genomes and they have been linked with countless possible functions. Many international consortia provide a comprehensive description of common genetic variation making alternative sequence representations, such as IUPAC encoding, necessary for leveraging the full potential of such broad variation datasets. RESULTS: We present IUPACPAL, an exact tool for efficient identification of inverted repeats in IUPAC-encoded DNA sequences allowing also for potential mismatches and gaps in the inverted repeats. CONCLUSION: Within the parameters that were tested, our experimental results show that IUPACPAL compares favourably to a similar application packaged with EMBOSS. We show that IUPACPAL identifies many previously unidentified inverted repeats when compared with EMBOSS, and that this is also performed with orders of magnitude improved speed.


Asunto(s)
Genoma , Células Procariotas , Secuencias Repetitivas de Ácidos Nucleicos , Secuencia de Bases , Secuencias Invertidas Repetidas , Secuencias Repetitivas de Ácidos Nucleicos/genética
2.
Bioinformatics ; 36(12): 3687-3692, 2020 06 01.
Artículo en Inglés | MEDLINE | ID: mdl-32246826

RESUMEN

MOTIVATION: Computing the uniqueness of k-mers for each position of a genome while allowing for up to e mismatches is computationally challenging. However, it is crucial for many biological applications such as the design of guide RNA for CRISPR experiments. More formally, the uniqueness or (k, e)-mappability can be described for every position as the reciprocal value of how often this k-mer occurs approximately in the genome, i.e. with up to e mismatches. RESULTS: We present a fast method GenMap to compute the (k, e)-mappability. We extend the mappability algorithm, such that it can also be computed across multiple genomes where a k-mer occurrence is only counted once per genome. This allows for the computation of marker sequences or finding candidates for probe design by identifying approximate k-mers that are unique to a genome or that are present in all genomes. GenMap supports different formats such as binary output, wig and bed files as well as csv files to export the location of all approximate k-mers for each genomic position. AVAILABILITY AND IMPLEMENTATION: GenMap can be installed via bioconda. Binaries and C++ source code are available on https://github.com/cpockrandt/genmap.


Asunto(s)
Genoma , Programas Informáticos , Algoritmos , Genómica , Análisis de Secuencia de ADN
3.
Genomics ; 102(4): 223-8, 2013 Oct.
Artículo en Inglés | MEDLINE | ID: mdl-23831115

RESUMEN

The study of DNA sequence variation has been transformed by recent advances in DNA sequencing technologies. Determination of the functional consequences of sequence variant alleles offers potential insight as to how genotype may influence phenotype. Even within protein coding regions of the genome, establishing the consequences of variation on gene and protein function is challenging and requires substantial laboratory investigation. However, a series of bioinformatics tools have been developed to predict whether non-synonymous variants are neutral or disease-causing. In this study we evaluate the performance of nine such methods (SIFT, PolyPhen2, SNPs&GO, PhD-SNP, PANTHER, Mutation Assessor, MutPred, Condel and CAROL) and developed CoVEC (Consensus Variant Effect Classification), a tool that integrates the prediction results from four of these methods. We demonstrate that the CoVEC approach outperforms most individual methods and highlights the benefit of combining results from multiple tools.


Asunto(s)
Secuencia de Bases , Biología Computacional/métodos , Variación Genética , Algoritmos , Animales , Genoma , Genotipo , Humanos , Sistemas de Lectura Abierta , Fenotipo , Polimorfismo de Nucleótido Simple
4.
Genomics ; 101(2): 120-4, 2013 Feb.
Artículo en Inglés | MEDLINE | ID: mdl-23195409

RESUMEN

Several studies on adult tissues agree on the presence of a positive effect of the genomic and genic base composition on mammalian gene expression. Recent literature supports the idea that during developmental processes GC-poor genomic regions are preferentially implicated. We investigate the relationship between the compositional properties of the isochores and of the genes with their respective expression activity during developmental processes. Using RNA-seq data from two distinct developmental stages of the mouse cortex, embryonic day 18 (E18) and postnatal day 7 (P7), we established for the first time a developmental-related transcriptome map of the mouse isochores. Additionally, for each stage we estimated the correlation between isochores' GC level and their expression activity, and the genes' expression patterns for each isochore family. Our analyses add evidence supporting the idea that during development GC-poor isochores are preferentially implicated, and confirm the positive effect of genes' GC level on their expression activity.


Asunto(s)
Encéfalo/metabolismo , Regulación del Desarrollo de la Expresión Génica , Isocoras/genética , Transcriptoma , Animales , Composición de Base , Encéfalo/embriología , Mapeo Cromosómico , Embrión de Mamíferos , Biblioteca de Genes , Ratones , Análisis de Secuencia de ARN
5.
BMC Bioinformatics ; 14 Suppl 8: S2, 2013.
Artículo en Inglés | MEDLINE | ID: mdl-23815711

RESUMEN

A weighted biological sequence is a string in which a set of characters may appear at each position with respective probabilities of occurrence. We attempt to locate all the tandem repeats in a weighted sequence. A repeated substring is called a tandem repeat if each occurrence of the substring is directly adjacent to each other. By introducing the idea of equivalence classes in weighted sequences, we identify the tandem repeats of every possible length using an iterative partitioning technique. We also present the algorithm for recording the tandem repeats, and prove that the problem can be solved in O(n²) time.


Asunto(s)
Algoritmos , Alineación de Secuencia/métodos , Secuencias Repetidas en Tándem , Proteínas/química , Proteínas/genética
6.
BMC Genomics ; 12: 511, 2011 Oct 17.
Artículo en Inglés | MEDLINE | ID: mdl-22004510

RESUMEN

BACKGROUND: The availability of fully sequenced genomes and the implementation of transcriptome technologies have increased the studies investigating the expression profiles for a variety of tissues, conditions, and species. In this study, using RNA-seq data for three distinct tissues (brain, liver, and muscle), we investigate how base composition affects mammalian gene expression, an issue of prime practical and evolutionary interest. RESULTS: We present the transcriptome map of the mouse isochores (DNA segments with a fairly homogeneous base composition) for the three different tissues and the effects of isochores' base composition on their expression activity. Our analyses also cover the relations between the genes' expression activity and their localization in the isochore families. CONCLUSIONS: This study is the first where next-generation sequencing data are used to associate the effects of both genomic and genic compositional properties to their corresponding expression activity. Our findings confirm previous results, and further support the existence of a relationship between isochores and gene expression. This relationship corroborates that isochores are primarily a product of evolutionary adaptation rather than a simple by-product of neutral evolutionary processes.


Asunto(s)
Isocoras/genética , Transcriptoma , Animales , Composición de Base , Encéfalo/metabolismo , Genoma , Hígado/metabolismo , Ratones , Ratones Endogámicos C57BL , Músculos/metabolismo , Análisis de Secuencia de ARN
7.
Adv Exp Med Biol ; 680: 399-403, 2010.
Artículo en Inglés | MEDLINE | ID: mdl-20865524

RESUMEN

Novel high-throughput (Deep) sequencing technology methods have redefined the way genome sequencing is performed. They are able to produce tens of millions of short sequences (reads) in a single experiment and with a much lower cost than previous sequencing methods. In this paper, we present a new algorithm for addressing the problem of efficiently mapping millions of short reads to a reference genome. In particular, we define and solve the Massive Approximate Pattern Matching problem for mapping short sequences to a reference genome.


Asunto(s)
Algoritmos , Mapeo Cromosómico/estadística & datos numéricos , Genómica/estadística & datos numéricos , Alineación de Secuencia/estadística & datos numéricos , Animales , Biología Computacional , Ratones , Reconocimiento de Normas Patrones Automatizadas/estadística & datos numéricos , ARN/genética , Cromosoma X/genética
8.
J Comput Biol ; 14(7): 1001-10, 2007 Sep.
Artículo en Inglés | MEDLINE | ID: mdl-17803376

RESUMEN

The number of segregating sites provides an indicator of the degree of DNA sequence variation that is present in a sample, and has been of great interest to the biological, pharmaceutical and medical professions. In this paper, we first provide linear- and expected-sublinear-time algorithms for finding all the segregating sites of a given set of DNA sequences. We also describe a data structure for tracking segregating sites in a set of sequences, such that every time the set is updated with the insertion of a new sequence or removal of an existing one, the segregating sites are updated accordingly without the need to re-scan the entire set of sequences.


Asunto(s)
Algoritmos , Secuencia de Bases , Genoma , Variación Genética , Análisis de Secuencia de ADN
9.
Algorithms Mol Biol ; 12: 5, 2017.
Artículo en Inglés | MEDLINE | ID: mdl-28293277

RESUMEN

BACKGROUND: The deviation of the observed frequency of a word w from its expected frequency in a given sequence x is used to determine whether or not the word is avoided. This concept is particularly useful in DNA linguistic analysis. The value of the deviation of w, denoted by [Formula: see text], effectively characterises the extent of a word by its edge contrast in the context in which it occurs. A word w of length [Formula: see text] is a [Formula: see text]-avoided word in x if [Formula: see text], for a given threshold [Formula: see text]. Notice that such a word may be completely absent from x. Hence, computing all such words naïvely can be a very time-consuming procedure, in particular for large k. RESULTS: In this article, we propose an [Formula: see text]-time and [Formula: see text]-space algorithm to compute all [Formula: see text]-avoided words of length k in a given sequence of length n over a fixed-sized alphabet. We also present a time-optimal [Formula: see text]-time algorithm to compute all [Formula: see text]-avoided words (of any length) in a sequence of length n over an integer alphabet of size [Formula: see text]. In addition, we provide a tight asymptotic upper bound for the number of [Formula: see text]-avoided words over an integer alphabet and the expected length of the longest one. We make available an implementation of our algorithm. Experimental results, using both real and synthetic data, show the efficiency and applicability of our implementation in biological sequence analysis. CONCLUSIONS: The systematic search for avoided words is particularly useful for biological sequence analysis. We present a linear-time and linear-space algorithm for the computation of avoided words of length k in a given sequence x. We suggest a modification to this algorithm so that it computes all avoided words of x, irrespective of their length, within the same time complexity. We also present combinatorial results with regards to avoided words and absent words.

10.
IEEE Trans Nanobioscience ; 15(2): 93-100, 2016 03.
Artículo en Inglés | MEDLINE | ID: mdl-26992174

RESUMEN

This paper deals with the approximate version of the circular pattern matching (ACPM) problem, which appears as an interesting problem in many biological contexts. The circular pattern matching problem consists in finding all occurrences of the rotations of a pattern P of length m in a text T of length n. In ACPM, we consider occurrences with k -mismatches under the Hamming distance model. In this paper, we present a simple and fast filter-based algorithm to solve the ACPM problem. We compare our algorithm with the state of the art algorithms and the results are found to be excellent. In particular, our algorithm runs almost twice as fast than the state of the art. Much of the efficiency of our algorithm can be attributed to its filters that are effective but extremely simple and lightweight.


Asunto(s)
Algoritmos , Biología Computacional/métodos , ADN Circular , Reconocimiento de Normas Patrones Automatizadas/métodos , Análisis de Secuencia de ADN/métodos , ADN Circular/análisis , ADN Circular/química , ADN Circular/genética
11.
FEBS Lett ; 590(14): 2297-306, 2016 07.
Artículo en Inglés | MEDLINE | ID: mdl-27279593

RESUMEN

This work investigates the role of isochores during preimplantation process. Using RNA-seq data from human and mouse preimplantation stages, we created the spatio-temporal transcriptional profiles of the isochores during preimplantation. We found that from early to late stages, GC-rich isochores increase their expression while GC-poor ones decrease it. Network analysis revealed that modules with few coexpressed isochores are GC-poorer than medium-large ones, characterized by an opposite expression as preimplantation advances, decreasing and increasing respectively. Our results reveal a functional contribution of the isochores, supporting the presence of structural-functional interactions during maturation and early-embryonic development.


Asunto(s)
Blastocisto/metabolismo , Regulación del Desarrollo de la Expresión Génica/fisiología , Isocoras/metabolismo , Transcriptoma/fisiología , Animales , Humanos , Ratones , Especificidad de la Especie
12.
Algorithms Mol Biol ; 11: 12, 2016.
Artículo en Inglés | MEDLINE | ID: mdl-27168761

RESUMEN

BACKGROUND: Sequence comparison is a fundamental step in many important tasks in bioinformatics; from phylogenetic reconstruction to the reconstruction of genomes. Traditional algorithms for measuring approximation in sequence comparison are based on the notions of distance or similarity, and are generally computed through sequence alignment techniques. As circular molecular structure is a common phenomenon in nature, a caveat of the adaptation of alignment techniques for circular sequence comparison is that they are computationally expensive, requiring from super-quadratic to cubic time in the length of the sequences. RESULTS: In this paper, we introduce a new distance measure based on q-grams, and show how it can be applied effectively and computed efficiently for circular sequence comparison. Experimental results, using real DNA, RNA, and protein sequences as well as synthetic data, demonstrate orders-of-magnitude superiority of our approach in terms of efficiency, while maintaining an accuracy very competitive to the state of the art.

13.
Algorithms Mol Biol ; 11: 21, 2016.
Artículo en Inglés | MEDLINE | ID: mdl-27471546

RESUMEN

[This corrects the article DOI: 10.1186/s13015-016-0076-6.].

14.
Int J Genomics ; 2015: 259320, 2015.
Artículo en Inglés | MEDLINE | ID: mdl-26557649

RESUMEN

This paper deals with the circular pattern matching (CPM) problem, which appears as an interesting problem in many biological contexts. CPM consists in finding all occurrences of the rotations of a pattern 𝒫 of length m in a text 𝒯 of length n. In this paper, we present SimpLiFiCPM (pronounced "Simplify CPM"), a simple and lightweight filter-based algorithm to solve the problem. We compare our algorithm with the state-of-the-art algorithms and the results are found to be excellent. Much of the speed of our algorithm comes from the fact that our filters are effective but extremely simple and lightweight.

15.
Algorithms Mol Biol ; 9: 21, 2014.
Artículo en Inglés | MEDLINE | ID: mdl-25221616

RESUMEN

BACKGROUND: Tandem duplication, in the context of molecular biology, occurs as a result of mutational events in which an original segment of DNA is converted into a sequence of individual copies. More formally, a repetition or tandem repeat in a string of letters consists of exact concatenations of identical factors of the string. Biologists are interested in approximate tandem repeats and not necessarily only in exact tandem repeats. A weighted sequence is a string in which a set of letters may occur at each position with respective probabilities of occurrence. It naturally arises in many biological contexts and provides a method to realise the approximation among distinct adjacent occurrences of the same DNA segment. RESULTS: Crochemore's repetitions algorithm, also referred to as Crochemore's partitioning algorithm, was introduced in 1981, and was the first optimal [Formula: see text]-time algorithm to compute all repetitions in a string of length n. In this article, we present a novel variant of Crochemore's partitioning algorithm for weighted sequences, which requires optimal [Formula: see text] time, thus improving on the best known [Formula: see text]-time algorithm (Zhang et al., 2013) for computing all repetitions in a weighted sequence of length n.

16.
Algorithms Mol Biol ; 9(1): 9, 2014 Mar 22.
Artículo en Inglés | MEDLINE | ID: mdl-24656145

RESUMEN

BACKGROUND: Circular string matching is a problem which naturally arises in many biological contexts. It consists in finding all occurrences of the rotations of a pattern of length m in a text of length n. There exist optimal average-case algorithms for exact circular string matching. Approximate circular string matching is a rather undeveloped area. RESULTS: In this article, we present a suboptimal average-case algorithm for exact circular string matching requiring time O(n). Based on our solution for the exact case, we present two fast average-case algorithms for approximate circular string matching with k-mismatches, under the Hamming distance model, requiring time O(n) for moderate values of k, that is k=O(m/logm). We show how the same results can be easily obtained under the edit distance model. The presented algorithms are also implemented as library functions. Experimental results demonstrate that the functions provided in this library accelerate the computations by more than three orders of magnitude compared to a naïve approach. CONCLUSIONS: We present two fast average-case algorithms for approximate circular string matching with k-mismatches; and show that they also perform very well in practice. The importance of our contribution is underlined by the fact that the provided functions may be seamlessly integrated into any biological pipeline. The source code of the library is freely available at http://www.inf.kcl.ac.uk/research/projects/asmf/.

17.
Int J Comput Biol Drug Des ; 6(1-2): 119-30, 2013.
Artículo en Inglés | MEDLINE | ID: mdl-23428478

RESUMEN

In this paper, we present a solution to the extreme similarity sequencing problem. The extreme similarity sequencing problem consists of finding occurrences of a pattern p in a set S(0), S(1), …, S(k), of sequences of equal length, where S(i), for all 1≤i≤k, differs from S(0) by a constant number of errors - around 10 in practice. We present an asymptotically fast O(n + occ logocc) time algorithm, as well as a practical O(nk/w) time algorithm for solving this problem, where n is the length of a sequence, occ is the number of candidate occurrences reported by our technique, w is the size of the machine word, and the total number of errors is bounded by k - the number of sequences.


Asunto(s)
Algoritmos , Análisis de Secuencia de ADN/métodos , Homología de Secuencia de Ácido Nucleico , Biología Computacional , Secuenciación de Nucleótidos de Alto Rendimiento
18.
Recent Pat DNA Gene Seq ; 7(2): 84-95, 2013 Aug.
Artículo en Inglés | MEDLINE | ID: mdl-22974258

RESUMEN

MOTIVATION: Pairwise sequence alignment has received a new motivation due to the advent of recent patents in next-generation sequencing technologies, particularly so for the application of re-sequencing---the assembly of a genome directed by a reference sequence. After the fast alignment between a factor of the reference sequence and a high-quality fragment of a short read by a short-read alignment programme, an important problem is to find the alignment between a relatively short succeeding factor of the reference sequence and the remaining low-quality part of the read allowing a number of mismatches and the insertion of a single gap in the alignment. RESULTS: We present GapMis, a tool for pairwise sequence alignment with a single gap. It is based on a simple algorithm, which computes a different version of the traditional dynamic programming matrix. The presented experimental results demonstrate that GapMis is more suitable and efficient than most popular tools for this task.


Asunto(s)
Algoritmos , Programas Informáticos , Secuenciación de Nucleótidos de Alto Rendimiento , Patentes como Asunto , Alineación de Secuencia
20.
Protein Pept Lett ; 17(9): 1136-42, 2010 Sep.
Artículo en Inglés | MEDLINE | ID: mdl-20509856

RESUMEN

A weighted sequence is a string in which a set of characters may appear at each position with respective probabilities of occurrence. Weighted sequences are able to summarize poorly defined short sequences, as well as the profiles of protein families and complete chromosome sequences. Thus it is of biological and theoretical significance to design powerful algorithms on weighted sequences. A common task is to identify repetitive motifs in weighted sequences, with presence probability not less than a given threshold. We define two types of repeats in weighted sequences, called the loose repeats and the strict repeats, respectively, and then attempt to locate these repeats. Using an iterative partitioning technique, we present algorithms for computing all the loose repeats and strict repeats of every length, respectively. Each solution costs O(n(2)) time.


Asunto(s)
Biología Computacional/métodos , Proteínas/química , Algoritmos , Secuencias Repetitivas de Aminoácido
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA