Search | VHL Regional Portal

On pattern matching with k mismatches and few don't cares.

Nicolae, Marius; Rajasekaran, Sanguthevar.

Inf Process Lett ; 118: 78-82, 2017 Feb.

Article in English | MEDLINE | ID: mdl-28630523

ABSTRACT

We consider the problem of pattern matching with k mismatches, where there can be don't care or wild card characters in the pattern. Specifically, given a pattern P of length m and a text T of length n, we want to find all occurrences of P in T that have no more than k mismatches. The pattern can have don't care characters, which match any character. Without don't cares, the best known algorithm for pattern matching with k mismatches has a runtime of [Formula: see text]. With don't cares in the pattern, the best deterministic algorithm has a runtime of O(nk polylog m). Therefore, there is an important gap between the versions with and without don't cares. In this paper we give an algorithm whose runtime increases with the number of don't cares. We define an island to be a maximal length substring of P that does not contain don't cares. Let q be the number of islands in P. We present an algorithm that runs in [Formula: see text] time. If the number of islands q is O(k), this runtime becomes [Formula: see text], which essentially matches the best known runtime for pattern matching with k mismatches without don't cares. If the number of islands q is O(k^2), this algorithm is asymptotically faster than the previous best algorithm for pattern matching with k mismatches with don't cares in the pattern.
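To make the problem statement concrete, the following is a minimal sketch of the definitions only, not the algorithm from the paper. It assumes `?` as the don't care symbol (an illustrative choice) and uses a naive O(nm) scan; the function names `islands` and `k_mismatch_occurrences` are hypothetical.

```python
def islands(pattern, wildcard="?"):
    """Return the maximal wildcard-free substrings of pattern as (start, text) pairs."""
    result = []
    start = None
    for i, ch in enumerate(pattern):
        if ch != wildcard:
            if start is None:
                start = i  # a new island begins here
        elif start is not None:
            result.append((start, pattern[start:i]))
            start = None
    if start is not None:
        result.append((start, pattern[start:]))
    return result


def k_mismatch_occurrences(text, pattern, k, wildcard="?"):
    """Naive scan: start positions where pattern aligns with at most k mismatches.

    Wildcard positions in the pattern match any text character.
    """
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):
        mismatches = 0
        for j in range(m):
            if pattern[j] != wildcard and pattern[j] != text[i + j]:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment already exceeds the budget
        if mismatches <= k:
            hits.append(i)
    return hits
```

Here the number of islands q is simply `len(islands(pattern))`; the paper's contribution is a much faster algorithm whose runtime depends on q, which this brute-force version does not attempt to reproduce.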

Homology-Aware Phylogenomics at Gigabase Scales.

Sanderson, M J; Nicolae, Marius; McMahon, M M.

Syst Biol ; 66(4): 590-603, 2017 07 01.

Article in English | MEDLINE | ID: mdl-28123115

ABSTRACT

Obstacles to inferring species trees from whole genome data sets range from algorithmic and data management challenges to the wholesale discordance in evolutionary history found in different parts of a genome. Recent work that builds trees directly from genomes by parsing them into sets of small k-mer strings holds promise to streamline and simplify these efforts, but existing approaches do not account well for gene tree discordance. We describe a "seed and extend" protocol that finds nearly exact matching sets of orthologous k-mers and extends them to construct data sets that can properly account for genomic heterogeneity. Exploiting an efficient suffix array data structure, sets of whole genomes can be parsed and converted into phylogenetic data matrices rapidly, with contiguous blocks of k-mers from the same chromosome, gene, or scaffold concatenated as needed. Phylogenetic trees constructed from highly curated rice genome data and a diverse set of six other eukaryotic whole genome, transcriptome, and organellar genome data sets recovered trees nearly identical to published phylogenomic analyses, in a small fraction of the time, and requiring many fewer parameter choices. Our method's ability to retain local homology information was demonstrated by using it to characterize gene tree discordance across the rice genome, and by its robustness to the high rate of interchromosomal gene transfer found in several rice species.

Subject(s)

Classification/methods , Genomics , Phylogeny , Gene Transfer, Horizontal/genetics , Oryza/classification , Oryza/genetics

LFQC: a lossless compression algorithm for FASTQ files.

Nicolae, Marius; Pathak, Sudipta; Rajasekaran, Sanguthevar.

Bioinformatics ; 31(20): 3276-81, 2015 Oct 15.

Article in English | MEDLINE | ID: mdl-26093148

ABSTRACT

MOTIVATION: Next Generation Sequencing (NGS) technologies have revolutionized genomic research by reducing the cost of whole genome sequencing. One of the biggest challenges posed by modern sequencing technology is economic storage of NGS data. Storing raw data is infeasible because of its enormous size and high redundancy. In this article, we address the problem of storage and transmission of large FASTQ files using innovative compression techniques. RESULTS: We introduce a new lossless non-reference based FASTQ compression algorithm named Lossless FASTQ Compressor. We have compared our algorithm with other state of the art big data compression algorithms namely gzip, bzip2, fastqz (Bonfield and Mahoney, 2013), fqzcomp (Bonfield and Mahoney, 2013), Quip (Jones et al., 2012), DSRC2 (Roguski and Deorowicz, 2014). This comparison reveals that our algorithm achieves better compression ratios on LS454 and SOLiD datasets. AVAILABILITY AND IMPLEMENTATION: The implementations are freely available for non-commercial purposes. They can be downloaded from http://engr.uconn.edu/rajasek/lfqc-v1.1.zip. CONTACT: rajasek@engr.uconn.edu.

Subject(s)

Algorithms , Data Compression/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Genomics , Information Storage and Retrieval

Corrigendum: qPMS9: an efficient algorithm for quorum planted motif search.

Nicolae, Marius; Rajasekaran, Sanguthevar.

Sci Rep ; 5: 9544, 2015 Mar 27.

Article in English | MEDLINE | ID: mdl-25827343

qPMS9: an efficient algorithm for quorum Planted Motif Search.

Nicolae, Marius; Rajasekaran, Sanguthevar.

Sci Rep ; 5: 7813, 2015 Jan 15.

Article in English | MEDLINE | ID: mdl-25589474

ABSTRACT

Discovering patterns in biological sequences is a crucial problem. For example, the identification of patterns in DNA sequences has resulted in the determination of open reading frames, identification of gene promoter elements, intron/exon splicing sites, and SH RNAs, location of RNA degradation signals, identification of alternative splicing sites, etc. In protein sequences, patterns have led to domain identification, location of protease cleavage sites, identification of signal peptides, protein interactions, determination of protein degradation elements, identification of protein trafficking elements, discovery of short functional motifs, etc. In this paper we focus on the identification of an important class of patterns, namely, motifs. We study the (ℓ, d) motif search problem or Planted Motif Search (PMS). PMS receives as input n strings and two integers ℓ and d. It returns all sequences M of length ℓ that occur in each input string, where each occurrence differs from M in at most d positions. Another formulation is quorum PMS (qPMS), where the motif appears in at least q% of the strings. We introduce qPMS9, a parallel exact qPMS algorithm that offers significant runtime improvements on DNA and protein datasets. qPMS9 solves the challenging DNA (ℓ, d)-instances (28, 12) and (30, 13). The source code is available at https://code.google.com/p/qpms9/.
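The PMS and qPMS definitions above can be illustrated with a brute-force sketch. This is not qPMS9: it enumerates every candidate string of the given length over the alphabet, which is feasible only for very small instances. The function names and the default DNA alphabet are assumptions made for the example.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif, s, d):
    """True if some substring of s with len(motif) characters is within distance d."""
    l = len(motif)
    return any(hamming(motif, s[i:i + l]) <= d for i in range(len(s) - l + 1))


def qpms_bruteforce(strings, l, d, q_percent=100, alphabet="ACGT"):
    """Return all length-l motifs occurring (up to d mismatches) in >= q% of strings.

    With q_percent=100 this is plain PMS. Runtime is exponential in l.
    """
    needed = -(-q_percent * len(strings) // 100)  # integer ceiling of q% of the strings
    motifs = []
    for cand in product(alphabet, repeat=l):
        motif = "".join(cand)
        count = sum(occurs_within(motif, s, d) for s in strings)
        if count >= needed:
            motifs.append(motif)
    return motifs
```

For example, `qpms_bruteforce(["ACGT", "ACGA", "TTTT"], 3, 0, q_percent=60)` keeps only motifs found exactly in at least two of the three strings.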

Subject(s)

Algorithms , Software , DNA/chemistry , Internet , Proteins/chemistry

An Elegant Algorithm for the Construction of Suffix Arrays.

Rajasekaran, Sanguthevar; Nicolae, Marius.

J Discrete Algorithms (Amst) ; 27: 21-28, 2014 Jul 01.

Article in English | MEDLINE | ID: mdl-25045344

ABSTRACT

The suffix array is a data structure that finds numerous applications in string processing problems for both linguistic texts and biological data. It has been introduced as a memory efficient alternative for suffix trees. The suffix array consists of the sorted suffixes of a string. There are several linear time suffix array construction algorithms (SACAs) known in the literature. However, one of the fastest algorithms in practice has a worst case run time of O(n^2). The problem of designing practically and theoretically efficient techniques remains open. In this paper we present an elegant algorithm for suffix array construction which takes linear time with high probability; the probability is on the space of all possible inputs. Our algorithm is one of the simplest of the known SACAs and it opens up a new dimension of suffix array construction that has not been explored until now. Our algorithm is easily parallelizable. We offer parallel implementations on various parallel models of computing. We prove a lemma on the ℓ-mers of a random string which might find independent applications. We also present another algorithm that utilizes the above algorithm. This algorithm is called RadixSA and has a worst case run time of O(n log n). RadixSA introduces an idea that may find independent applications as a speedup technique for other SACAs. An empirical comparison of RadixSA with other algorithms on various datasets reveals that our algorithm is one of the fastest algorithms to date. The C++ source code is freely available at http://www.engr.uconn.edu/~man09004/radixSA.zip.
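As a reminder of what a suffix array contains, here is a deliberately naive construction that sorts suffix start positions by comparing the suffixes directly. It is not RadixSA or any of the linear-time SACAs mentioned above; the function name is hypothetical and the approach is practical only for short strings.

```python
def suffix_array_naive(s):
    """Return the start positions of the suffixes of s in lexicographic order.

    Each comparison may inspect O(n) characters, so this is far slower than
    the dedicated construction algorithms discussed in the abstract.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


if __name__ == "__main__":
    text = "banana"
    for pos in suffix_array_naive(text):
        print(pos, text[pos:])
```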

Efficient sequential and parallel algorithms for planted motif search.

Nicolae, Marius; Rajasekaran, Sanguthevar.

BMC Bioinformatics ; 15: 34, 2014 Jan 31.

Article in English | MEDLINE | ID: mdl-24479443

ABSTRACT

BACKGROUND: Motif searching is an important step in the detection of rare events occurring in a set of DNA or protein sequences. One formulation of the problem is known as (l,d)-motif search or Planted Motif Search (PMS). In PMS we are given two integers l and d and n biological sequences. We want to find all sequences of length l that appear in each of the input sequences with at most d mismatches. The PMS problem is NP-complete. PMS algorithms are typically evaluated on certain instances considered challenging. Despite ample research in the area, a considerable performance gap exists because many state of the art algorithms have large runtimes even for moderately challenging instances. RESULTS: This paper presents a fast exact parallel PMS algorithm called PMS8. PMS8 is the first algorithm to solve the challenging (l,d) instances (25,10) and (26,11). PMS8 is also efficient on instances with larger l and d such as (50,21). We include a comparison of PMS8 with several state of the art algorithms on multiple problem instances. This paper also presents necessary and sufficient conditions for 3 l-mers to have a common d-neighbor. The program is freely available at http://engr.uconn.edu/~man09004/PMS8/. CONCLUSIONS: We present PMS8, an efficient exact algorithm for Planted Motif Search. PMS8 introduces novel ideas for generating common neighborhoods. We have also implemented a parallel version for this algorithm. PMS8 can solve instances not solved by any previous algorithms.
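The notion of a common d-neighbor can be checked by exhaustive search on tiny inputs. The sketch below is an assumption-laden illustration: it does not reproduce the paper's necessary and sufficient conditions for three l-mers, it simply tries every string of length l over a DNA alphabet. The function names are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def have_common_neighbor(lmers, d, alphabet="ACGT"):
    """True if some string is within Hamming distance d of every l-mer given.

    Exhaustive over len(alphabet) ** l candidates, so only usable for small l.
    """
    l = len(lmers[0])
    for cand in product(alphabet, repeat=l):
        candidate = "".join(cand)
        if all(hamming(candidate, x) <= d for x in lmers):
            return True
    return False
```

For instance, "AATT" and "TTAA" differ in four positions, so together with "AAAA" they have no common 1-neighbor, while "AAAA" itself is a common 2-neighbor.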

Subject(s)

Computational Biology/methods , Sequence Analysis, DNA/methods , Sequence Analysis, Protein/methods , Software , Algorithms , DNA/chemistry , DNA/genetics , Proteins/chemistry , Proteins/genetics

Estimation of alternative splicing isoform frequencies from RNA-Seq data.

Nicolae, Marius; Mangul, Serghei; Mandoiu, Ion I; Zelikovsky, Alex.

Algorithms Mol Biol ; 6(1): 9, 2011 Apr 19.

Article in English | MEDLINE | ID: mdl-21504602

ABSTRACT

BACKGROUND: Massively parallel whole transcriptome sequencing, commonly referred to as RNA-Seq, is quickly becoming the technology of choice for gene expression profiling. However, due to the short read length delivered by current sequencing technologies, estimation of expression levels for alternative splicing gene isoforms remains challenging. RESULTS: In this paper we present a novel expectation-maximization algorithm for inference of isoform- and gene-specific expression levels from RNA-Seq data. Our algorithm, referred to as IsoEM, is based on disambiguating information provided by the distribution of insert sizes generated during sequencing library preparation, and takes advantage of base quality scores, strand and read pairing information when available. The open source Java implementation of IsoEM is freely available at http://dna.engr.uconn.edu/software/IsoEM/. CONCLUSIONS: Empirical experiments on both synthetic and real RNA-Seq datasets show that IsoEM has scalable running time and outperforms existing methods of isoform and gene expression level estimation. Simulation experiments confirm previous findings that, for a fixed sequencing cost, using reads longer than 25-36 bases does not necessarily lead to better accuracy for estimating expression levels of annotated isoforms and genes.
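The general shape of an expectation-maximization estimate for isoform read fractions can be sketched as follows. This is a heavily simplified textbook-style model, not IsoEM: it ignores insert sizes, base qualities, strand and pairing information, and assumes each read comes with a list of compatible isoforms and that a read is equally likely to start anywhere along an isoform. All names and the input layout are assumptions made for illustration.

```python
def em_read_fractions(read_compat, lengths, iterations=200):
    """Estimate the fraction of reads originating from each isoform.

    read_compat: list of lists; each inner list names the isoforms a read fits.
    lengths: dict mapping isoform name to its length in bases.
    """
    isoforms = sorted(lengths)
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}  # uniform start
    for _ in range(iterations):
        expected = {iso: 0.0 for iso in isoforms}
        # E-step: split each read among compatible isoforms, weighting by the
        # current read fraction divided by isoform length.
        for compat in read_compat:
            weights = {iso: theta[iso] / lengths[iso] for iso in compat}
            total = sum(weights.values())
            if total == 0:
                continue  # read cannot be explained under current estimates
            for iso, w in weights.items():
                expected[iso] += w / total
        # M-step: renormalize expected read counts into fractions.
        assigned = sum(expected.values())
        if assigned == 0:
            break
        theta = {iso: expected[iso] / assigned for iso in isoforms}
    return theta
```

With only unambiguous reads the estimate equals the observed read proportions; ambiguous reads are progressively pulled toward the isoforms that the unambiguous reads support.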
