Results 1 - 20 of 86
1.
J Phys Chem A; 127(40): 8437-8446, 2023 Oct 12.
Article in English | MEDLINE | ID: mdl-37773038

ABSTRACT

Machine learning models are widely used in science and engineering to predict the properties of materials and solve complex problems. However, training large models can take days and fine-tuning hyperparameters can take months, making it challenging to achieve optimal performance. To address this issue, we propose a Knowledge Enhancing (KE) algorithm that transfers knowledge gained by a lower-capacity model to a higher-capacity model, improving training efficiency and performance. We focus on the problem of predicting the bandgap of an unknown material and present a theoretical analysis and experimental verification of our algorithm. Our experiments show that the performance of our knowledge enhancement model improves on current methods by at least 10.21% on OMDB datasets. We believe that our generic idea of knowledge enhancement will be useful for solving other problems and provides a promising direction for future research.
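The abstract does not spell out the KE update rule. As a rough analogy only, classic knowledge distillation transfers one model's softened predictions into another model of different capacity through a cross-entropy term; a minimal sketch (all logits and the temperature are illustrative, not from the paper):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened output distribution
    and the student's: the standard way to transfer what one model has
    learned into another of different capacity."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

teacher = [3.0, 1.0, 0.2]
# A student that agrees with the teacher incurs a lower loss than one
# that ranks the classes in the opposite order.
low = distillation_loss([3.0, 1.0, 0.2], teacher)
high = distillation_loss([0.2, 1.0, 3.0], teacher)
assert low < high
```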

2.
Hum Genomics; 15(1): 66, 2021 11 09.
Article in English | MEDLINE | ID: mdl-34753514

ABSTRACT

BACKGROUND: Nowadays we are observing an explosion of gene expression data with phenotypes. It enables us to accurately identify genes responsible for certain medical conditions as well as classify them as drug targets. Like other phenotype data in the medical domain, gene expression data with phenotypes suffer from being a highly underdetermined system. In domains with a very large set of features but a very small sample size (e.g., DNA microarray, RNA-seq, or GWAS data), it is often reported that several contrasting feature subsets may yield nearly equally optimal results. This phenomenon is known as instability. Considering these facts, we have developed a robust and stable supervised gene selection algorithm that selects, from gene expression datasets with phenotypes, a set of robust and stable genes with better predictive ability. Stability and robustness are ensured by class-level and instance-level perturbations, respectively. RESULTS: We have performed rigorous experimental evaluations using 10 real gene expression microarray datasets with phenotypes. They reveal that our algorithm outperforms the state-of-the-art algorithms with respect to stability and classification accuracy. We have also performed biological enrichment analysis based on gene ontology-biological processes (GO-BP) terms, disease ontology (DO) terms, and biological pathways. CONCLUSIONS: The performance evaluations make it clear that our proposed method is an effective and efficient supervised gene selection algorithm.
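The abstract does not define its stability measure; one common way to quantify the instability it describes is the mean pairwise Jaccard similarity of the gene sets selected under different data perturbations. A small sketch (selector runs and gene names are made up):

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two gene sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def selection_stability(selected_sets):
    """Mean pairwise Jaccard similarity of the gene sets chosen under
    different perturbations of the training data; 1.0 means the
    selector returns the same genes every time."""
    pairs = list(combinations(range(len(selected_sets)), 2))
    return sum(jaccard(selected_sets[i], selected_sets[j])
               for i, j in pairs) / len(pairs)

# Three perturbed runs of a hypothetical selector.
runs = [{"TP53", "BRCA1", "EGFR"},
        {"TP53", "BRCA1", "MYC"},
        {"TP53", "BRCA1", "EGFR"}]
stability = selection_stability(runs)  # (0.5 + 1.0 + 0.5) / 3
assert abs(stability - 2 / 3) < 1e-9
```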


Subjects
Algorithms; Machine Learning; Oligonucleotide Array Sequence Analysis/methods; Phenotype
3.
J Biomed Inform; 130: 104094, 2022 06.
Article in English | MEDLINE | ID: mdl-35550929

ABSTRACT

Record linkage is an important problem studied widely in many domains, including biomedical informatics. A standard version of this problem is to cluster records from several datasets such that each cluster has records pertinent to just one individual. Typically, datasets are huge, so existing record linkage algorithms take a very long time; it is thus essential to develop fast novel algorithms for record linkage. The incremental version of this problem is to link previously clustered records with new records added to the input datasets. A novel algorithm has been created to efficiently perform both standard and incremental record linkage. This algorithm leverages a set of efficient techniques that significantly restrict the number of record-pair comparisons and distance computations. Our algorithm shows an average speed-up of 2.4x (up to 4x) for the standard linkage problem compared to the state-of-the-art, with no drop in linkage performance. On average, our algorithm can incrementally link records in just 33% of the time required for linking them from scratch. Our algorithms achieve comparable or superior linkage performance and outperform the state-of-the-art in linking time in all cases where the number of comparison attributes is greater than two; in practice, more than two comparison attributes are quite common. The proposed algorithm is very efficient and could be used in practice for record linkage applications, especially when records are added over time and the linkage output needs to be updated frequently.
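The abstract's key idea of restricting record-pair comparisons is usually realized through blocking: group records by a cheap key and compare only within groups. A minimal sketch (the records and the blocking key are illustrative, not the paper's actual technique):

```python
from collections import defaultdict

def candidate_pairs(records, key_fn):
    """Blocking: group records by a cheap key and compare only within
    groups, the standard way to cut down the quadratic number of
    record-pair comparisons."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    pairs = []
    for block in blocks.values():
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                pairs.append((block[i], block[j]))
    return pairs

records = [("John", "Smith"), ("Jon", "Smith"),
           ("Mary", "Jones"), ("John", "Smyth")]
# Blocking on the first letter of the surname leaves 3 candidate pairs
# instead of the 6 an exhaustive all-pairs comparison would need.
pairs = candidate_pairs(records, key_fn=lambda r: r[1][0])
assert len(pairs) == 3
```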


Subjects
Algorithms; Medical Record Linkage; Medical Record Linkage/methods
4.
Bioinformatics; 35(9): e1-e7, 2019 05 01.
Article in English | MEDLINE | ID: mdl-31051040

ABSTRACT

MOTIVATION: Next-generation sequencing (NGS) technologies have revolutionized genomic research by reducing the cost of whole-genome sequencing. One of the biggest challenges posed by modern sequencing technology is economical storage of NGS data. Storing raw data is infeasible because of its enormous size and high redundancy. In this article, we address the problem of storage and transmission of large FASTQ files using innovative compression techniques. RESULTS: We introduce a new lossless non-reference-based FASTQ compression algorithm named Lossless FASTQ Compressor. We have compared our algorithm with other state-of-the-art big-data compression algorithms, namely gzip, bzip2, fastqz, fqzcomp, G-SQZ, SCALCE, Quip, DSRC, DSRC-LZ, etc. This comparison reveals that our algorithm achieves better compression ratios. The improvement obtained is up to 225%. For example, on one of the datasets (SRR065390_1), the average improvement (over all the algorithms compared) is 74.62%. AVAILABILITY AND IMPLEMENTATION: The implementations are freely available for non-commercial purposes. They can be downloaded from http://engr.uconn.edu/∼rajasek/FastqPrograms.zip.

5.
Bioinformatics; 35(17): 2932-2940, 2019 09 01.
Article in English | MEDLINE | ID: mdl-30649204

ABSTRACT

MOTIVATION: Metagenomics is the study of genetic material sampled directly from natural habitats. It has the potential to reveal previously hidden diversity of microscopic life, largely thanks to highly parallel, low-cost next-generation sequencing technology. Conventional approaches align metagenomic reads onto known reference genomes to identify microbes in the sample. Since such a collection of reference genomes is very large, this approach often needs high-end computing machines with large memory, which are not always available to researchers. Alternative approaches follow an alignment-free methodology in which the presence of a microbe is predicted using information about the unique k-mers present in the microbial genomes. However, such approaches suffer from high false-positive rates because the value of k must be traded off against computational resources. In this article, we propose a highly efficient metagenomic sequence classification (MSC) algorithm that is a hybrid of both approaches. Instead of aligning reads to the full genomes, MSC aligns reads onto a set of carefully chosen, shorter and highly discriminating model sequences built from the unique k-mers of each of the reference sequences. RESULTS: Microbiome researchers are generally interested in two objectives of a taxonomic classifier: (i) detecting prevalence, i.e. the taxa present in a sample, and (ii) estimating their relative abundances. MSC is primarily designed to detect prevalence, and experimental results show that MSC is indeed a more effective and efficient algorithm than the other state-of-the-art algorithms in terms of accuracy, memory and runtime. Moreover, MSC outputs an approximate estimate of the abundances. AVAILABILITY AND IMPLEMENTATION: The implementations are freely available for non-commercial purposes. They can be downloaded from https://drive.google.com/open?id=1XirkAamkQ3ltWvI1W1igYQFusp9DHtVl.
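The unique-k-mer idea the abstract builds on can be sketched in a few lines: index k-mers that occur in exactly one reference genome, then let each read vote for the genome owning most of its discriminating k-mers. This is a toy illustration of the alignment-free side only (genome names and sequences are made up; MSC itself additionally aligns reads to model sequences):

```python
def kmers(seq, k):
    """All k-length substrings of seq, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def unique_kmer_index(references, k):
    """Map each k-mer to the one reference genome it occurs in;
    k-mers shared by several genomes are dropped because they do
    not discriminate between taxa."""
    owner = {}
    for name, seq in references.items():
        for km in kmers(seq, k):
            owner[km] = name if km not in owner else None
    return {km: name for km, name in owner.items() if name is not None}

def classify_read(read, index, k):
    """Assign a read to the reference owning the most of the read's
    discriminating k-mers (majority vote); None if no k-mer hits."""
    votes = {}
    for km in kmers(read, k):
        if km in index:
            votes[index[km]] = votes.get(index[km], 0) + 1
    return max(votes, key=votes.get) if votes else None

refs = {"genomeA": "ACGTACGGT", "genomeB": "TTGACCAAG"}  # toy genomes
index = unique_kmer_index(refs, k=4)
assert classify_read("ACGTACG", index, k=4) == "genomeA"
```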


Subjects
Metagenome; Metagenomics; Sequence Analysis, DNA; Algorithms; High-Throughput Nucleotide Sequencing
6.
Nucleic Acids Res; 46(D1): D465-D470, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29140456

ABSTRACT

Minimotif Miner (MnM) is a database and web system for analyzing short functional peptide motifs, termed minimotifs. We present an update to MnM that grows the database from ∼300 000 to >1 000 000 minimotif consensus sequences and instances. This growth comes largely from updating data from existing databases and annotating articles with high-throughput approaches analyzing different types of post-translational modifications. Another update is the mapping of human proteins and their minimotifs to known human variants from dbSNP, build 150. Now MnM 4 can be used to generate mechanistic hypotheses about how human genetic variation affects minimotifs and outcomes. One example of the utility of the combined minimotif/SNP tool is the identification of a loss-of-function missense SNP in a ubiquitylation minimotif encoded in the excision repair cross-complementing 2 (ERCC2) nucleotide excision repair gene. This SNP reaches genome-wide significance for many types of cancer, and the variant identified with MnM 4 reveals a more detailed mechanistic hypothesis concerning the role of ERCC2 in cancer. Other updates to the web system include a new architecture, with migration of the web system and database to Docker containers for better performance and management. Web links: minimotifminer.org and mnm.engr.uconn.edu.


Subjects
Databases, Protein; Peptides/chemistry; Protein Processing, Post-Translational; Receptors, G-Protein-Coupled/chemistry; Software; Xeroderma Pigmentosum Group D Protein/chemistry; Amino Acid Sequence; Binding Sites; Consensus Sequence; Gene Ontology; Genome, Human; Humans; Internet; Models, Molecular; Molecular Sequence Annotation; Neoplasms/genetics; Neoplasms/metabolism; Neoplasms/pathology; Peptides/genetics; Peptides/metabolism; Polymorphism, Single Nucleotide; Protein Binding; Protein Interaction Domains and Motifs; Receptors, G-Protein-Coupled/genetics; Receptors, G-Protein-Coupled/metabolism; Sequence Alignment; Xeroderma Pigmentosum Group D Protein/genetics; Xeroderma Pigmentosum Group D Protein/metabolism
7.
BMC Genomics; 20(Suppl 5): 424, 2019 Jun 06.
Article in English | MEDLINE | ID: mdl-31167665

ABSTRACT

BACKGROUND: Motifs are crucial patterns with numerous applications, including the identification of transcription factors and their binding sites, composite regulatory patterns, similarity between families of proteins, etc. Several motif models have been proposed in the literature. The (l,d)-motif model is one of these that has been studied widely. However, this model sometimes reports more spurious motifs than expected. We interpret a motif as a biologically significant entity that is evolutionarily preserved within some distance, and it may be highly improbable that the motif undergoes the same number of changes in each of the species. To address this issue, in this paper we introduce a new model that is more general than the (l,d)-motif model. This model, called the (l,d1,d2)-motif model (LDDMS), is NP-hard as well. We present three elegant and efficient exact algorithms to solve the LDDMS problem: LDDMS1, LDDMS2 and LDDMS3. RESULTS: We performed both theoretical analyses and empirical tests on these algorithms. The theoretical analyses demonstrate that our algorithms have lower computational cost than the pattern-driven approach. Empirical results on both simulated and real datasets show that each of the three algorithms has advantages on some (l,d1,d2) instances. CONCLUSIONS: We proposed the LDDMS model, which is more practically relevant, along with three efficient exact algorithms to solve the problem. Moreover, our algorithms can be nicely parallelized. We believe that the idea in this new model can also be extended to other motif search problems such as Edit-distance-based Motif Search (EMS) and Simple Motif Search (SMS).
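The defining check of the model above, that a motif occurs in every sequence with a mismatch count inside [d1, d2] rather than merely at most d, can be stated directly (a brute-force sketch of the model's definition, not the paper's LDDMS1-3 algorithms; the toy sequences are made up):

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def occurs_within(motif, seq, d1, d2):
    """True if some substring of seq matches motif with a Hamming
    distance in the closed interval [d1, d2]."""
    l = len(motif)
    return any(d1 <= hamming(motif, seq[i:i + l]) <= d2
               for i in range(len(seq) - l + 1))

def is_lddms_motif(motif, sequences, d1, d2):
    """(l, d1, d2) model: the motif must occur in every sequence with
    between d1 and d2 mismatches; (l, d) is the special case d1 = 0,
    d2 = d."""
    return all(occurs_within(motif, s, d1, d2) for s in sequences)

seqs = ["ACGTT", "AGGTA", "ACCTG"]
assert is_lddms_motif("ACGT", seqs, 0, 1)
# Raising the lower bound excludes exact occurrences, so the same
# motif no longer qualifies under (4, 1, 1).
assert not is_lddms_motif("ACGT", seqs, 1, 1)
```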


Subjects
Algorithms; Amino Acid Motifs; Nucleotide Motifs; Computational Biology; Humans; Models, Theoretical; Sequence Analysis, DNA/methods; Sequence Analysis, Protein/methods
8.
Bioinformatics; 32(7): 1118-9, 2016 04 01.
Article in English | MEDLINE | ID: mdl-26722114

ABSTRACT

CONTACT: subrata.saha@engr.uconn.edu or rajasek@engr.uconn.edu.


Subjects
Data Compression; Genome; Algorithms; Humans
9.
Bioinformatics; 32(22): 3405-3412, 2016 11 15.
Article in English | MEDLINE | ID: mdl-27485445

ABSTRACT

MOTIVATION: Next-generation sequencing techniques produce millions to billions of short reads. The procedure is not only very cost-effective but can also be done in a laboratory environment. State-of-the-art sequence assemblers then construct the whole genomic sequence from these reads. Current cutting-edge computing technology makes it possible to build genomic sequences from billions of reads at minimal cost and time. As a consequence, we have seen an explosion of biological sequences in recent times. In turn, the cost of storing the sequences in physical memory or transmitting them over the internet is becoming a major bottleneck for research and future medical applications. Data compression techniques are one of the most important remedies in this context. We need suitable data compression algorithms that can exploit the inherent structure of biological sequences. Although standard data compression algorithms are prevalent, they are not suitable for compressing biological sequencing data effectively. In this article, we propose a novel referential genome compression algorithm (NRGC) to compress genomic sequences effectively and efficiently. RESULTS: We have done rigorous experiments to evaluate NRGC using a set of real human genomes. The simulation results show that our algorithm is indeed an effective genome compression algorithm that performs better than the best-known algorithms in most cases. Compression and decompression times are also very impressive. AVAILABILITY AND IMPLEMENTATION: The implementations are freely available for non-commercial purposes. They can be downloaded from: http://www.engr.uconn.edu/~rajasek/NRGC.zip CONTACT: rajasek@engr.uconn.edu.
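NRGC's actual encoding scheme is not given in the abstract. To illustrate the general referential-compression idea it names, here is a minimal greedy encoder that represents a target sequence as copy operations against a reference plus literals (the sequences and the `min_match` threshold are made up; real tools add placement heuristics and entropy-code the op stream):

```python
def ref_compress(target, reference, min_match=4):
    """Greedy referential encoding: wherever a sufficiently long match
    exists, emit a (copy, offset, length) op against the reference;
    otherwise emit the literal character."""
    ops, i = [], 0
    while i < len(target):
        best_off, best_len = -1, 0
        for off in range(len(reference)):
            l = 0
            while (i + l < len(target) and off + l < len(reference)
                   and target[i + l] == reference[off + l]):
                l += 1
            if l > best_len:
                best_off, best_len = off, l
        if best_len >= min_match:
            ops.append(("copy", best_off, best_len))
            i += best_len
        else:
            ops.append(("lit", target[i]))
            i += 1
    return ops

def ref_decompress(ops, reference):
    """Invert ref_compress: expand copies from the reference, pass
    literals through."""
    out = []
    for op in ops:
        out.append(reference[op[1]:op[1] + op[2]] if op[0] == "copy"
                   else op[1])
    return "".join(out)

reference = "ACGTACGTTTGACCA"
target = "ACGTTTGACCATT"
ops = ref_compress(target, reference)
assert ref_decompress(ops, reference) == target
```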


Subjects
Algorithms; Genome; Genomics; Animals; Data Compression; High-Throughput Nucleotide Sequencing; Humans; Sequence Analysis, DNA
10.
Bioinformatics; 32(18): 2783-90, 2016 09 15.
Article in English | MEDLINE | ID: mdl-27283950

ABSTRACT

MOTIVATION: A massive number of bioinformatics applications require counting of k-length substrings in genetically important long strings. A k-mer counter generates the frequencies of each k-length substring in genome sequences. Genome assembly, repeat detection, multiple sequence alignment, error detection and many other related applications use a k-mer counter as a building block. Very fast and efficient algorithms are necessary to count k-mers in large datasets if they are to be useful in such applications. RESULTS: We propose a novel trie-based algorithm for the k-mer counting problem. We compare our algorithm, k-mer Counter based on Multiple Burst Trees (KCMBT), with all well-known available algorithms. Our experimental results show that KCMBT is around 30% faster than the previous best-performing algorithm, KMC2, on the human genome dataset. As another example, our algorithm is around six times faster than Jellyfish2. Overall, KCMBT is 20-30% faster than KMC2 on five benchmark datasets when both algorithms are run using multiple threads. AVAILABILITY AND IMPLEMENTATION: KCMBT is freely available on GitHub: (https://github.com/abdullah009/kcmbt_mt). CONTACT: rajasek@engr.uconn.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
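The k-mer counting contract itself is tiny; a hash-table baseline makes it concrete (KCMBT's contribution is replacing this map with multiple burst trees for speed and memory locality, which this sketch does not attempt):

```python
from collections import Counter

def count_kmers(sequence, k):
    """Exact k-mer counting with a hash table: the frequency of every
    k-length substring of the input."""
    return Counter(sequence[i:i + k]
                   for i in range(len(sequence) - k + 1))

counts = count_kmers("ACGACGT", 3)
assert counts["ACG"] == 2           # occurs at positions 0 and 3
assert sum(counts.values()) == 5    # a length-7 string has 5 3-mers
```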


Subjects
Algorithms; Sequence Alignment; Sequence Analysis, DNA; Base Sequence; Computational Biology/methods; Genome; Humans; Software
11.
Nucleic Acids Res; 43(13): 6399-412, 2015 Jul 27.
Article in English | MEDLINE | ID: mdl-26068475

ABSTRACT

Since the function of a short contiguous peptide minimotif can be introduced or eliminated by a single point mutation, these functional elements may be a source of human variation and a target of selection. We analyzed the variability of ∼300 000 minimotifs in 1092 human genomes from the 1000 Genomes Project. Most minimotifs have been purified by selection, with 94% invariance, which supports important functional roles for minimotifs. Minimotifs are generally under negative selection, possessing high genomic evolutionary rate profiling (GERP) and sitewise likelihood-ratio (SLR) scores. Some are subject to neutral drift or positive selection, similar to coding regions. Most SNPs in minimotifs were common variants, but with minor allele frequencies generally <10%. This was supported by low substitution rates and few newly derived minimotifs. Several minimotif alleles showed different intercontinental and regional geographic distributions, strongly suggesting a role for minimotifs in adaptive evolution. We also note that 4% of PTM minimotif sites in histone tails were common variants, which has the potential to differentially affect DNA packaging among individuals. In conclusion, minimotifs are a source of functional genetic variation in the human population; thus, they are likely to be an important target of selection and evolution.


Subjects
Amino Acid Motifs/genetics; Evolution, Molecular; Animals; Genome, Human; Histones/chemistry; Humans; Polymorphism, Genetic
12.
Inf Process Lett; 118: 78-82, 2017 Feb.
Article in English | MEDLINE | ID: mdl-28630523

ABSTRACT

We consider the problem of pattern matching with k mismatches, where there can be don't-care or wildcard characters in the pattern. Specifically, given a pattern P of length m and a text T of length n, we want to find all occurrences of P in T that have no more than k mismatches. The pattern can have don't-care characters, which match any character. Without don't cares, the best known algorithm for pattern matching with k mismatches has a runtime of [Formula: see text]. With don't cares in the pattern, the best deterministic algorithm has a runtime of O(nk polylog m). Therefore, there is an important gap between the versions with and without don't cares. In this paper we give an algorithm whose runtime increases with the number of don't cares. We define an island to be a maximal length substring of P that does not contain don't cares. Let q be the number of islands in P. We present an algorithm that runs in [Formula: see text] time. If the number of islands q is O(k) this runtime becomes [Formula: see text], which essentially matches the best known runtime for pattern matching with k mismatches without don't cares. If the number of islands q is O(k²), this algorithm is asymptotically faster than the previous best algorithm for pattern matching with k mismatches with don't cares in the pattern.
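The problem statement above has a direct naive solution that the cited algorithms accelerate; writing it out also makes the "island" notion concrete (the text, pattern and wildcard symbol are illustrative):

```python
def k_mismatch_occurrences(text, pattern, k, wildcard='?'):
    """Naive O(nm) scan for positions where the pattern occurs with at
    most k mismatches; wildcard positions in the pattern match any
    character. The maximal wildcard-free runs of the pattern are the
    'islands' the faster algorithms exploit."""
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        mismatches = sum(1 for p, t in zip(pattern, text[i:i + m])
                         if p != wildcard and p != t)
        if mismatches <= k:
            hits.append(i)
    return hits

# 'ab?' has one island, "ab"; the don't care accepts both 'c' and 'd'.
assert k_mismatch_occurrences("abcabd", "ab?", k=0) == [0, 3]
```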

13.
BMC Genomics; 17 Suppl 4: 465, 2016 08 18.
Article in English | MEDLINE | ID: mdl-27557423

ABSTRACT

BACKGROUND: Motif search is an important step in extracting meaningful patterns from biological data. The general problem of motif search is intractable, and there is a pressing need to develop efficient exact and approximation algorithms to solve it. In this paper, we present several novel, exact, sequential and parallel algorithms for solving the (l,d) Edit-distance-based Motif Search (EMS) problem: given two integers l, d and n biological strings, find all strings of length l that appear in each input string with at most d errors of the types substitution, insertion and deletion. METHODS: One popular technique to solve the problem is to explore, for each input string, the set of all possible l-mers that belong to the d-neighborhood of any substring of the input string and output those which are common to all input strings. We introduce a novel and provably efficient neighborhood exploration technique. We show that it is enough to consider the candidates in the neighborhood that are at a distance of exactly d. We compactly represent these candidate motifs using wildcard characters and efficiently explore them with very few repetitions. Our sequential algorithm uses a trie-based data structure to efficiently store and sort the candidate motifs. Our parallel algorithm, in a multi-core shared-memory setting, uses arrays for storing and a novel modification of radix sort for sorting the candidate motifs. RESULTS: Algorithms for EMS are customarily evaluated on several challenging instances such as (8,1), (12,2), (16,3), (20,4), and so on. The best previously known algorithm, EMS1, is sequential and solves instances up to (16,3) in an estimated 3 days. Our sequential algorithms are more than 20 times faster on (16,3). On other hard instances such as (9,2), (11,3) and (13,4), our algorithms are much faster. Our parallel algorithm shows more than 600% scaling performance when using 16 threads.
CONCLUSIONS: Our algorithms have pushed up the state-of-the-art of EMS solvers and we believe that the techniques introduced in this paper are also applicable to other motif search problems such as Planted Motif Search (PMS) and Simple Motif Search (SMS).
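The EMS membership test the abstract defines can be checked brute-force for tiny instances, which clarifies why insertions and deletions force substring widths from l-d to l+d to be examined (a sketch of the problem definition, not of the paper's neighborhood-exploration algorithms; sequences are made up):

```python
def edit_distance(a, b):
    """Single-row dynamic-programming edit distance (substitution,
    insertion and deletion each cost 1)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def is_ems_motif(motif, sequences, d):
    """Brute-force EMS check: the motif must lie within edit distance d
    of some substring of every input string; substring widths from
    l - d to l + d must be tried because of insertions and deletions."""
    l = len(motif)
    def occurs(seq):
        return any(edit_distance(motif, seq[i:i + w]) <= d
                   for w in range(max(1, l - d), l + d + 1)
                   for i in range(len(seq) - w + 1))
    return all(occurs(s) for s in sequences)

assert edit_distance("kitten", "sitting") == 3
assert is_ems_motif("ACG", ["TACGT", "AACG", "ACT"], d=1)
assert not is_ems_motif("ACG", ["TACGT", "AACG", "ACT"], d=0)
```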


Subjects
Algorithms; Amino Acid Motifs/genetics; Nucleotide Motifs/genetics; Software; Computational Biology/methods; Sequence Analysis, DNA/methods; Sequence Analysis, Protein/methods
14.
Bioinformatics; 31(21): 3468-75, 2015 Nov 01.
Article in English | MEDLINE | ID: mdl-26139636

ABSTRACT

MOTIVATION: Genome sequencing has become faster and more affordable. Consequently, the number of available complete genomic sequences is increasing rapidly, and the cost to store, process, analyze and transmit the data is becoming a bottleneck for research and future medical applications. The need for efficient data compression and data reduction techniques for biological sequencing data is therefore growing by the day. Although a number of standard data compression algorithms exist, they are not efficient for compressing biological data, because these generic algorithms do not exploit inherent properties of the sequencing data. To exploit statistical and information-theoretic properties of genomic sequences, we need specialized compression algorithms. Five different next-generation sequencing data compression problems have been identified and studied in the literature. We propose a novel algorithm for one of these problems, known as reference-based genome compression. RESULTS: We have done extensive experiments using five real sequencing datasets. The results on real genomes show that our proposed algorithm is indeed competitive and performs better than the best known algorithms for this problem, achieving superior compression ratios. The time to compress and decompress the whole genome is also very promising. AVAILABILITY AND IMPLEMENTATION: The implementations are freely available for non-commercial purposes. They can be downloaded from http://engr.uconn.edu/∼rajasek/ERGC.zip. CONTACT: rajasek@engr.uconn.edu.


Subjects
Algorithms; Data Compression/methods; Genome, Human; Genomics/methods; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Chromosome Mapping; Databases, Factual; Humans; Information Storage and Retrieval
15.
Bioinformatics; 31(20): 3276-81, 2015 Oct 15.
Article in English | MEDLINE | ID: mdl-26093148

ABSTRACT

MOTIVATION: Next-generation sequencing (NGS) technologies have revolutionized genomic research by reducing the cost of whole-genome sequencing. One of the biggest challenges posed by modern sequencing technology is economical storage of NGS data. Storing raw data is infeasible because of its enormous size and high redundancy. In this article, we address the problem of storage and transmission of large FASTQ files using innovative compression techniques. RESULTS: We introduce a new lossless non-reference-based FASTQ compression algorithm named Lossless FASTQ Compressor. We have compared our algorithm with other state-of-the-art big-data compression algorithms, namely gzip, bzip2, fastqz (Bonfield and Mahoney, 2013), fqzcomp (Bonfield and Mahoney, 2013), Quip (Jones et al., 2012) and DSRC2 (Roguski and Deorowicz, 2014). This comparison reveals that our algorithm achieves better compression ratios on LS454 and SOLiD datasets. AVAILABILITY AND IMPLEMENTATION: The implementations are freely available for non-commercial purposes. They can be downloaded from http://engr.uconn.edu/rajasek/lfqc-v1.1.zip. CONTACT: rajasek@engr.uconn.edu.


Subjects
Algorithms; Data Compression/methods; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Genomics; Information Storage and Retrieval
16.
BMC Bioinformatics; 16 Suppl 17: S2, 2015.
Article in English | MEDLINE | ID: mdl-26678663

ABSTRACT

BACKGROUND: In highly parallel next-generation sequencing (NGS) techniques, millions to billions of short reads are produced from a genomic sequence in a single run. Due to limitations of the NGS technologies, there can be errors in the reads. The error rate of the reads can be reduced by trimming and by correcting the erroneous bases. Correcting reads first yields higher-quality data and greatly reduces the computational complexity of many downstream biological applications. We have developed a novel error correction algorithm called EC and compared it with four other state-of-the-art algorithms using both real and simulated sequencing reads. RESULTS: We have done extensive and rigorous experiments that reveal that EC is indeed an effective, scalable and efficient error correction tool. The real reads employed in our performance evaluation are Illumina-generated short reads of various lengths; the six experimental datasets we utilized are taken from the Sequence Read Archive (SRA) at NCBI. The simulated reads are obtained by picking substrings from random positions of reference genomes; to introduce errors, some bases of the simulated reads are changed to other bases with some probability. CONCLUSIONS: Error correction is a vital problem in biology, especially for NGS data. In this paper we present a novel algorithm, called Error Corrector (EC), for correcting substitution errors in biological sequencing reads. We plan to investigate the possibility of employing the techniques introduced in this paper to handle insertion and deletion errors as well. SOFTWARE AVAILABILITY: The implementation is freely available for non-commercial purposes. It can be downloaded from: http://engr.uconn.edu/~rajasek/EC.zip.
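EC's internals are not described in the abstract; a common family of substitution-error correctors works from the k-mer spectrum: k-mers seen often are trusted, and a base is corrected when a substitution makes all k-mers covering it trusted. A toy sketch under that assumption (reads, k and the count threshold are made up):

```python
from collections import Counter

def trusted_kmers(reads, k, min_count=2):
    """k-mers seen at least min_count times across all reads; rare
    k-mers are presumed to contain sequencing errors."""
    counts = Counter(r[i:i + k] for r in reads
                     for i in range(len(r) - k + 1))
    return {km for km, c in counts.items() if c >= min_count}

def correct_read(read, trusted, k, alphabet="ACGT"):
    """Substitution-only correction sketch: at each position whose
    covering k-mers are not all trusted, try alternative bases and
    keep one that makes every covering k-mer trusted."""
    read = list(read)
    for i in range(len(read)):
        window = range(max(0, i - k + 1), min(i, len(read) - k) + 1)
        def all_trusted():
            return all("".join(read[j:j + k]) in trusted for j in window)
        if all_trusted():
            continue
        original = read[i]
        for base in alphabet:
            read[i] = base
            if all_trusted():
                break
        else:
            read[i] = original   # no single substitution fixes it here
    return "".join(read)

good_reads = ["ACGTACGT"] * 3            # error-free coverage (toy data)
trusted = trusted_kmers(good_reads, k=4)
assert correct_read("ACGTACCT", trusted, k=4) == "ACGTACGT"
```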


Subjects
Algorithms; Sequence Analysis, DNA/methods; Computer Simulation; Databases, Nucleic Acid; High-Throughput Nucleotide Sequencing/methods
17.
BMC Bioinformatics; 16 Suppl 5: S11, 2015.
Article in English | MEDLINE | ID: mdl-25859612

ABSTRACT

Metabolomics is the study of the small molecules, called metabolites, of a cell, tissue or organism. It is of particular interest because endogenous metabolites represent the phenotype resulting from gene expression. A major challenge in metabolomics research is the structural identification of unknown biochemical compounds in complex biofluids. In this paper we present an efficient cheminformatics tool, BioSMXpress, which uses known endogenous mammalian biochemicals and graph matching methods to identify endogenous mammalian biochemical structures in chemical structure space. The results of a comprehensive set of empirical experiments suggest that BioSMXpress identifies endogenous mammalian biochemical structures with high accuracy. BioSMXpress is eight times faster than our previous work, BioSM, without compromising the accuracy of the predictions made. BioSMXpress is freely available at http://engr.uconn.edu/~rajasek/BioSMXpress.zip.


Subjects
Databases, Factual; Metabolomics/methods; Pharmaceutical Preparations/chemistry; Small Molecule Libraries/chemistry; Software; Animals; Mammals; Molecular Structure
18.
J Chem Inf Model; 55(3): 709-18, 2015 Mar 23.
Article in English | MEDLINE | ID: mdl-25668446

ABSTRACT

Metabolic pathways are composed of series of chemical reactions occurring within a cell. In each pathway, enzymes catalyze the conversion of substrates into structurally similar products, so structural similarity provides a potential means of mapping newly identified biochemical compounds to known metabolic pathways. In this paper, we present TrackSM, a cheminformatics tool designed to associate a chemical compound with a known metabolic pathway based on molecular structure matching techniques. Validation experiments show that TrackSM is capable of associating 93% of the tested structures with their correct KEGG pathway class and 88% with their correct individual KEGG pathway. This suggests that TrackSM may be a valuable tool to aid in associating previously unknown small molecules with known biochemical pathways and improve our ability to link metabolomic, proteomic and genomic datasets. TrackSM is freely available at http://metabolomics.pharm.uconn.edu/?q=Software.html.


Subjects
Algorithms; Metabolic Networks and Pathways; Metabolomics/methods; Molecular Structure; Reproducibility of Results; Software
19.
BMC Bioinformatics; 15: 34, 2014 Jan 31.
Article in English | MEDLINE | ID: mdl-24479443

ABSTRACT

BACKGROUND: Motif searching is an important step in the detection of rare events occurring in a set of DNA or protein sequences. One formulation of the problem is known as (l,d)-motif search or Planted Motif Search (PMS). In PMS we are given two integers l and d and n biological sequences, and we want to find all sequences of length l that appear in each of the input sequences with at most d mismatches. The PMS problem is NP-complete. PMS algorithms are typically evaluated on certain instances considered challenging. Despite ample research in the area, a considerable performance gap exists because many state-of-the-art algorithms have large runtimes even for moderately challenging instances. RESULTS: This paper presents a fast exact parallel PMS algorithm called PMS8. PMS8 is the first algorithm to solve the challenging (l,d) instances (25,10) and (26,11). PMS8 is also efficient on instances with larger l and d such as (50,21). We include a comparison of PMS8 with several state-of-the-art algorithms on multiple problem instances. This paper also presents necessary and sufficient conditions for three l-mers to have a common d-neighbor. The program is freely available at http://engr.uconn.edu/~man09004/PMS8/. CONCLUSIONS: We present PMS8, an efficient exact algorithm for Planted Motif Search that introduces novel ideas for generating common neighborhoods. We have also implemented a parallel version of the algorithm. PMS8 can solve instances not solved by any previous algorithm.
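The (l,d) problem statement above can be solved brute-force for tiny l, which makes the definition concrete and shows why pruning matters (this enumerates all 4^l candidates; it is a sketch of the problem, not of PMS8, and the toy sequences are made up):

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def planted_motif_search(sequences, l, d, alphabet="ACGT"):
    """Brute-force (l, d) PMS: enumerate all |alphabet|^l candidates and
    keep those within Hamming distance d of some l-mer of every input
    sequence. Feasible only for tiny l; pruning this search space is
    exactly what solvers like PMS8 are about."""
    def occurs(motif, seq):
        return any(hamming(motif, seq[i:i + l]) <= d
                   for i in range(len(seq) - l + 1))
    motifs = set()
    for cand in product(alphabet, repeat=l):
        cand = "".join(cand)
        if all(occurs(cand, s) for s in sequences):
            motifs.add(cand)
    return motifs

seqs = ["ACGT", "ACGA", "TCGT"]
assert "ACG" in planted_motif_search(seqs, l=3, d=1)
assert planted_motif_search(seqs, l=3, d=0) == set()
```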


Subjects
Computational Biology/methods; Sequence Analysis, DNA/methods; Sequence Analysis, Protein/methods; Software; Algorithms; DNA/chemistry; DNA/genetics; Proteins/chemistry; Proteins/genetics
20.
BMC Genomics; 15 Suppl 5: S5, 2014.
Article in English | MEDLINE | ID: mdl-25081913

ABSTRACT

In next-generation sequencing techniques, millions of short reads are produced from a genomic sequence in a single run. The chances of low read coverage in some regions of the sequence are very high, the reads are short, and they are very large in number. Due to erroneous base calling, there can also be errors in the reads. As a consequence, sequence assemblers often fail to sequence an entire DNA molecule and instead output a set of overlapping segments that together represent a consensus region of the DNA. These overlapping segments are collectively called contigs in the literature. The final step of the sequencing process, called scaffolding, is to assemble the contigs into the correct order. Scaffolding techniques typically exploit additional information such as mate-pairs, paired-ends or optical restriction maps. In this paper we introduce a series of novel scaffolding algorithms that exploit optical restriction maps (ORMs). Simulation results show that our algorithms are indeed reliable, scalable and efficient compared to the best known algorithms in the literature.
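The paper's ORM algorithms are not described in the abstract. One simple way to see how an optical map can order contigs is to match each contig's internal restriction-fragment lengths against the genome-wide fragment list and sort by the matched position; a heavily simplified sketch (all fragment lengths are invented, exact matching only, no measurement noise):

```python
def place_contig(contig_frags, optical_map):
    """Locate a contig on a genome-wide optical restriction map by
    matching its internal fragment lengths (terminal fragments are
    truncated by the contig ends, so they are ignored). Returns the
    map index of the first matching fragment, or None."""
    internal = contig_frags[1:-1]
    if not internal:
        return None
    n = len(internal)
    for i in range(len(optical_map) - n + 1):
        if optical_map[i:i + n] == internal:
            return i
    return None

def order_contigs(contig_frag_lists, optical_map):
    """Order contigs by their placement on the optical map; contigs
    that cannot be placed are dropped in this sketch."""
    placed = [(place_contig(f, optical_map), idx)
              for idx, f in enumerate(contig_frag_lists)]
    return [idx for pos, idx in sorted(p for p in placed
                                       if p[0] is not None)]

# Toy map of six fragment lengths and two contigs:
optical_map = [5, 3, 7, 2, 8, 4]
contigs = [[9, 8, 6],        # internal fragments [8]       -> map index 4
           [1, 3, 7, 2, 5]]  # internal fragments [3, 7, 2] -> map index 1
assert order_contigs(contigs, optical_map) == [1, 0]
```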


Subjects
Contig Mapping; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Algorithms; Computational Biology; Genome, Bacterial; Yersinia/genetics