Results 1 - 18 of 18
1.
Bioinformatics ; 35(21): 4411-4412, 2019 Nov 01.
Article in English | MEDLINE | ID: mdl-31038667

ABSTRACT

SUMMARY: Although heteroplasmy has been studied extensively in animal systems, there is a lack of tools for analyzing, exploring and visualizing heteroplasmy at the genome-wide level in other taxonomic systems. We introduce icHET, which is a computational workflow that produces an interactive visualization that facilitates the exploration, analysis and discovery of heteroplasmy across multiple genomic samples. icHET works on short reads from multiple samples from any organism with an organellar reference genome (mitochondrial or plastid) and a nuclear reference genome. AVAILABILITY AND IMPLEMENTATION: The software is available at https://github.com/vtphan/HeteroplasmyWorkflow. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genomics , Software , Animals , Genome , Workflow
2.
Bioinformatics ; 34(17): 2918-2926, 2018 Sep 01.
Article in English | MEDLINE | ID: mdl-29590294

ABSTRACT

Motivation: The detection of genomic variants has great significance in genomics, bioinformatics, biomedical research and its applications. However, despite a lot of effort, Indels and structural variants are still under-characterized compared to SNPs. Current approaches based on next-generation sequencing data usually require large numbers of reads (high coverage) to be able to detect such types of variants accurately. Even so, Indels, especially those close to each other, are still hard to detect accurately. Results: We introduce a novel approach that leverages known variant information, e.g. provided by dbSNP, dbVar, ExAC or the 1000 Genomes Project, to improve sensitivity of detecting variants, especially close-by Indels. In our approach, the standard reference genome and the known variants are combined to build a meta-reference, which is expected to be probabilistically closer to the subject genomes than the standard reference. An alignment algorithm, which can take into account known variant information, is developed to accurately align reads to the meta-reference. This strategy resulted in accurate alignment and variant calling even with low coverage data. We showed that compared to popular methods such as GATK and SAMtools, our method significantly improves the sensitivity of detecting variants, especially Indels that are close to each other. In particular, our method was able to call these close-by Indels at a 15-20% higher sensitivity than other methods at low coverage, and still get 1-5% higher sensitivity at high coverage, at competitive precision. These results were validated using simulated data with variant profiles extracted from the 1000 Genomes Project data, and real data from the Illumina Platinum Genomes Project and ExAC database. Our findings suggest that by incorporating known variant information in an appropriate manner, sensitive variant calling is possible at a low cost.
Availability and implementation: Implementation can be found in our public code repository https://github.com/namsyvo/IVC. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
High-Throughput Nucleotide Sequencing/methods , INDEL Mutation , Algorithms , Genome, Human , Genomics/methods , Humans
3.
BMC Bioinformatics ; 18(Suppl 14): 499, 2017 Dec 28.
Article in English | MEDLINE | ID: mdl-29297282

ABSTRACT

BACKGROUND: Quantification and identification of microbial genomes based on next-generation sequencing data is a challenging problem in metagenomics. Although current methods have mostly focused on analyzing bacteria whose genomes have been sequenced, such analyses are complicated by the presence of unknown bacteria or bacteria whose genomes have not been sequenced. RESULTS: We propose a method for detecting unknown bacteria in environmental samples. Our approach is unique in its utilization of short reads only from 16S rRNA genes, not from entire genomes. We show that short reads from 16S rRNA genes retain sufficient information for detecting unknown bacteria in oral microbial communities. CONCLUSION: In our experimentation with bacterial genomes from the Human Oral Microbiome Database, we found that this method made accurate and robust predictions at different read coverages and percentages of unknown bacteria. Advantages of this approach include not only a reduction in experimental and computational costs but also a potentially high accuracy across environmental samples due to the strong conservation of the 16S rRNA gene.


Subjects
Bacteria/genetics , Bacteria/isolation & purification , Microbiota/genetics , RNA, Ribosomal, 16S/genetics , Algorithms , Genetic Markers , Genome, Bacterial , High-Throughput Nucleotide Sequencing , Humans , Metagenome , Sequence Analysis, DNA/methods
4.
BMC Bioinformatics ; 17(Suppl 13): 349, 2016 Oct 06.
Article in English | MEDLINE | ID: mdl-27766935

ABSTRACT

Efforts such as the International HapMap Project and the 1000 Genomes Project resulted in a catalog of millions of single-nucleotide and insertion/deletion (INDEL) variants of the human population. Viewed as a reference of existing variants, this resource commonly serves as a gold standard for studying and developing methods to detect genetic variants. Our analysis revealed that this reference contained thousands of INDELs that were constructed in a biased manner. This bias occurred at the level of aligning short reads to reference genomes to detect variants. The bias is caused by the existence of many theoretically optimal alignments between the reference genome and reads containing alternative alleles at those INDEL locations. We examined several popular aligners and showed that these aligners could be divided into groups whose alignments yielded INDELs that agreed strongly or disagreed strongly with reported INDELs. This finding suggests that the agreement or disagreement between the aligners' called INDEL and the reported INDEL is merely a result of the arbitrary selection of one of the optimal alignments. The existence of bias in INDEL calling might have a serious influence on downstream analyses. As such, our finding suggests that this phenomenon should be further addressed.


Subjects
Genome, Human , INDEL Mutation , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software , Alleles , Data Accuracy , Genomics/methods , Humans , Polymorphism, Genetic
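
The abstract of record 4 attributes the bias to the existence of many equally optimal alignments at an INDEL site. A toy illustration of that ambiguity, not the authors' analysis: inside a repeat, deleting the same number of bases at several different positions yields the identical alternate sequence, so an aligner has to pick one placement arbitrarily. The function name and the example sequences are assumptions made for this sketch.

```python
def equivalent_deletions(ref: str, start: int, length: int) -> list[int]:
    """List every start position whose deletion of `length` bases
    produces the same alternate sequence as deleting at `start`."""
    if length <= 0 or start < 0 or start + length > len(ref):
        raise ValueError("deletion must lie inside the reference")
    alt = ref[:start] + ref[start + length:]
    return [
        i
        for i in range(len(ref) - length + 1)
        if ref[:i] + ref[i + length:] == alt
    ]


# Deleting one T from the homopolymer run can be reported at four positions.
print(equivalent_deletions("GATTTTC", 2, 1))
# Deleting one CG unit from a dinucleotide repeat is ambiguous as well.
print(equivalent_deletions("ACGCGCGT", 1, 2))
```

Each returned position corresponds to an alignment with the same score, which is why two aligners can both be "correct" while reporting different INDEL coordinates.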
5.
BMC Bioinformatics ; 16 Suppl 17: S3, 2015.
Article in English | MEDLINE | ID: mdl-26678826

ABSTRACT

BACKGROUND: Although it is frequently observed that aligning short reads to genomes becomes harder if they contain complex repeat patterns, there has not been much effort to quantify the relationship between complexity of genomes and difficulty of short-read alignment. Existing measures of sequence complexity seem unsuitable for the understanding and quantification of this relationship. RESULTS: We investigated several measures of complexity and found that length-sensitive measures of complexity had the highest correlation with accuracy of alignment. In particular, the rate of distinct substrings of length k, where k is similar to the read length, correlated very highly with alignment performance in terms of precision and recall. We showed how to compute this measure efficiently in linear time, making it useful in practice to estimate quickly the difficulty of alignment for new genomes without having to align reads to them first. We showed how the length-sensitive measures could provide additional information for choosing aligners that would align consistently accurately on new genomes. CONCLUSIONS: We formally established a connection between genome complexity and the accuracy of short-read aligners. The relationship between genome complexity and alignment accuracy provides additional useful information for selecting suitable aligners for new genomes. Further, this work suggests that the complexity of genomes sometimes should be thought of in terms of specific computational problems, such as the alignment of short reads to genomes.


Subjects
Genome , Sequence Alignment/methods , Animals , Base Sequence , Humans , Sequence Analysis, DNA , Software , Time Factors
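
The length-sensitive measure named in the abstract of record 5, the rate of distinct substrings of length k, can be sketched as follows. This is a minimal illustration under stated assumptions: the normalization (distinct k-mers divided by the number of k-mer positions) is assumed rather than taken from the paper, and the set-based computation below is not the linear-time method the abstract refers to.

```python
def distinct_kmer_rate(genome: str, k: int) -> float:
    """Fraction of length-k windows in `genome` that are distinct.

    Values near 1.0 mean almost every k-mer is unique (little repetition);
    values near 0.0 mean the sequence is highly repetitive at length k.
    """
    n = len(genome)
    if k <= 0 or k > n:
        raise ValueError("k must satisfy 1 <= k <= len(genome)")
    positions = n - k + 1
    distinct = {genome[i:i + k] for i in range(positions)}
    return len(distinct) / positions


print(distinct_kmer_rate("AAAAAAAA", 3))  # a single repeated k-mer
print(distinct_kmer_rate("ACGTACGT", 3))  # a period-4 tandem repeat
print(distinct_kmer_rate("ACGT", 2))      # every k-mer is distinct
```

Choosing k close to the read length, as the abstract suggests, makes the score reflect how often a read of that length could map to more than one place.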
6.
BMC Bioinformatics ; 15 Suppl 11: S2, 2014.
Article in English | MEDLINE | ID: mdl-25350806

ABSTRACT

BACKGROUND: The analysis of gene expression has played an important role in medical and bioinformatics research. Although it is known that a large number of samples is needed to determine the patterns of gene expression accurately, practical designs of gene expression studies occasionally have insufficient numbers of samples, making it difficult to ascertain true response patterns of variantly expressed genes. RESULTS: We describe an approach to cope with the challenge of predicting true orders of gene response to treatments. We show that true patterns of gene response must be orderable sets. In experiments with few samples, we modify the conventional pairwise comparison tests and increase the significance level α intelligently to deduce orderable patterns, which are most likely true orders of gene response. Additionally, motivated by the fact that a gene can be involved in multiple biological functions, our method further resamples experimental replicates and predicts multiple response patterns for each gene. CONCLUSIONS: This method can be useful in designing cost-effective experiments with small sample sizes. Patterns of highly-variantly expressed genes can be predicted by varying α intelligently. Furthermore, clusters are labeled meaningfully with patterns that describe precisely how genes in such clusters respond to treatments.


Subjects
Gene Expression Profiling/methods , Animals , Cluster Analysis , Gene Regulatory Networks , Rats, Sprague-Dawley , Sample Size , Transcription Factors/metabolism
7.
BMC Genomics ; 15 Suppl 5: S2, 2014.
Article in English | MEDLINE | ID: mdl-25081493

ABSTRACT

BACKGROUND: The alignment of short reads generated by next-generation sequencers to genomes is an important problem in many biomedical and bioinformatics applications. Although many proposed methods work very well on narrow ranges of read lengths, they tend to suffer in performance and alignment quality for reads outside of these ranges. RESULTS: We introduce RandAL, a novel method that aligns DNA sequences to reference genomes. Our approach utilizes two FM indices to facilitate efficient bidirectional searching, a pruning heuristic to speed up the computing of edit distances, and most importantly, a randomized strategy that enables effective estimation of key parameters. Extensive comparisons showed that RandAL outperformed popular aligners in most instances and was unique in its consistent and accurate performance over a wide range of read lengths and error rates. The software package is publicly available at https://github.com/namsyvo/RandAL. CONCLUSIONS: RandAL promises to align effectively and accurately short reads that come from a variety of technologies with different read lengths and rates of sequencing error.


Subjects
Algorithms , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Computational Biology , Genome , High-Throughput Nucleotide Sequencing/methods , Software
8.
Microbiome Res Rep ; 3(2): 25, 2024.
Article in English | MEDLINE | ID: mdl-38841411

ABSTRACT

Objectives: This study introduces MetaBIDx, a computational method designed to enhance species prediction in metagenomic environments. The method addresses the challenge of accurate species identification in complex microbiomes, which is due to the large number of generated reads and the ever-expanding number of bacterial genomes. Bacterial identification is essential for disease diagnosis and tracing outbreaks associated with microbial infections. Methods: MetaBIDx utilizes a modified Bloom filter for efficient indexing of reference genomes and incorporates a novel strategy for reducing false positives by clustering species based on their genomic coverages by identified reads. The approach was evaluated and compared with several well-established tools across various datasets. Precision, recall, and F1-score were used to quantify the accuracy of species prediction. Results: MetaBIDx demonstrated superior performance compared to other tools, especially in terms of precision and F1-score. The application of clustering based on approximate coverages significantly improved precision in species identification, effectively minimizing false positives. We further demonstrated that other methods can also benefit from our approach to removing false positives by clustering species based on approximate coverages. Conclusion: With a novel approach to reducing false positives and the effective use of a modified Bloom filter to index species, MetaBIDx represents an advancement in metagenomic analysis. The findings suggest that the proposed approach could also benefit other metagenomic tools, indicating its potential for broader application in the field. The study lays the groundwork for future improvements in computational efficiency and the expansion of microbial databases.
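
For readers unfamiliar with the data structure mentioned in the abstract of record 8, here is a minimal sketch of a plain Bloom filter used to index k-mers. It is a textbook version, not MetaBIDx's modified filter, whose details the abstract does not give; the class name, the SHA-256-based hashing, and the parameter values are all assumptions.

```python
import hashlib


class BloomFilter:
    """Textbook Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, size: int, num_hashes: int) -> None:
        if size <= 0 or num_hashes <= 0:
            raise ValueError("size and num_hashes must be positive")
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # one byte per bit, for simplicity

    def _positions(self, item: str):
        # Derive num_hashes positions by salting one cryptographic hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))


index = BloomFilter(size=10_000, num_hashes=3)
reference = "ACGTTGCAACGT"
k = 4
for i in range(len(reference) - k + 1):
    index.add(reference[i:i + k])

print("ACGT" in index)  # an indexed k-mer is always reported as present
```

A lookup can return a false positive but never a false negative, which is consistent with the abstract's emphasis on a separate step for reducing false positives.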

9.
Front Big Data ; 5: 1018356, 2022.
Article in English | MEDLINE | ID: mdl-36466712

ABSTRACT

Classifying or identifying bacteria in metagenomic samples is an important problem in the analysis of metagenomic data. This task can be computationally expensive since microbial communities usually consist of hundreds to thousands of environmental microbial species. We proposed a new method for representing bacteria in a microbial community using genomic signatures of those bacteria. With respect to the microbial community, the genomic signatures of each bacterium are unique to that bacterium; they do not exist in other bacteria in the community. Further, since the genomic signatures of a bacterium are much smaller than its genome size, the approach allows for a compressed representation of the microbial community. This approach uses a modified Bloom filter to store short k-mers with hash values that are unique to each bacterium. We show that most bacteria in many microbiomes can be represented uniquely using the proposed genomic signatures. This approach paves the way toward new methods for classifying bacteria in metagenomic samples.
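
The idea of genomic signatures that are unique to one bacterium within a community can be sketched with exact sets. This is only an illustration of the uniqueness criterion: the paper stores hashed k-mers in a modified Bloom filter, whereas the sketch below keeps plain k-mer strings in memory, and the function names and toy sequences are assumptions.

```python
from collections import defaultdict


def kmers(seq: str, k: int) -> set[str]:
    """All distinct substrings of length k in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_signatures(genomes: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map each genome name to the k-mers found in that genome only."""
    owners: dict[str, set[str]] = defaultdict(set)
    for name, seq in genomes.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(name)

    signatures: dict[str, set[str]] = {name: set() for name in genomes}
    for kmer, names in owners.items():
        if len(names) == 1:  # present in exactly one member of the community
            signatures[next(iter(names))].add(kmer)
    return signatures


community = {"species_a": "ACGTAC", "species_b": "CGTACC"}
print(unique_signatures(community, 3))
```

Because shared k-mers are discarded, each signature set is typically much smaller than the genome it represents, which is the compression effect the abstract describes.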

10.
BMC Bioinformatics ; 12 Suppl 10: S19, 2011 Oct 18.
Article in English | MEDLINE | ID: mdl-22165960

ABSTRACT

BACKGROUND: Identification of transcription factors (TFs) responsible for modulation of differentially expressed genes is a key step in deducing gene regulatory pathways. Most current methods identify TFs by searching for presence of DNA binding motifs in the promoter regions of co-regulated genes. However, this strategy may not always be useful as presence of a motif does not necessarily imply a regulatory role. Conversely, motif presence may not be required for a TF to regulate a set of genes. Therefore, it is imperative to include functional (biochemical and molecular) associations, such as those found in the biomedical literature, into algorithms for identification of putative regulatory TFs that might be explicitly or implicitly linked to the genes under investigation. RESULTS: In this study, we present a Latent Semantic Indexing (LSI) based text mining approach for identification and ranking of putative regulatory TFs from microarray derived differentially expressed genes (DEGs). Two LSI models were built using different term weighting schemes to devise pair-wise similarities between 21,027 mouse genes annotated in the Entrez Gene repository. Amongst these genes, 433 were designated TFs in the TRANSFAC database. The LSI derived TF-to-gene similarities were used to calculate TF literature enrichment p-values and rank the TFs for a given set of genes. We evaluated our approach using five different publicly available microarray datasets focusing on TFs Rel, Stat6, Ddit3, Stat5 and Nfic. In addition, for each of the datasets, we constructed gold standard TFs known to be functionally relevant to the study in question. Receiver Operating Characteristics (ROC) curves showed that the log-entropy LSI model outperformed the tf-normal LSI model and a benchmark co-occurrence based method for four out of five datasets, as well as motif searching approaches, in identifying putative TFs. 
CONCLUSIONS: Our results suggest that our LSI based text mining approach can complement existing approaches used in systems biology research to decipher gene regulatory networks by providing putative lists of ranked TFs that might be explicitly or implicitly associated with sets of DEGs derived from microarray experiments. In addition, unlike motif searching approaches, LSI based approaches can reveal TFs that may indirectly regulate genes.


Subjects
Algorithms , Data Mining/methods , Gene Regulatory Networks , Oligonucleotide Array Sequence Analysis , Transcription Factors/isolation & purification , Amino Acid Motifs , Animals , Humans , Mice , PubMed , Systems Biology , Transcription Factors/chemistry , Transcription Factors/genetics , Transcription Factors/metabolism
11.
Genes (Basel) ; 11(8), 2020 Aug 17.
Article in English | MEDLINE | ID: mdl-32824429

ABSTRACT

Most current approaches to metagenomic classification employ short next generation sequencing (NGS) reads that are present in metagenomic samples to identify unique genomic regions. NGS reads, however, might not be long enough to differentiate similar genomes. This suggests a potential for using longer reads to improve classification performance. Presently, longer reads tend to have a higher rate of sequencing errors. Thus, given the pros and cons, it remains unclear which type of read is better for metagenomic classification. We compared two taxonomic classification protocols: a traditional assembly-free protocol and a novel assembly-based protocol. The novel assembly-based protocol consists of assembling short reads into longer reads, which are subsequently classified by a traditional taxonomic classifier. We discovered that most classifiers made fewer predictions with longer reads and that they achieved higher classification performance on synthetic metagenomic data. Generally, we observed a significant increase in precision, while having similar recall rates. On real data, we observed similar characteristics, which suggest that the classifiers might likewise achieve higher precision at similar recall with longer reads. We have shown a noticeable difference in performance between assembly-based and assembly-free taxonomic classification. This finding strongly suggests that classifying species in metagenomic environments can be achieved with higher overall performance simply by assembling short reads. Further, it also suggests that long-read technologies might be better for species classification.


Subjects
DNA Barcoding, Taxonomic , Metagenome , Metagenomics , Computational Biology , DNA Barcoding, Taxonomic/methods , Metagenomics/methods , Reproducibility of Results , Workflow
12.
Carcinogenesis ; 30(3): 480-6, 2009 Mar.
Article in English | MEDLINE | ID: mdl-19126641

ABSTRACT

3H-1,2-dithiole-3-thione (D3T) and its analogues 4-methyl-5-pyrazinyl-3H-1,2-dithiole-3-thione (OLT) and 5-tert-butyl-3H-1,2-dithiole-3-thione (TBD) are chemopreventive agents that block or diminish early stages of carcinogenesis by inducing activities of detoxication enzymes. While OLT has been used in clinical trials, TBD has been shown to be more efficacious and possibly less toxic than OLT in animals. Here, we utilize a robust and high-resolution chemical genomics procedure to examine the pharmacological structure-activity relationships of these compounds in livers of male rats by microarray analyses. We identified 226 differentially expressed genes that were common to all treatments. Functional analysis identified the relation of these genes to glutathione metabolism and the nuclear factor, erythroid derived 2-related factor 2 (Nrf2) pathway that is known to regulate many of the protective actions of dithiolethiones. OLT and TBD were shown to have similar efficacies and both were weaker than D3T. In addition, we identified 40 genes whose responses were common to OLT and TBD, yet distinct from D3T. As inhibition of cytochrome P450 (CYP) has been associated with the effects of OLT on CYP expression, we determined the half maximal inhibitory concentration (IC50) values for inhibition of CYP1A2. The rank order of inhibitor potency was OLT >> TBD >> D3T, with IC50 values estimated as 0.2, 12.8 and >100 µM, respectively. Functional analysis revealed that OLT and TBD, in addition to their effects on CYP, modulate liver lipid metabolism, especially fatty acids. Together, these findings provide new insight into the actions of clinically relevant and lead dithiolethione analogues.


Subjects
Anticarcinogenic Agents , Gene Expression Profiling , Heterocyclic Compounds, 1-Ring , Thiones , Thiophenes , Animals , Male , Rats , Anticarcinogenic Agents/pharmacology , Cytochrome P-450 CYP1A2/metabolism , Genomics , Glutathione/metabolism , Heterocyclic Compounds, 1-Ring/pharmacology , Liver/drug effects , Liver/metabolism , Multigene Family , Oligonucleotide Array Sequence Analysis , Pyrazines , Rats, Inbred F344 , Structure-Activity Relationship , Thiones/pharmacology , Thiophenes/pharmacology , NF-E2-Related Factor 2/metabolism
13.
Bioinformatics ; 24(24): 2930-1, 2008 Dec 15.
Article in English | MEDLINE | ID: mdl-19017656

ABSTRACT

MOTIVATION: Motif Tool Manager is a web-based framework for comparing and combining different approaches to discover novel DNA motifs. It comes with a set of five well-known approaches to motif discovery. It provides an easy mechanism for adding new motif finding tools to the framework through a web-interface and a minimal setup of the tools on the server. Users can execute the tools through the web-based framework and compare results from such executions. The framework provides a basic mechanism for identifying the most similar motif candidates found by a majority of the motif finding tools. AVAILABILITY: http://cetus.cs.memphis.edu/motif


Subjects
Sequence Analysis, DNA/methods , Software , Algorithms , DNA/chemistry , Internet
14.
J Bioinform Comput Biol ; 7(1): 135-56, 2009 Feb.
Article in English | MEDLINE | ID: mdl-19226664

ABSTRACT

Post hoc assignment of patterns determined by all pairwise comparisons in microarray experiments with multiple treatments has been proven to be useful in assessing treatment effects. We propose the usage of transitive directed acyclic graphs (tDAG) as the representation of these patterns and show that such representation can be useful in clustering treatment effects, annotating existing clustering methods, and analyzing sample sizes. Advantages of this approach include: (1) unique and descriptive meaning of each cluster in terms of how genes respond to all pairs of treatments; (2) insensitivity of the observed patterns to the number of genes analyzed; and (3) a combinatorial perspective to address the sample size problem by observing the rate of contractible tDAG as the number of replicates increases. The advantages and overall utility of the method in elaborating drug structure activity relationships are exemplified in a controlled study with real and simulated data.


Subjects
Algorithms , Artificial Intelligence , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Pattern Recognition, Automated/methods , Data Display
15.
J Bioinform Comput Biol ; 15(3): 1740001, 2017 Jun.
Article in English | MEDLINE | ID: mdl-28345370

ABSTRACT

Determining abundances of microbial genomes in metagenomic samples is an important problem in analyzing metagenomic data. Although homology-based methods are popular, they have been shown to be computationally expensive due to the alignment of tens of millions of reads from metagenomic samples to reference genomes of hundreds to thousands of environmental microbial species. We introduce an efficient alignment-free approach to estimate abundances of microbial genomes in metagenomic samples. The approach is based on solving linear and quadratic programs, which are represented by genome-specific markers (GSM). We compared our method against popular alignment-free and homology-based methods. Without contamination, our method was more accurate than other alignment-free methods while being much faster than a homology-based method. In more realistic settings where samples were contaminated with human DNA, our method was the most accurate method in predicting abundance at varying levels of contamination. We achieve higher accuracy than both alignment-free and homology-based methods.


Subjects
Metagenomics/methods , Microbial Consortia/genetics , Sequence Analysis, DNA/methods , Databases, Genetic , Genetic Markers , Genome
16.
Int J Bioinform Res Appl ; 6(1): 21-36, 2010.
Article in English | MEDLINE | ID: mdl-20110207

ABSTRACT

We propose a novel method to estimate editing efficiency by adenosine deaminases that act on RNA (ADARs). The method employs the notion of stability of secondary structure in the vicinity of edited sites during transcription. Such an analysis of 'dynamic' structural motifs of RNA is important because as a pre-spliced RNA is being transcribed and elongated, its entire structure, and thus its local structures, may change drastically. Our simulation showed that the stability of structures in the vicinity of edited sites correlates moderately highly with editing efficiency of edited sites recently established in laboratory experiments.


Subjects
Adenosine Deaminase/chemistry , RNA Editing , RNA/chemistry , Adenosine , Adenosine Deaminase/metabolism , Base Sequence , Computer Simulation , Molecular Sequence Data , Nucleic Acid Conformation
17.
Int J Data Min Bioinform ; 4(4): 377-94, 2010.
Article in English | MEDLINE | ID: mdl-20815138

ABSTRACT

Hidden stops are nucleotide triples TAA, TAG and TGA that appear on the second and third reading frames of a protein coding gene. Recent studies suggested the important role of hidden stops in preventing misreading of mRNA. We study the problem of designing protein-encoding genes with a large number of hidden stops under several biological constraints. With simple constraints, redesigned genes have a provably maximal number of hidden stops. With more complex constraints, redesigned genes still have many more hidden stops than wild-type genes. We showed that redesigned genes have a distinct positional advantage in assisting early termination of frame-shifts.


Subjects
Genes, Synthetic , Base Sequence , Codon, Terminator , Open Reading Frames , RNA, Messenger/metabolism
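
The definition of hidden stops given in the abstract of record 17 translates directly into a counting routine. The sketch below only counts hidden stops in an existing coding sequence; it does not attempt the gene redesign the paper studies. The function name and the example sequences are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def count_hidden_stops(cds: str) -> int:
    """Count stop triplets in the second and third reading frames
    (offsets 1 and 2 relative to the coding frame) of `cds`."""
    cds = cds.upper()
    count = 0
    for offset in (1, 2):
        # Step through complete triplets of the shifted frame only.
        for i in range(offset, len(cds) - 2, 3):
            if cds[i:i + 3] in STOP_CODONS:
                count += 1
    return count


# Coding frame: ATG ATA GCC; the +1 frame reads TGA and TAG.
print(count_hidden_stops("ATGATAGCC"))
```

A ribosome that slips by one base on the example sequence would meet a stop almost immediately, which is the early-termination effect the abstract describes.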
18.
Int J Comput Biol Drug Des ; 1(2): 174-84, 2008.
Article in English | MEDLINE | ID: mdl-20058488

ABSTRACT

Proper management of bioinformatics data and tools is crucial because the amount of data is enormous, the type of data varies, and there are often different approaches (and consequently tools) for solving a particular problem. While specialised systems exist to serve specific needs, such systems are difficult to adapt and require large resource commitments for development and maintenance. We propose a system called Bioinformatics Tools and Data Management System (BioTDMS) that uses open-source technologies to provide a platform for managing both data and tools. We present case studies that show some potential applications of this system.


Subjects
Algorithms , Computational Biology/methods , Research Design , Information Management , Internet , Software