1.
Brief Bioinform ; 22(1): 96-108, 2021 01 18.
Article in English | MEDLINE | ID: mdl-32568371

ABSTRACT

The unprecedented coverage offered by next-generation sequencing (NGS) technology has facilitated the assessment of the population complexity of intra-host RNA viral populations at an unprecedented level of detail. Consequently, analysis of NGS datasets could be used to extract and infer crucial epidemiological and biomedical information on the levels of both infected individuals and susceptible populations, thus enabling the development of more effective prevention strategies and antiviral therapeutics. Such information includes drug resistance, infection stage, transmission clusters and structures of transmission networks. However, NGS data require sophisticated analysis dealing with millions of error-prone short reads per patient. Prior to the NGS era, epidemiological and phylogenetic analyses were geared toward Sanger sequencing technology; now, they must be redesigned to handle the large-scale NGS datasets and properly model the evolution of heterogeneous rapidly mutating viral populations. Additionally, dedicated epidemiological surveillance systems require big data analytics to handle millions of reads obtained from thousands of patients for rapid outbreak investigation and management. We survey bioinformatics tools analyzing NGS data for (i) characterization of intra-host viral population complexity including single nucleotide variant and haplotype calling; (ii) downstream epidemiological analysis and inference of drug-resistant mutations, age of infection and linkage between patients; and (iii) data collection and analytics in surveillance systems for fast response and control of outbreaks.


Subjects
Epidemiological Monitoring, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, RNA Virus Infections/virology, RNA Viruses/genetics, Humans, RNA Virus Infections/epidemiology, RNA Viruses/classification, RNA Viruses/isolation & purification, RNA Viruses/pathogenicity
2.
BMC Bioinformatics ; 19(Suppl 11): 358, 2018 Oct 22.
Article in English | MEDLINE | ID: mdl-30343674

ABSTRACT

BACKGROUND: Molecular surveillance and outbreak investigation are important for elimination of hepatitis C virus (HCV) infection in the United States. A web-based system, Global Hepatitis Outbreak and Surveillance Technology (GHOST), has been developed using Illumina MiSeq-based amplicon sequence data derived from the HCV E1/E2-junction genomic region to enable public health institutions to conduct cost-effective and accurate molecular surveillance, outbreak detection and strain characterization. However, as there are many factors that could impact input data quality to which the GHOST system is not completely immune, accuracy of epidemiological inferences generated by GHOST may be affected. Here, we analyze the data submitted to the GHOST system during its pilot phase to assess the nature of the data and to identify common quality concerns that can be detected and corrected automatically. RESULTS: The GHOST quality control filters were individually examined, and quality failure rates were measured for all samples, including negative controls. New filters were developed and introduced to detect primer dimers, loss of specimen-specific product, or short products. The genotyping tool was adjusted to improve the accuracy of subtype calls. The identification of "chordless" cycles in a transmission network from data generated with known laboratory-based quality concerns allowed for further improvement of transmission detection by GHOST in surveillance settings. Parameters derived to detect actionable common quality control anomalies were incorporated into the automatic quality control module that rejects data depending on the magnitude of a quality problem, and warns and guides users in performing correctional actions. The guiding responses generated by the system are tailored to the GHOST laboratory protocol. 
CONCLUSIONS: Several new quality control problems were identified in MiSeq data submitted to GHOST and used to improve protection of the system from erroneous data and users from erroneous inferences. The GHOST system was upgraded to include identification of causes of erroneous data and recommendation of corrective actions to laboratory users.


Subjects
Disease Outbreaks/prevention & control, Population Surveillance/methods, Automation, Genotyping Techniques, Hepacivirus/physiology, Hepatitis C/epidemiology, Hepatitis C/virology, Humans, Quality Control, Reference Standards, United States
3.
BMC Genomics ; 18(Suppl 4): 392, 2017 05 24.
Article in English | MEDLINE | ID: mdl-28589860

ABSTRACT

BACKGROUND: As crucial markers in identifying biological elements and processes in mammalian genomes, CpG islands (CGI) play important roles in DNA methylation, gene regulation, epigenetic inheritance, gene mutation, chromosome inactivation and nucleosome retention. The generally accepted criteria for a CGI are: (a) %G+C content is ≥ 50%, (b) the ratio of the observed CpG content to the expected CpG content is ≥ 0.6, and (c) the length of the CGI is greater than 200 nucleotides. Most existing computational methods for the prediction of CpG islands are built on these rules. However, many experimentally verified CpG islands deviate from these artificial criteria. Experiments indicate that in many cases %G+C is < 50%, CpG obs/CpG exp varies, and the length of a CGI ranges from eight nucleotides to a few thousand nucleotides. This implies that CGI detection is not a purely statistical task and that some as yet unrevealed rules are probably hidden. RESULTS: A novel Gaussian model, GaussianCpG, is developed for detection of CpG islands in the human genome. We analyze the energy distribution over genomic primary structure for each CpG site and adopt the parameters from statistics of the human genome. The evaluation results show that the new model can predict CpG islands efficiently by balancing both sensitivity and specificity over known human CGI data sets. Compared with other models, GaussianCpG can achieve better performance in CGI detection. CONCLUSIONS: Our Gaussian model aims to simplify the complex interaction between nucleotides. The model is computed not by the linear statistical method but by the Gaussian energy distribution and accumulation. The parameters of the Gaussian function are not arbitrarily designated but deliberately chosen by optimizing the biological statistics. By using the pseudopotential analysis on CpG islands, the novel model is validated on both real and artificial data sets.
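
The three classical rule-based criteria listed in this abstract can be sketched as a short check on a single sequence window. This is only an illustration of the conventional rules that the abstract contrasts with GaussianCpG, not the GaussianCpG model itself; the function names and the window-level interface are assumptions made for the example.

```python
def cpg_stats(seq: str) -> dict:
    """Return length, G+C fraction and observed/expected CpG ratio for one DNA window."""
    s = seq.upper()
    n = len(s)
    c = s.count("C")
    g = s.count("G")
    # "CG" cannot overlap itself, so str.count gives the exact number of CpG dinucleotides.
    cpg = s.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Observed/expected ratio: (#CpG * N) / (#C * #G); taken as 0 when C or G is absent.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return {"length": n, "gc_fraction": gc_fraction, "obs_exp": obs_exp}


def meets_classical_cgi_criteria(seq: str,
                                 min_gc: float = 0.5,
                                 min_obs_exp: float = 0.6,
                                 min_length: int = 200) -> bool:
    """Apply criteria (a)-(c): G+C >= 50%, obs/exp >= 0.6, length > 200 nucleotides."""
    st = cpg_stats(seq)
    return (st["length"] > min_length
            and st["gc_fraction"] >= min_gc
            and st["obs_exp"] >= min_obs_exp)
```

A real detector would slide such a window along the genome and merge adjacent hits; that step is omitted here.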


Subjects
CpG Islands/genetics, Human Genome/genetics, Whole Genome Sequencing, Humans, Normal Distribution
4.
BMC Genomics ; 17 Suppl 5: 542, 2016 08 31.
Article in English | MEDLINE | ID: mdl-27585456

ABSTRACT

BACKGROUND: Assessing pathway activity levels is a plausible way to quantify metabolic differences between various conditions. This is usually inferred from microarray expression data. Wide availability of NGS technology has triggered a demand for bioinformatics tools capable of analyzing pathway activity directly from RNA-Seq data. In this paper we introduce XPathway, a set of tools that compares pathway activity by analyzing the mapping of contigs assembled from RNA-Seq reads to KEGG pathways. The XPathway analysis of pathway activity is based on expectation maximization and topological properties of pathway graphs. RESULTS: XPathway tools have been applied to RNA-Seq data from the marine bryozoan Bugula neritina with and without its symbiotic bacterium "Candidatus Endobugula sertula". We successfully identified several metabolic pathways with differential activity levels. The expression of enzymes from the identified pathways has been further validated through quantitative PCR (qPCR). CONCLUSIONS: Our results show that XPathway is able to detect and quantify the metabolic difference between two samples. The software is implemented in C, Python and shell scripting and is capable of running on Linux/Unix platforms. The source code and installation instructions are available at http://alan.cs.gsu.edu/NGS/?q=content/xpathway .


Subjects
Metabolic Networks and Pathways, Transcriptome, Animals, Bryozoa/genetics, Bryozoa/metabolism, Computational Biology, RNA Sequence Analysis, Software, Symbiosis
5.
BMC Genomics ; 15 Suppl 8: S2, 2014.
Article in English | MEDLINE | ID: mdl-25435284

ABSTRACT

A major application of RNA-Seq is to perform differential gene expression analysis. Many tools exist to analyze differentially expressed genes in the presence of biological replicates. Frequently, however, RNA-Seq experiments have no or very few biological replicates and development of methods for detecting differentially expressed genes in these scenarios is still an active research area. In this paper we introduce a novel method, called IsoDE, for differential gene expression analysis based on bootstrapping. We compared IsoDE against four existing methods (Fisher's exact test, GFOLD, edgeR and Cuffdiff) on RNA-Seq datasets generated using three different sequencing technologies, both with and without replicates. Experiments on MAQC RNA-Seq datasets without replicates show that IsoDE has consistently high accuracy as defined by the qPCR ground truth, frequently higher than that of the compared methods, particularly for low coverage data and at lower fold change thresholds. In experiments on RNA-Seq datasets with up to 7 replicates, IsoDE has also achieved high accuracy. Furthermore, unlike GFOLD and edgeR, IsoDE accuracy varies smoothly with the number of replicates, and is relatively uniform across the entire range of gene expression levels. The proposed non-parametric method based on bootstrapping has practical running time, and achieves robust performance over a broad range of technologies, number of replicates, sequencing depths, and minimum fold change thresholds.
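
The general idea of bootstrapping reads to judge a fold change without replicates can be sketched as follows. This is a generic illustration and should not be read as the IsoDE procedure, whose details are not given in the abstract; the representation of reads as gene labels, the pseudocount, and every function and parameter name are assumptions made for the example.

```python
import random


def bootstrap_fold_change_support(reads_a, reads_b, gene,
                                  min_fold=2.0, n_boot=200, seed=0, pseudo=1.0):
    """Fraction of bootstrap replicates in which `gene` is at least `min_fold`
    times more abundant in condition B than in condition A.

    reads_a, reads_b: lists of gene labels, one label per mapped read (hypothetical input).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(n_boot):
        # Resample each library with replacement, keeping the library size.
        sample_a = rng.choices(reads_a, k=len(reads_a))
        sample_b = rng.choices(reads_b, k=len(reads_b))
        # Relative abundance with a pseudocount to avoid division by zero.
        frac_a = (sample_a.count(gene) + pseudo) / (len(sample_a) + pseudo)
        frac_b = (sample_b.count(gene) + pseudo) / (len(sample_b) + pseudo)
        if frac_b / frac_a >= min_fold:
            hits += 1
    return hits / n_boot
```

A gene could then be called differentially expressed when this support exceeds a chosen cutoff, which is a design choice of the sketch and not a value taken from the paper.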


Subjects
Genetic Databases, Gene Expression Profiling/methods, RNA Sequence Analysis/methods, Computational Biology, Software
6.
J Comput Biol ; 2024 Jun 27.
Article in English | MEDLINE | ID: mdl-38934087

ABSTRACT

Evaluating changes in metabolic pathway activity is essential for studying disease mechanisms and developing new treatments, with significant benefits extending to human health. Here, we propose EMPathways2, a maximum likelihood pipeline that is based on the expectation-maximization algorithm, which is capable of evaluating enzyme expression and metabolic pathway activity level. We first estimate enzyme expression from RNA-seq data that is used for simultaneous estimation of pathway activity levels using enzyme participation levels in each pathway. We implement the novel pipeline to RNA-seq data from several groups of mice, which provides a deeper look at the biochemical changes occurring as a result of bacterial infection, disease, and immune response. Our results show that estimated enzyme expression, pathway activity levels, and enzyme participation levels in each pathway are robust and stable across all samples. Estimated activity levels of a significant number of metabolic pathways strongly correlate with the infected and uninfected status of the respective rodent types.

7.
Sci Rep ; 13(1): 4154, 2023 03 13.
Article in English | MEDLINE | ID: mdl-36914815

ABSTRACT

The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome: millions of sequences and counting. This amount of data, while being orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses, is nonetheless a rich resource for machine learning (ML) approaches as alternatives for extracting such important information from these data. It is hence of utmost importance to design a framework for testing and benchmarking the robustness of these ML models. This paper makes the first effort (to our knowledge) to benchmark the robustness of ML models by simulating biological sequences with errors. In this paper, we introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show from experiments on a wide array of ML models that some simulation-based approaches with different perturbation budgets are more robust (and accurate) than others for specific embedding methods to certain noise simulations on the input sequences. Our benchmarking framework may assist researchers in properly assessing different ML models and help them understand the behavior of the SARS-CoV-2 virus or avoid possible future pandemics.
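
The simplest kind of perturbation, uniform random substitutions at a fixed rate, can be sketched as below. This is an assumed baseline for illustration only: it does not model the platform-specific Illumina or PacBio error profiles mentioned in the abstract, and the function name and parameters are hypothetical.

```python
import random


def add_substitution_errors(seq: str, error_rate: float,
                            seed=None, alphabet: str = "ACGT") -> str:
    """Return a copy of `seq` in which each position is replaced, with
    probability `error_rate`, by a different symbol drawn from `alphabet`."""
    rng = random.Random(seed)
    out = []
    for base in seq:
        if rng.random() < error_rate:
            # Substitute with any symbol other than the original one.
            out.append(rng.choice([b for b in alphabet if b != base]))
        else:
            out.append(base)
    return "".join(out)
```

Here `error_rate` plays the role of a perturbation budget; insertions and deletions, which dominate long-read error profiles, would need a separate step.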


Subjects
Computer Simulation, Viral Genome, Machine Learning, Research Design, SARS-CoV-2, Machine Learning/standards, SARS-CoV-2/classification, SARS-CoV-2/genetics, Viral Genome/genetics, Viral Proteins/genetics, COVID-19/virology, RNA Sequence Analysis
8.
Biomolecules ; 13(6)2023 06 02.
Article in English | MEDLINE | ID: mdl-37371514

ABSTRACT

The emergence of third-generation single-molecule sequencing (TGS) technology has revolutionized the generation of long reads, which are essential for genome assembly and have been widely employed in sequencing the SARS-CoV-2 virus during the COVID-19 pandemic. Although long-read sequencing has been crucial in understanding the evolution and transmission of the virus, the high error rate associated with these reads can lead to inadequate genome assembly and downstream biological interpretation. In this study, we evaluate the accuracy and robustness of machine learning (ML) models using six different embedding techniques on SARS-CoV-2 error-incorporated genome sequences. Our analysis includes two types of error-incorporated genome sequences: those generated using simulation tools to emulate error profiles of long-read sequencing platforms and those generated by introducing random errors. We show that the spaced k-mers embedding method achieves high accuracy in classifying error-free SARS-CoV-2 genome sequences, and the spaced k-mers and weighted k-mers embedding methods are highly accurate in predicting error-incorporated sequences. The fixed-length vectors generated by these methods contribute to the high accuracy achieved. Our study provides valuable insights for researchers to effectively evaluate ML models and gain a better understanding of the approach for accurate identification of critical SARS-CoV-2 genome sequences.


Subjects
COVID-19, SARS-CoV-2, Humans, SARS-CoV-2/genetics, DNA Sequence Analysis/methods, Pandemics, High-Throughput Nucleotide Sequencing/methods, Algorithms, Machine Learning
9.
Article in English | MEDLINE | ID: mdl-36103437

ABSTRACT

Machine learning (ML) models, such as SVM, for tasks like classification and clustering of sequences, require a definition of distance/similarity between pairs of sequences. Several methods have been proposed to compute the similarity between sequences, such as the exact approach that counts the number of matches between k-mers (sub-sequences of length k) and an approximate approach that estimates pairwise similarity scores. Although exact methods yield better classification performance, they pose high computational costs, limiting their applicability to a small number of sequences. The approximate algorithms are proven to be more scalable and perform comparably to (sometimes better than) the exact methods; they are designed in a "general" way to deal with different types of sequences (e.g., music, protein, etc.). Although general applicability is a desired property of an algorithm, it is not the case in all scenarios. For example, in the current COVID-19 (coronavirus) pandemic, there is a need for an approach that can deal specifically with the coronavirus. To this end, we propose a series of ways to improve the performance of the approximate kernel (using minimizers and information gain) in order to enhance its predictive performance on coronavirus sequences. More specifically, we improve the quality of the approximate kernel using domain knowledge (computed using information gain) and efficient preprocessing (using minimizers computation) to classify coronavirus spike protein sequences corresponding to different variants (e.g., Alpha, Beta, Gamma). We report results using different classification and clustering algorithms and evaluate their performance using multiple evaluation metrics. Using two datasets, we show that our proposed method helps improve the kernel's performance compared to the baseline and state-of-the-art approaches in the healthcare domain.

10.
J Comput Biol ; 28(8): 842-855, 2021 08.
Article in English | MEDLINE | ID: mdl-34264744

ABSTRACT

In this article, we present our novel pipeline for analysis of metabolic activity using a microbial community's metatranscriptome sequence data set for validation. Our method is based on the expectation-maximization (EM) algorithm and provides enzyme expression and pathway activity levels. Further expanding our analysis, we consider individual enzymatic activity and compute enzyme participation coefficients to approximate the metabolic pathway activity more accurately. We apply our EM pathways pipeline to a metatranscriptomic data set of a plankton community from surface waters of the Northern Gulf of Mexico. The data set consists of RNA-seq data and respective environmental parameters, which were sampled at two depths, six times a day over multiple 24-hour cycles. Furthermore, we discuss microbial dependence on the day-night cycle within our findings based on a three-way correlation of the enzyme expression during antipodal times, midnight and noon. We show that the enzyme participation levels strongly affect the metabolic activity estimates: that is, marginal and multiple linear regression of enzymatic and metabolic pathway activity correlated significantly with the recorded environmental parameters. Our analysis statistically validates that EM-based methods produce meaningful results, as our method confirms statistically significant dependence of metabolic pathway activity on the environmental parameters, such as salinity, temperature, brightness, and a few others.


Subjects
Bacteria/genetics, Gene Expression Profiling/methods, Metabolic Networks and Pathways, Plankton/microbiology, Algorithms, Gulf of Mexico, Linear Models, Metagenomics, RNA Sequence Analysis
11.
Bioinformatics ; 25(15): 1989-90, 2009 Aug 01.
Article in English | MEDLINE | ID: mdl-19414533

ABSTRACT

SUMMARY: The accumulation of high-throughput genomic, proteomic and metabolic data allows for increasingly accurate modeling and reconstruction of metabolic networks. Alignment of the reconstructed networks can help to catch model inconsistencies and infer missing elements. In this note, we present the web service tool MetNetAligner, which aligns metabolic networks, taking into account the similarity of network topology and the enzymes' functions. It can be used for predicting unknown pathways, comparing and finding conserved patterns and resolving ambiguous identification of enzymes. The tool supports several alignment options including allowing or forbidding enzyme deletion and insertion. It is based on a novel scoring scheme which measures enzyme-to-enzyme functional similarity and a fast algorithm which efficiently finds optimal mappings from a directed graph with restricted cyclic structure to an arbitrary directed graph. AVAILABILITY: MetNetAligner is available as a web server at: http://alla.cs.gsu.edu:8080/MinePW/pages/gmapping/GMMain.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology/methods, Metabolic Networks and Pathways, Sequence Alignment/methods, Software, Factual Databases, Internet
12.
J Comput Biol ; 15(1): 81-90, 2008.
Article in English | MEDLINE | ID: mdl-18199025

ABSTRACT

Accessibility of high-throughput genotyping technology allows genome-wide association studies for common complex diseases. This paper addresses two challenges commonly facing such studies: (i) searching an enormous number of possible gene interactions and (ii) finding reproducible associations. These challenges have traditionally been addressed in statistics, while here we apply computational approaches: optimization and cross-validation. A complex risk factor is modeled as a subset of single nucleotide polymorphisms (SNPs) with specified alleles, and the optimization formulation asks for the one with the maximum odds ratio. To measure and compare the ability of search methods to find reproducible risk factors, we propose to apply a cross-validation scheme usually used for prediction validation. We have applied and cross-validated known search methods with proposed enhancements on real case-control studies for several diseases (Crohn's disease, autoimmune disorder, tick-borne encephalitis, lung cancer, and rheumatoid arthritis). The proposed methods compare favorably to exhaustive search: they are faster, find statistically significant risk factors more frequently, and have a significantly higher leave-half-out cross-validation rate.


Subjects
Case-Control Studies, Computational Biology/methods, Genetic Predisposition to Disease, Confidence Intervals, Genetic Databases, Humans, Mutation/genetics, Odds Ratio, Single Nucleotide Polymorphism/genetics, Reproducibility of Results, Software
13.
Article in English | MEDLINE | ID: mdl-18451440

ABSTRACT

Emerging microarray technologies allow affordable typing of very long genome sequences. A key challenge in analyzing such a huge amount of data is scalable and accurate computational inference of haplotypes (i.e., splitting of each genotype into a pair of corresponding haplotypes). In this paper, we first phase genotypes consisting only of two SNPs using genotype frequencies adjusted to the random mating model and then extend phasing of two-SNP genotypes to phasing of complete genotypes using maximum spanning trees. Runtime of the proposed 2SNP algorithm is O(nm(n + log m)), where n and m are the numbers of genotypes and SNPs, respectively, and it can handle genotypes spanning entire chromosomes in a matter of hours. On datasets across 23 chromosomal regions from HapMap[11], 2SNP is several orders of magnitude faster than GERBIL and PHASE while matching them in quality measured by the number of correctly phased genotypes, single-site and switching errors. For example, the 2SNP software phases an entire chromosome (10^5 SNPs from HapMap) for 30 individuals in 2 hours with an average switching error of 7.7%. We have also enhanced the 2SNP algorithm to phase family trio data and compared it with four other well-known phasing methods on simulated data from [15]. 2SNP is much faster than all of them while losing in quality only to PHASE. 2SNP software is publicly available at http://alla.cs.gsu.edu/~software/2SNP.


Subjects
Algorithms, Single Nucleotide Polymorphism, Computational Biology, Nucleic Acid Databases, Female, Genotype, Haplotypes, Humans, Male, Genetic Models, Oligonucleotide Array Sequence Analysis/statistics & numerical data, Software
17.
J Comput Biol ; 14(7): 927-49, 2007 Sep.
Article in English | MEDLINE | ID: mdl-17803371

ABSTRACT

In this paper, we introduce a new method of combined synthesis and inference of biological signal transduction networks. A main idea of our method lies in representing observed causal relationships as network paths and using techniques from combinatorial optimization to find the sparsest graph consistent with all experimental observations. Our contributions are twofold: (a) We formalize our approach, study its computational complexity and prove new results for exact and approximate solutions of the computationally hard transitive reduction substep of the approach (Sections 2 and 5). (b) We validate the biological usability of our approach by successfully applying it to a previously published signal transduction network by Li et al. (2006) and show that our algorithm for the transitive reduction substep performs well on graphs with a structure similar to those observed in transcriptional regulatory and signal transduction networks.
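
The transitive reduction substep can be illustrated on the easy special case of a directed acyclic graph, where the sparsest graph preserving all reachability is unique and can be found in polynomial time. This sketch is an assumption-laden simplification: the abstract describes the general substep as computationally hard, and the paper's exact and approximate algorithms are not reproduced here; the function name and edge-set representation are hypothetical.

```python
def transitive_reduction(edges):
    """Transitive reduction of a DAG given as a set of (u, v) edges.

    An edge (u, v) is dropped when v is still reachable from u through
    some other path, so every causal relationship remains implied.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)

    def reachable_without(src, dst, skipped_edge):
        # Depth-first search from src to dst that ignores one edge.
        stack, seen = [src], {src}
        while stack:
            x = stack.pop()
            for y in adjacency.get(x, ()):
                if (x, y) == skipped_edge:
                    continue
                if y == dst:
                    return True
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return False

    return {(u, v) for (u, v) in edges if not reachable_without(u, v, (u, v))}
```

With cycles present, redundant edges can no longer be removed independently of one another, which is one reason the general problem is harder than this sketch suggests.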


Subjects
Algorithms, Protein Interaction Mapping, Signal Transduction, Computer Simulation, Mathematics, Biological Models, Reproducibility of Results
18.
Bioinformatics ; 22(20): 2558-61, 2006 Oct 15.
Article in English | MEDLINE | ID: mdl-16895924

ABSTRACT

The search for the association between complex diseases and single nucleotide polymorphisms (SNPs) or haplotypes has recently received great attention. For these studies, it is essential to use a small subset of informative SNPs accurately representing the rest of the SNPs. Informative SNP selection can achieve (1) considerable budget savings by genotyping only a limited number of SNPs and computationally inferring all other SNPs or (2) necessary reduction of the huge SNP sets (obtained, e.g., from Affymetrix) for further fine haplotype analysis. A novel informative SNP selection method for unphased genotype data based on multiple linear regression (MLR) is implemented in the software package MLR-tagging. This software can be used for informative SNP (tag) selection and genotype prediction. The stepwise tag selection algorithm (STSA) selects positions of the given number of informative SNPs based on a genotype sample population. The MLR SNP prediction algorithm predicts a complete genotype based on the values of its informative SNPs, their positions among all SNPs, and a sample of complete genotypes. An extensive experimental study on various datasets including 10 regions from HapMap shows that the MLR prediction combined with stepwise tag selection uses fewer tags than the state-of-the-art method of Halperin et al. (2005). AVAILABILITY: MLR-Tagging software package is publicly available at http://alla.cs.gsu.edu/~software/tagging/tagging.html


Subjects
Chromosome Mapping/methods, DNA Mutational Analysis/methods, Expressed Sequence Tags, Genotype, Genetic Models, Single Nucleotide Polymorphism/genetics, Software, Algorithms, Base Sequence, Computer Simulation, Linear Models, Molecular Sequence Data, Regression Analysis
19.
IEEE Trans Nanobioscience ; 6(1): 60-7, 2007 Mar.
Article in English | MEDLINE | ID: mdl-17393851

ABSTRACT

The search for the association between complex diseases and single nucleotide polymorphisms (SNPs) or haplotypes has recently received great attention. For these studies, it is essential to use a small subset of informative SNPs, i.e., tag SNPs, accurately representing the rest of the SNPs. Tag SNP selection can achieve: 1) considerable budget savings by genotyping only a limited number of SNPs and computationally inferring all other SNPs or 2) necessary reduction of the huge SNP sets (obtained, e.g., from Affymetrix) for further fine haplotype analysis. In this paper, we show that tag SNP selection strongly depends on how the chosen tags will be used: the advantage of one tag set over another can only be considered with respect to a certain prediction method. We show how to separate tag selection from SNP prediction and propose greedy and local-minimization algorithms for tag SNP selection. We give two novel approaches to SNP prediction based on multiple linear regression (MLR) and support vector machines (SVMs). An extensive experimental study on various datasets including ten regions from the HapMap project shows that the MLR prediction combined with stepwise tag selection uses fewer tags than the state-of-the-art method of Halperin et al. The MLR-based method also uses on average 30% fewer tags than IdSelect for statistically covering all SNPs. The tag selection based on SVM SNP prediction uses fewer tags to achieve the same prediction accuracy as the methods of Halldorsson et al.


Subjects
Algorithms, DNA Mutational Analysis/methods, Expressed Sequence Tags, Haplotypes/genetics, Linkage Disequilibrium/genetics, Single Nucleotide Polymorphism/genetics, Sequence Alignment/methods, Artificial Intelligence, Base Sequence, Computer Simulation, Genotype, Genetic Models, Statistical Models, Molecular Sequence Data, Automated Pattern Recognition