1.
PLoS Comput Biol ; 16(11): e1008415, 2020 Nov.
Article in English | MEDLINE | ID: mdl-33175836

ABSTRACT

Small non-coding RNAs (ncRNAs) are short non-coding sequences involved in gene regulation in many biological processes and diseases. Their biological functions are still incompletely understood, especially at genome-wide scale, which has created demand for new computational approaches to annotate their roles. Secondary structure is widely regarded as a key determinant of RNA function, and machine learning approaches have successfully predicted RNA function from secondary structure information. Here we show that RNA function can be predicted with good accuracy from a lightweight representation of sequence information, without computing secondary structure features, which is computationally expensive. This finding appears to go against the dogma of secondary structure being a key determinant of function in RNA. Compared to recent secondary structure based methods, the proposed solution is more robust to sequence boundary noise and drastically reduces the computational cost, allowing annotation of large data volumes. Scripts and datasets to reproduce the results of experiments proposed in this study are available at: https://github.com/bioinformatics-sannio/ncrna-deep.
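
The abstract does not say which sequence representation is used. As a rough illustration only, the sketch below assumes a fixed-length one-hot encoding, which is one common lightweight way to feed raw sequence to a neural network without any structure prediction. The function name `one_hot_encode`, the `ACGU` alphabet order, and the zero-padding convention are assumptions made for this example, not details taken from the paper or its repository.

```python
# Minimal sketch (assumption): one-hot encoding of an RNA sequence.
# Nothing here is taken from the ncrna-deep repository.

ALPHABET = "ACGU"  # assumed column order


def one_hot_encode(seq, length):
    """Return a `length` x 4 matrix of 0/1 values.

    DNA-style 'T' is read as 'U'. Sequences longer than `length` are
    truncated; shorter ones are padded with all-zero rows. Symbols
    outside ACGU (for example 'N') also become all-zero rows.
    """
    normalized = seq.upper().replace("T", "U")[:length]
    rows = [[1 if base == letter else 0 for letter in ALPHABET]
            for base in normalized]
    rows.extend([0, 0, 0, 0] for _ in range(length - len(rows)))
    return rows


if __name__ == "__main__":
    for row in one_hot_encode("ACGUN", 7):
        print(row)
```

Because the encoding needs only the sequence itself, its cost grows linearly with sequence length, which is the kind of saving the abstract contrasts with computing secondary structure features.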


Subjects
Deep Learning; RNA, Untranslated/genetics; RNA, Untranslated/physiology; Computational Biology; Databases, Nucleic Acid/statistics & numerical data; High-Throughput Nucleotide Sequencing/statistics & numerical data; Humans; Monte Carlo Method; Neural Networks, Computer; Nucleic Acid Conformation; RNA, Untranslated/chemistry; Sequence Analysis, RNA/statistics & numerical data; Exome Sequencing/statistics & numerical data
2.
Brief Bioinform ; 20(1): 288-298, 2019 Jan 18.
Article in English | MEDLINE | ID: mdl-29028903

ABSTRACT

RNA sequencing (RNA-seq) has become a standard procedure to investigate transcriptional changes between conditions and is routinely used in research and clinics. While standard differential expression (DE) analysis between two conditions has been extensively studied and improved over the past decades, RNA-seq time course (TC) DE analysis algorithms are still in their early stages. In this study, we compare, for the first time, existing TC RNA-seq tools on an extensive simulation data set and validate the best performing tools on published data. Surprisingly, TC tools were outperformed by the classical pairwise comparison approach on short time series (<8 time points) in terms of overall performance and robustness to noise, mostly because of a high number of false positives, with the exception of ImpulseDE2. Overlapping the candidate lists of several tools mitigated this shortcoming, as the majority of false-positive, but not true-positive, candidates were unique to each method. On longer time series, the pairwise approach performed worse overall than splineTC and maSigPro, which did not identify any false-positive candidates.
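
The overlap step described in the abstract can be pictured as a simple vote across tools. The sketch below is only an illustration under assumptions: the function `consensus_candidates`, the `min_tools` threshold, and the toy gene and tool names are hypothetical and do not come from the study.

```python
from collections import Counter


def consensus_candidates(candidate_lists, min_tools=2):
    """Keep genes reported by at least `min_tools` tools.

    `candidate_lists` maps a tool name to the genes it called
    differentially expressed. Duplicates within one tool count once.
    """
    votes = Counter(gene
                    for genes in candidate_lists.values()
                    for gene in set(genes))
    return {gene for gene, count in votes.items() if count >= min_tools}


if __name__ == "__main__":
    # Toy input with made-up tool and gene names.
    calls = {
        "tool_A": ["g1", "g2", "g7"],
        "tool_B": ["g1", "g2", "g9"],
        "tool_C": ["g1", "g4"],
    }
    print(sorted(consensus_candidates(calls)))
    print(sorted(consensus_candidates(calls, min_tools=3)))
```

If false positives are mostly unique to one tool, as the abstract reports, a threshold of two or more tools removes them while keeping candidates that several methods agree on.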


Subjects
Gene Expression Profiling/methods; Sequence Analysis, RNA/methods; Bayes Theorem; Computational Biology/methods; Computer Simulation; Databases, Nucleic Acid/statistics & numerical data; Gene Expression Profiling/statistics & numerical data; Humans; Markov Chains; Models, Statistical; Molecular Sequence Annotation/statistics & numerical data; Sequence Analysis, RNA/statistics & numerical data; Signal-To-Noise Ratio; Software; Time Factors
3.
Brief Bioinform ; 20(4): 1222-1237, 2019 Jul 19.
Article in English | MEDLINE | ID: mdl-29220512

ABSTRACT

MOTIVATION: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences. RESULTS: We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover's distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover's distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours. AVAILABILITY: The source code of the benchmarking tool is available as Supplementary Materials.
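
To make the idea of statistics computed on two k-mer histograms concrete, here is a small sketch. The abstract does not list which statistics were evaluated, so the choice of Manhattan distance and cosine similarity is an assumption, and `length_weighted_distance` is a hypothetical multiplicative combination written only to show the pattern of pairing a histogram statistic with the length difference.

```python
from collections import Counter
from math import sqrt


def kmer_histogram(seq, k):
    """Count every overlapping k-mer in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def manhattan(h1, h2):
    """Sum of absolute count differences over all k-mers in either histogram."""
    return sum(abs(h1[w] - h2[w]) for w in h1.keys() | h2.keys())


def cosine_similarity(h1, h2):
    """Cosine of the angle between the two count vectors (0.0 if one is empty)."""
    dot = sum(h1[w] * h2[w] for w in h1.keys() & h2.keys())
    norm1 = sqrt(sum(c * c for c in h1.values()))
    norm2 = sqrt(sum(c * c for c in h2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def length_weighted_distance(a, b, k):
    """Hypothetical paired statistic: Manhattan distance scaled by
    (1 + length difference). Not a statistic named in the survey."""
    distance = manhattan(kmer_histogram(a, k), kmer_histogram(b, k))
    return distance * (1 + abs(len(a) - len(b)))


if __name__ == "__main__":
    query, target = "ACGTACGTAC", "ACGTTCGTAC"
    hq, ht = kmer_histogram(query, 3), kmer_histogram(target, 3)
    print(manhattan(hq, ht), cosine_similarity(hq, ht))
    print(length_weighted_distance(query, target, 3))
```

Each histogram is built in one pass over the sequence, so comparing two sequences costs time linear in their lengths, in contrast to the quadratic cost of alignment mentioned in the abstract.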


Subjects
Computational Biology/methods; Models, Statistical; Sequence Analysis, DNA/statistics & numerical data; Algorithms; Databases, Nucleic Acid/statistics & numerical data; High-Throughput Nucleotide Sequencing/statistics & numerical data; Humans; Markov Chains; Sequence Alignment/statistics & numerical data
4.
Pac Symp Biocomput ; 21: 456-67, 2016.
Article in English | MEDLINE | ID: mdl-26776209

ABSTRACT

Small non-coding RNAs (sRNAs) are regulatory RNA molecules that have been identified in a multitude of bacterial species and shown to control numerous cellular processes through various regulatory mechanisms. In the last decade, next generation RNA sequencing (RNA-seq) has been used for the genome-wide detection of bacterial sRNAs. Here we describe sRNA-Detect, a novel approach to identify expressed small transcripts from prokaryotic RNA-seq data. Using RNA-seq data from three bacterial species and two sequencing platforms, we performed a comparative assessment of five computational approaches for the detection of small transcripts. We demonstrate that sRNA-Detect improves upon current standalone computational approaches for identifying novel small transcripts in bacteria.


Subjects
High-Throughput Nucleotide Sequencing/statistics & numerical data; RNA, Bacterial/genetics; RNA, Small Untranslated/genetics; Sequence Analysis, RNA/statistics & numerical data; Algorithms; Base Sequence; Computational Biology/methods; Computational Biology/statistics & numerical data; Databases, Nucleic Acid/statistics & numerical data; Deinococcus/genetics; Erwinia amylovora/genetics; Markov Chains; Rhodobacter capsulatus/genetics; Software; Software Design
5.
J Bioinform Comput Biol ; 13(2): 1550004, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25491390

ABSTRACT

To apply digital signal processing (DSP) methods to DNA sequences, the sequences must first be mapped into numerical sequences. Effective numerical mappings of DNA sequences therefore play a key role in the effectiveness of DSP-based methods such as exon prediction. Despite numerous mappings of symbolic DNA sequences to numerical series, the existing mapping methods do not include the genetic coding features of DNA sequences. We present a novel numerical representation of DNA sequences using genetic codon context (GCC), in which the numerical values are optimized by simulated annealing to maximize the 3-periodicity signal to noise ratio (SNR). The optimized GCC representation is then applied to exon and intron prediction using a Short-Time Fourier Transform (STFT) approach. The results show that the GCC method enhances the SNR values of exon sequences and thus increases the accuracy of predicting protein coding regions in genomes compared with the commonly used 4D binary representation. In addition, this study offers a novel way to reveal specific features of DNA sequences by optimizing numerical mappings of symbolic DNA sequences.
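
As background for the 3-periodicity SNR, the sketch below computes it for the baseline 4D binary (indicator) representation that the abstract compares against. It does not implement the GCC mapping or the simulated annealing step. The SNR definition used here (power at frequency N/3 divided by the mean power over all non-zero frequencies) is one common convention and is an assumption, as are the function names.

```python
import cmath


def binary_indicators(seq):
    """4D binary mapping: one 0/1 indicator sequence per nucleotide."""
    return [[1.0 if base == letter else 0.0 for base in seq]
            for letter in "ACGT"]


def dft_power(signal, k):
    """Squared magnitude of the k-th DFT coefficient of `signal`."""
    n = len(signal)
    coeff = sum(value * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, value in enumerate(signal))
    return abs(coeff) ** 2


def total_power(seq, k):
    """Power at frequency index k, summed over the four indicator sequences."""
    return sum(dft_power(signal, k) for signal in binary_indicators(seq))


def period3_snr(seq):
    """Assumed definition: power at N/3 over mean power at k = 1..N-1."""
    seq = seq.upper()
    seq = seq[:len(seq) - len(seq) % 3]  # trim so that N/3 is an integer
    n = len(seq)
    if n < 3:
        raise ValueError("need at least 3 nucleotides")
    mean_power = sum(total_power(seq, k) for k in range(1, n)) / (n - 1)
    if mean_power < 1e-12:  # e.g. a homopolymer: no non-DC signal at all
        return 0.0
    return total_power(seq, n // 3) / mean_power


if __name__ == "__main__":
    print(period3_snr("ATG" * 10))   # strongly period-3 toy sequence
    print(period3_snr("ACGT" * 6))   # period-4 toy sequence
```

A mapping such as GCC would replace `binary_indicators` with optimized numerical values; the SNR computation itself would stay the same under this definition.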


Subjects
Codon/genetics; DNA/genetics; Sequence Analysis, DNA/statistics & numerical data; Algorithms; Animals; Base Sequence; Computational Biology; Databases, Nucleic Acid/statistics & numerical data; Exons; Fourier Analysis; Genetic Code; Humans; Introns; Monte Carlo Method; Open Reading Frames; Signal Processing, Computer-Assisted; Signal-To-Noise Ratio
6.
Biostatistics ; 14(3): 600-11, 2013 Jul.
Article in English | MEDLINE | ID: mdl-23428932

ABSTRACT

Copy number variations (CNVs) are a significant source of genetic variation and have frequently been found to be associated with diseases such as cancers and autism. High-throughput sequencing data are increasingly being used to detect and quantify CNVs; however, the distributional properties of the data are not fully understood. A hidden Markov model (HMM) is proposed using inhomogeneous emission distributions based on negative binomial regression to account for the sequencing biases. The model is tested on whole genome sequencing data and simulated data sets. An algorithm for CNV detection is implemented in the R package CNVfinder. The model based on negative binomial regression is shown to provide a good fit to the data and competitive performance compared with methods based on normalization of read counts.
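
The following is a minimal sketch of the general idea of an HMM with position-specific negative binomial emissions. It is not the CNVfinder algorithm. It assumes that the expected diploid read count of each window (`base_means`, for instance fitted by a regression on GC content) is already available, that copy number scales this mean linearly, and that transitions are controlled by a single `stay` probability. All names and default values are hypothetical.

```python
from math import lgamma, log


def nb_logpmf(y, mu, size):
    """Log-probability of count y under a negative binomial with
    mean `mu` and dispersion parameter `size` (variance mu + mu^2/size)."""
    return (lgamma(y + size) - lgamma(size) - lgamma(y + 1)
            + size * log(size / (size + mu))
            + y * log(mu / (size + mu)))


def viterbi_copy_number(counts, base_means, states=(1, 2, 3),
                        size=50.0, stay=0.95):
    """Most likely copy-number path for per-window read counts.

    The emission for window t in copy-number state c is negative
    binomial with mean base_means[t] * c / 2, so emissions differ
    from window to window (inhomogeneous emissions).
    """
    n_states = len(states)
    switch = (1.0 - stay) / (n_states - 1)
    log_trans = [[log(stay) if i == j else log(switch)
                  for j in range(n_states)] for i in range(n_states)]

    def emit(t, s):
        return nb_logpmf(counts[t], base_means[t] * states[s] / 2.0, size)

    score = [log(1.0 / n_states) + emit(0, s) for s in range(n_states)]
    backpointers = []
    for t in range(1, len(counts)):
        new_score, pointers = [], []
        for j in range(n_states):
            best = max(range(n_states),
                       key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best)
            new_score.append(score[best] + log_trans[best][j] + emit(t, j))
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [states[s] for s in path]


if __name__ == "__main__":
    counts = [100] * 10 + [150] * 10       # toy read counts per window
    base_means = [100.0] * 20              # assumed diploid expectation
    print(viterbi_copy_number(counts, base_means))
```

Passing window-specific `base_means` is what lets the emission distribution absorb sequencing biases instead of normalizing the read counts beforehand.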


Subjects
DNA Copy Number Variations; Models, Genetic; Models, Statistical; Algorithms; Binomial Distribution; Biostatistics; Computer Simulation; Databases, Nucleic Acid/statistics & numerical data; High-Throughput Nucleotide Sequencing/statistics & numerical data; Humans; Markov Chains; Software
7.
Biostatistics ; 14(2): 244-58, 2013 Apr.
Article in English | MEDLINE | ID: mdl-23074263

ABSTRACT

Motivated by studying the association between nutrient intake and human gut microbiome composition, we developed a method for structure-constrained sparse canonical correlation analysis (ssCCA) in a high-dimensional setting. ssCCA takes into account the phylogenetic relationships among bacteria, which provide important prior knowledge on evolutionary relationships among bacterial taxa. Our ssCCA formulation uses a phylogenetic structure-constrained penalty function to impose smoothness on the linear coefficients according to the phylogenetic relationships among the taxa. An efficient coordinate descent algorithm is developed for optimization. A human gut microbiome data set is used to illustrate this method. Both simulations and real data applications show that ssCCA performs better than the standard sparse CCA in identifying meaningful variables when there are structures in the data.


Subjects
Metagenome/genetics; Algorithms; Bacteria/classification; Bacteria/genetics; Biostatistics; Computer Simulation; DNA, Bacterial/genetics; Data Interpretation, Statistical; Databases, Nucleic Acid/statistics & numerical data; Digestive System/microbiology; Genome, Bacterial; Humans; Monte Carlo Method; Phylogeny
8.
Biometrics ; 68(3): 774-83, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22260651

ABSTRACT

DNA methylation has emerged as an important hallmark of epigenetics. Numerous platforms, including tiling arrays and next generation sequencing, and experimental protocols are available for profiling DNA methylation. Like other tiling array data, DNA methylation data share an inherent correlation structure among nearby probes. However, unlike gene expression or protein DNA binding data, the varying CpG density, which gives rise to the definitions of CpG island, shore and shelf, provides exogenous information for detecting differential methylation. This article introduces a robust testing and probe ranking procedure based on a nonhomogeneous hidden Markov model that incorporates the above-mentioned features for detecting differential methylation. We revisit the seminal work of Sun and Cai (2009, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71, 393-424) and propose modeling the nonnull using a nonparametric symmetric distribution in two-sided hypothesis testing. We show that this model improves probe ranking and is robust to model misspecification based on extensive simulation studies. We further illustrate that our proposed framework achieves good operating characteristics compared with commonly used methods on real DNA methylation data, where the aim is to detect differentially methylated sites.


Subjects
Biometry/methods; DNA Methylation; Models, Statistical; CpG Islands; Databases, Nucleic Acid/statistics & numerical data; Epigenesis, Genetic; Humans; Markov Chains; Models, Genetic; Mutation; Oligonucleotide Array Sequence Analysis/statistics & numerical data; Probability
9.
J Bioinform Comput Biol ; 9(1): 131-48, 2011 Feb.
Article in English | MEDLINE | ID: mdl-21328710

ABSTRACT

DNA copy number (DCN) is the number of copies of DNA at a region of a genome. Alterations of DCN are highly associated with the development of different tumors. Recently, microarray technologies have been employed to detect DCN changes at many loci simultaneously in tumor samples. The resulting DCN data are often very noisy, and the tumor sample is often contaminated by normal cells. The goal of computational analysis of array-based DCN data is to infer the underlying DCNs from raw DCN data. Previous methods for this task do not model the tumor/normal cell mixture ratio explicitly, and they cannot output segments with DCN annotations. We developed a novel model-based method using the minimum description length (MDL) principle for DCN data segmentation. Our new method can output the underlying DCN for each chromosomal segment and, at the same time, infer the underlying tumor proportion in the test samples. Empirical results show that our method achieves better accuracy on average than three previous methods, namely Circular Binary Segmentation, Hidden Markov Model and Ultrasome.


Subjects
DNA Copy Number Variations; Algorithms; Computational Biology; Computer Simulation; DNA, Neoplasm/genetics; Data Interpretation, Statistical; Databases, Nucleic Acid/statistics & numerical data; Humans; Markov Chains; Models, Statistical; Neoplasms/genetics; Oligonucleotide Array Sequence Analysis/statistics & numerical data; Software