Results 1 - 8 of 8
1.
BMC Bioinformatics ; 18(Suppl 12): 414, 2017 Oct 16.
Article in English | MEDLINE | ID: mdl-29072140

ABSTRACT

BACKGROUND: Homology search is still a significant step in functional analysis for genomic data. Profile Hidden Markov Model-based homology search has been widely used in protein domain analysis in many different species. In particular, with the fast accumulation of transcriptomic data of non-model species and metagenomic data, profile homology search is widely adopted in integrated pipelines for functional analysis. While the state-of-the-art tool HMMER has achieved high sensitivity and accuracy in domain annotation, the sensitivity of HMMER on short reads declines rapidly. The low sensitivity on short read homology search can lead to inaccurate domain composition and abundance computation. Our experimental results showed that half of the reads were missed by HMMER for an RNA-Seq dataset. Thus, there is a need for better methods to improve the homology search performance for short reads. RESULTS: We introduce a profile homology search tool named Short-Pair that is designed for short paired-end reads. By using an approximate Bayesian approach employing the distribution of fragment lengths and alignment scores, Short-Pair can retrieve the missing end and determine true domains. In particular, Short-Pair increases the accuracy in aligning short reads that are part of remote homologs. We applied Short-Pair to an RNA-Seq dataset and a metagenomic dataset and quantified its sensitivity and accuracy on homology search. The experimental results show that Short-Pair can achieve better overall performance than the state-of-the-art methodology of profile homology search. CONCLUSIONS: Short-Pair is best used for next-generation sequencing (NGS) data that lack reference genomes. It provides a complementary paired-end read homology search tool to HMMER. The source code is freely available at https://sourceforge.net/projects/short-pair/ .


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Sequence Homology, Nucleic Acid , Amino Acid Sequence , Arabidopsis/genetics , Bayes Theorem , Metagenomics , ROC Curve , Sequence Alignment , Sequence Analysis, RNA , Software , Time Factors
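
The abstract above says that fragment-length distribution and alignment scores are combined in an approximate Bayesian way to recover a missing read end. The sketch below is only a simplified illustration of that general idea and is not Short-Pair's actual model. It assumes a Gaussian fragment-length distribution, a uniform prior over candidates, and a bit score read as a log2 odds ratio. The function name `rank_mate_placements` and the candidate tuple layout are hypothetical.

```python
import math


def rank_mate_placements(candidates, frag_mean, frag_sd):
    """Rank candidate placements of a missing read end.

    candidates: list of (label, bit_score, implied_fragment_length).
    Assumptions (not taken from the paper): fragment lengths are Gaussian
    with the given mean and standard deviation, the prior over candidates
    is uniform, and bit_score is a log2 odds ratio.
    Returns (label, posterior) pairs sorted by decreasing posterior.
    """
    if not candidates:
        return []
    log_weights = []
    for _label, bit_score, frag_len in candidates:
        z = (frag_len - frag_mean) / frag_sd
        # Log of (score likelihood ratio) x (fragment-length density).
        # The Gaussian normalising constant is shared and cancels out.
        log_weights.append(bit_score * math.log(2.0) - 0.5 * z * z)
    # Subtract the maximum before exponentiating to avoid overflow.
    top = max(log_weights)
    unnormalised = [math.exp(v - top) for v in log_weights]
    total = sum(unnormalised)
    ranked = [(c[0], u / total) for c, u in zip(candidates, unnormalised)]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


if __name__ == "__main__":
    # Two placements with equal scores; only one implies a plausible fragment.
    demo = [("domain_A", 10.0, 300), ("domain_B", 10.0, 600)]
    for label, posterior in rank_mate_placements(demo, frag_mean=300, frag_sd=50):
        print(label, round(posterior, 4))
```

With equal alignment scores, the placement whose implied fragment length sits near the assumed mean receives almost all of the posterior weight, which is the intuition behind using paired-end information to rescue a weakly aligned end.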
2.
Bioinformatics ; 32(17): i520-i528, 2016 Sep 01.
Article in English | MEDLINE | ID: mdl-27587670

ABSTRACT

MOTIVATION: Clustered regularly interspaced short palindromic repeats and associated proteins (CRISPR-Cas) allow more specific and efficient gene editing than all previous genetic engineering systems. These exciting discoveries stem from the finding of the CRISPR system being an adaptive immune system that protects the prokaryotes against exogenous genetic elements such as phages. Despite the exciting discoveries, almost all knowledge about CRISPRs is based only on microorganisms that can be isolated, cultured and sequenced in labs. However, about 95% of bacterial species cannot be cultured in labs. The fast accumulation of metagenomic data, which contain DNA sequences of microbial species from natural samples, provides a unique opportunity for CRISPR annotation in uncultivable microbial species. However, the large amount of data, heterogeneous coverage and shared leader sequences of some CRISPRs pose challenges for identifying CRISPRs efficiently in metagenomic data. RESULTS: In this study, we developed a CRISPR finding tool for metagenomic data without relying on generic assembly, which is error-prone and computationally expensive for complex data. Our tool can run on commonly available machines in small labs. It employs properties of CRISPRs to decompose generic assembly into local assembly. We tested it on both mock and real metagenomic data and benchmarked the performance with state-of-the-art tools. AVAILABILITY AND IMPLEMENTATION: The source code and the documentation of metaCRISPR are available at https://github.com/hangelwen/metaCRISPR CONTACT: yannisun@msu.edu.


Subject(s)
Bacteriophages , Clustered Regularly Interspaced Short Palindromic Repeats , Metagenomics , Genetic Variation , Prokaryotic Cells
3.
Bioinformatics ; 31(12): i35-43, 2015 Jun 15.
Article in English | MEDLINE | ID: mdl-26072503

ABSTRACT

Metagenomic data, which contain sequenced DNA reads of uncultured microbial species from environmental samples, provide a unique opportunity to thoroughly analyze microbial species that have never been identified before. Reconstructing 16S ribosomal RNA, a phylogenetic marker gene, is usually required to analyze the composition of the metagenomic data. However, the massive volume of data, high sequence similarity between related species, skewed microbial abundance and lack of reference genes make 16S rRNA reconstruction difficult. Generic de novo assembly tools are not optimized for assembling 16S rRNA genes. In this work, we introduce a targeted rRNA assembly tool, REAGO (REconstruct 16S ribosomal RNA Genes from metagenOmic data). It addresses the above challenges by combining secondary structure-aware homology search, properties of rRNA genes and de novo assembly. Our experimental results show that our tool can correctly recover more rRNA genes than several popular generic metagenomic assembly tools and specially designed rRNA construction tools. AVAILABILITY AND IMPLEMENTATION: The source code of REAGO is freely available at https://github.com/chengyuan/reago.


Subject(s)
Computational Biology/methods , Genes, rRNA/genetics , Metagenomics/methods , Microbiota/genetics , RNA, Ribosomal, 16S/genetics , Databases, Nucleic Acid , Phylogeny , RNA, Bacterial/genetics , Sequence Analysis, RNA
4.
Plant Physiol ; 167(1): 25-39, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25384563

ABSTRACT

The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-P to update and revise the maize (Zea mays) B73 RefGen_v3 annotation build (5b+) in less than 3 h using the iPlant Cyberinfrastructure. MAKER-P identified and annotated 4,466 additional, well-supported protein-coding genes not present in the 5b+ annotation build, added additional untranslated regions to 1,393 5b+ gene models, identified 2,647 5b+ gene models that lack any supporting evidence (despite the use of large and diverse evidence data sets), identified 104,215 pseudogene fragments, and created an additional 2,522 noncoding gene annotations. We also describe a method for de novo training of MAKER-P for the annotation of newly sequenced grass genomes. Collectively, these results lead to the 6a maize genome annotation and demonstrate the utility of MAKER-P for rapid annotation, management, and quality control of grasses and other difficult-to-annotate plant genomes.


Subject(s)
Genes, Plant/genetics , Genome, Plant/genetics , Molecular Sequence Annotation/methods , Zea mays/genetics , Databases, Genetic/standards , Exons/genetics , Introns/genetics , Models, Genetic , Molecular Sequence Annotation/standards , Pseudogenes/genetics , Quality Control , RNA, Untranslated/genetics
5.
Bioinformatics ; 30(19): 2837-9, 2014 Oct.
Article in English | MEDLINE | ID: mdl-24930140

ABSTRACT

SUMMARY: Plant microRNA prediction tools that use small RNA-sequencing data are emerging quickly. These existing tools have at least one of the following problems: (i) a high false-positive rate; (ii) long running time; (iii) they work only for genomes in their databases; (iv) they are hard to install or use. We developed miR-PREFeR (miRNA PREdiction From small RNA-Seq data), which uses expression patterns of miRNA and follows the criteria for plant microRNA annotation to accurately predict plant miRNAs from one or more small RNA-Seq data samples of the same species. We tested miR-PREFeR on several plant species. The results show that miR-PREFeR is sensitive, accurate and fast, and has a low memory footprint. AVAILABILITY AND IMPLEMENTATION: https://github.com/hangelwen/miR-PREFeR


Subject(s)
Computational Biology/methods , MicroRNAs/metabolism , Sequence Analysis, RNA/methods , Algorithms , Base Sequence , Benchmarking , False Positive Reactions , Genome , Plants/genetics , Reproducibility of Results , Software
6.
Plant Physiol ; 164(2): 513-24, 2014 Feb.
Article in English | MEDLINE | ID: mdl-24306534

ABSTRACT

We have optimized and extended the widely used annotation engine MAKER in order to better support plant genome annotation efforts. New features include better parallelization for large repeat-rich plant genomes, noncoding RNA annotation capabilities, and support for pseudogene identification. We have benchmarked the resulting software tool kit, MAKER-P, using the Arabidopsis (Arabidopsis thaliana) and maize (Zea mays) genomes. Here, we demonstrate the ability of the MAKER-P tool kit to automatically update, extend, and revise the Arabidopsis annotations in light of newly available data and to annotate pseudogenes and noncoding RNAs absent from The Arabidopsis Information Resource 10 build. Our results demonstrate that MAKER-P can be used to manage and improve the annotations of even Arabidopsis, perhaps the best-annotated plant genome. We have also installed and benchmarked MAKER-P on the Texas Advanced Computing Center. We show that this public resource can de novo annotate the entire Arabidopsis and maize genomes in less than 3 h and produce annotations of comparable quality to those of the current The Arabidopsis Information Resource 10 and maize V2 annotation builds.


Subject(s)
Arabidopsis/genetics , Computational Biology/methods , Genome, Plant/genetics , Molecular Sequence Annotation/methods , Software , Zea mays/genetics , Alternative Splicing/genetics , Exons/genetics , Genes, Plant/genetics , Pseudogenes/genetics , Repetitive Sequences, Nucleic Acid/genetics , Reproducibility of Results
7.
BMC Bioinformatics ; 13 Suppl 3: S12, 2012 Mar 21.
Article in English | MEDLINE | ID: mdl-22536896

ABSTRACT

BACKGROUND: ncRNAs (noncoding RNAs) play important roles in many biological processes. Existing genome-scale ncRNA search tools identify ncRNAs in local sequence alignments generated by conventional sequence comparison methods. However, some types of ncRNA lack strong sequence conservation and tend to be missed or mis-aligned by conventional sequence comparison. RESULTS: In this paper, we propose an ncRNA identification framework that is complementary to existing sequence comparison tools. By integrating a filtration step based on Hamming distance and ncRNA alignment programs such as FOLDALIGN or PLAST-ncRNA, the proposed ncRNA search framework can identify ncRNAs that lack strong sequence conservation. In addition, as the ratio of transition to transversion mutations is often used as a discriminative feature for functional ncRNA identification, we incorporate this feature into the filtration step using a coding strategy. We apply Hamming distance seeds to ncRNA search in the intergenic regions of human and mouse genomes and between the Burkholderia cenocepacia J2315 genome and the Ralstonia solanacearum genome. The experimental results demonstrate that a carefully designed Hamming distance seed can achieve better sensitivity in searching for poorly conserved ncRNAs than conventional sequence comparison tools. CONCLUSIONS: Hamming distance seeds provide better sensitivity as a filtration strategy for genome-wide ncRNA homology search than the existing seeding strategies used in BLAST-like tools. By combining Hamming distance seed matching and ncRNA alignment, we are able to find ncRNAs with sequence similarities below 60%.


Subject(s)
Algorithms , RNA, Untranslated/genetics , RNA, Untranslated/isolation & purification , Sequence Alignment/methods , Animals , Base Sequence , Burkholderia cenocepacia/genetics , Conserved Sequence , Genome , Genome, Bacterial , Humans , Mice , Molecular Sequence Data , RNA, Bacterial/chemistry , RNA, Bacterial/genetics , RNA, Bacterial/isolation & purification , RNA, Untranslated/chemistry , Ralstonia solanacearum/genetics
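
The filtration idea in the abstract above (keep window pairs within a small Hamming distance, and track transitions versus transversions) can be illustrated with a brute-force sketch. This is not the paper's seed design or its coding strategy. It simply compares every pair of fixed-width windows, and the function names, the `width` and `max_distance` parameters, and the A/C/G/T(U) alphabet are assumptions made for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_mismatches(a, b):
    """Return (transitions, transversions) between two equal-length windows.

    U is treated as T so that RNA and DNA strings can be mixed. Any mismatch
    that is not purine/purine or pyrimidine/pyrimidine counts as a transversion.
    """
    if len(a) != len(b):
        raise ValueError("windows must have equal length")
    a = a.upper().replace("U", "T")
    b = b.upper().replace("U", "T")
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


def seed_hits(seq1, seq2, width, max_distance):
    """Yield (i, j, transitions, transversions) for every window pair whose
    Hamming distance is at most max_distance. Quadratic brute force."""
    for i in range(len(seq1) - width + 1):
        window1 = seq1[i:i + width]
        for j in range(len(seq2) - width + 1):
            ts, tv = classify_mismatches(window1, seq2[j:j + width])
            if ts + tv <= max_distance:
                yield i, j, ts, tv


if __name__ == "__main__":
    for hit in seed_hits("ACGUACGU", "ACGCACAU", width=6, max_distance=2):
        print(hit)
```

A downstream step could, for example, prefer hits whose mismatches are mostly transitions before passing them to a structural aligner. That ranking rule is likewise an assumption and is not described in the abstract.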
8.
Article in English | MEDLINE | ID: mdl-23929857

ABSTRACT

Noncoding RNA (ncRNA) identification is highly important to modern biology. The state-of-the-art method for ncRNA identification is based on comparative genomics, in which evolutionary conservation of sequences and secondary structures provides important evidence for ncRNA search. For ncRNAs with low sequence conservation but high structural similarity, conventional local alignment tools such as BLAST yield low sensitivity. Thus, there is a need for ncRNA search methods that can incorporate both sequence and structural similarities. We introduce chain-RNA, a pairwise structural alignment tool that can effectively locate cross-species conserved RNA elements with low sequence similarity. In chain-RNA, stem-loop structures are extracted from dot plots generated by an efficient local-folding algorithm. Then, we formulate stem alignment as an extended 2D chain problem and employ existing chain algorithms. Chain-RNA is tested on a data set containing annotated ncRNA homologs and is applied to novel ncRNA search in a transcriptomic data set. The experimental results show that chain-RNA has a better tradeoff between sensitivity and false positive rate in ncRNA prediction than conventional sequence similarity search tools and is more time efficient than structural alignment tools. The source code of chain-RNA can be downloaded at http://sourceforge.net/projects/chain-rna/ or at http://www.cse.msu.edu/~leijikai/chain-rna/.


Subject(s)
Algorithms , Computational Biology/methods , Models, Genetic , Nucleic Acid Conformation , RNA, Untranslated/chemistry , RNA, Untranslated/genetics , Databases, Genetic , Genome, Bacterial , ROC Curve , Sequence Alignment , Sequence Analysis, RNA
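
The abstract above casts stem alignment as an extended 2D chain problem. For orientation, the sketch below solves only the basic 2D chaining problem with a simple quadratic dynamic program. It does not reproduce chain-RNA's extended formulation or the faster chain algorithms the authors reuse. The `Fragment` type, its field names, and the idea that each fragment stands for a matched stem pair with a positive weight are assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Fragment(NamedTuple):
    """A matched region: [x_start, x_end] in sequence 1 paired with
    [y_start, y_end] in sequence 2, with a positive weight (hypothetical)."""
    x_start: int
    x_end: int
    y_start: int
    y_end: int
    weight: float


def best_chain(fragments: List[Fragment]) -> Tuple[float, List[Fragment]]:
    """Return the maximum total weight and one optimal chain.

    A chain is a list of fragments in which each fragment starts strictly
    after the previous one ends in both sequences (collinear, no overlap).
    """
    if not fragments:
        return 0.0, []
    # A predecessor must end before its successor starts in x, so it also
    # starts earlier in x; sorting by x_start makes a single pass sufficient.
    order = sorted(fragments, key=lambda f: (f.x_start, f.y_start))
    best = []  # best[i]: weight of the best chain ending at order[i]
    prev = []  # prev[i]: index of the predecessor in that chain, or -1
    for i, f in enumerate(order):
        score, back = f.weight, -1
        for j in range(i):
            g = order[j]
            if g.x_end < f.x_start and g.y_end < f.y_start:
                if best[j] + f.weight > score:
                    score, back = best[j] + f.weight, j
        best.append(score)
        prev.append(back)
    # Trace back from the highest-scoring chain end.
    i = max(range(len(order)), key=best.__getitem__)
    total = best[i]
    chain = []
    while i != -1:
        chain.append(order[i])
        i = prev[i]
    chain.reverse()
    return total, chain


if __name__ == "__main__":
    stems = [
        Fragment(0, 5, 0, 5, 3.0),
        Fragment(6, 10, 6, 10, 2.0),
        Fragment(3, 8, 12, 20, 4.0),  # overlaps the others in sequence 1
    ]
    print(best_chain(stems))
```

The quadratic loop is chosen here for readability. Sparse dynamic programming with range-maximum queries is the usual way to bring 2D chaining closer to O(n log n).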