Results 1 - 20 of 78
1.
Genome Res ; 34(2): 326-340, 2024 Mar 20.
Article in English | MEDLINE | ID: mdl-38428994

ABSTRACT

Pacific Biosciences (PacBio) HiFi sequencing technology generates long reads (>10 kbp) with very high accuracy (<0.01% sequencing error). Although several de novo assembly tools are available for HiFi reads, there are no comprehensive studies on the evaluation of these assemblers. We evaluated the performance of 11 de novo HiFi assemblers on (1) real data for three eukaryotic genomes; (2) 34 synthetic data sets with different ploidy, sequencing coverage levels, heterozygosity rates, and sequencing error rates; (3) one real metagenomic data set; and (4) five synthetic metagenomic data sets with different composition abundance and heterozygosity rates. The 11 assemblers were evaluated using quality assessment tool (QUAST) and benchmarking universal single-copy ortholog (BUSCO). We also used several additional criteria, namely, completion rate, single-copy completion rate, duplicated completion rate, average proportion of largest category, average distance difference, quality value, run-time, and memory utilization. Results show that hifiasm and hifiasm-meta should be the first choice for assembling eukaryotic genomes and metagenomes with HiFi data. We performed a comprehensive benchmarking study of commonly used assemblers on complex eukaryotic genomes and metagenomes. Our study will help the research community to choose the most appropriate assembler for their data and identify possible improvements in assembly algorithms.
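
The abstract above quotes a sequencing error rate and lists "quality value" among the evaluation criteria. As a minimal sketch, assuming the usual Phred scaling (QV = -10 log10 of the per-base error rate), the two can be related as follows. The function name `phred_qv` is hypothetical, and the abstract does not say how the authors computed their quality values.

```python
import math


def phred_qv(error_rate: float) -> float:
    """Phred-scaled quality value for a per-base error rate in (0, 1]."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in (0, 1]")
    return -10.0 * math.log10(error_rate)


# A 0.01% error rate (1e-4) corresponds to roughly QV 40 on this scale.
print(round(phred_qv(1e-4), 1))
```

On this scale, every tenfold drop in error rate adds 10 to the quality value.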


Subject(s)
Metagenome , Software , Sequence Analysis, DNA/methods , Algorithms , Metagenomics/methods , High-Throughput Nucleotide Sequencing/methods
2.
J Biol Chem ; 299(11): 105313, 2023 11.
Article in English | MEDLINE | ID: mdl-37797695

ABSTRACT

Effective and safe therapies for the treatment of diseases caused by intraerythrocytic parasites are impeded by the rapid emergence of drug resistance and the lack of novel drug targets. One such disease is human babesiosis, which is a rapidly emerging tick-borne illness caused by Babesia parasites. In this study, we identified fosinopril, a phosphonate-containing, FDA-approved angiotensin converting enzyme (ACE) inhibitor commonly used as a prodrug for hypertension and heart failure, as a potent inhibitor of Babesia duncani parasite development within human erythrocytes. Cell biological and mass spectrometry analyses revealed that the conversion of fosinopril to its active diacid molecule, fosinoprilat, is essential for its antiparasitic activity. We show that this conversion is mediated by a parasite-encoded esterase, BdFE1, which is highly conserved among apicomplexan parasites. Parasites carrying the L238H mutation in the active site of BdFE1 failed to convert the prodrug to its active moiety and became resistant to the drug. Our data set the stage for the development of this class of drugs for the therapy of vector-borne parasitic diseases.


Subject(s)
Babesia , Parasites , Prodrugs , Animals , Humans , Angiotensin-Converting Enzyme Inhibitors/pharmacology , Fosinopril/pharmacology , Prodrugs/pharmacology , Esterases/metabolism
3.
PLoS Pathog ; 18(10): e1010869, 2022 10.
Article in English | MEDLINE | ID: mdl-36215336

ABSTRACT

Natural isolates of the potato and tomato pathogen Phytophthora infestans exhibit substantial variation in virulence, chemical sensitivity, ploidy, and other traits. A chromosome-scale assembly was developed to expand genomic resources for this oomycete microbe, and used to explore the basis of variation. Using PacBio and Illumina data, a long-range linking library, and an optical map, an assembly was created and coalesced into 15 pseudochromosomes spanning 219 Mb using SNP-based genetic linkage data. De novo gene prediction combined with transcript evidence identified 19,981 protein-coding genes, plus about eight thousand tRNA genes. The chromosomes comprised a mosaic of gene-rich and gene-sparse regions plus very long centromeres. Genes exhibited a biased distribution across chromosomes, especially members of families encoding RXLR and CRN effectors, which clustered on certain chromosomes. Strikingly, half of the F1 progeny of diploid parents were polyploid or aneuploid. Substantial expression level polymorphisms between strains were identified, much of which could be attributed to differences in chromosome dosage, transposable element insertions, and adjacency to repetitive DNA. QTL analysis identified a locus on the right arm of chromosome 3 governing sensitivity to the crop protection chemical metalaxyl. Strains heterozygous for resistance often experienced megabase-sized deletions of that part of the chromosome when cultured on metalaxyl, increasing resistance due to loss of the sensitive allele. This study sheds light on diverse phenomena affecting variation in P. infestans and relatives, helps explain the prevalence of polyploidy in natural populations, and provides a new foundation for biologic and genetic investigations.


Subject(s)
Biological Products , Phytophthora infestans , Solanum tuberosum , Humans , Phytophthora infestans/genetics , DNA Transposable Elements , Solanum tuberosum/genetics , Karyotype
4.
Bioinformatics ; 36(24): 5610-5617, 2021 Apr 05.
Article in English | MEDLINE | ID: mdl-33367499

ABSTRACT

MOTIVATION: Histone post-translational modifications (PTMs) are involved in a variety of essential regulatory processes in the cell, including transcription control. Recent studies have shown that histone PTMs can be accurately predicted from the knowledge of transcription factor binding or DNase hypersensitivity data. Similarly, it has been shown that one can predict PTMs from the underlying DNA primary sequence. RESULTS: In this study, we introduce a deep learning architecture called DeepPTM for predicting histone PTMs from transcription factor binding data and the primary DNA sequence. Extensive experimental results show that our deep learning model achieves higher prediction accuracy than the model proposed in Benveniste et al. (PNAS 2014) and DeepHistone (BMC Genomics 2019). The competitive advantage of our framework lies in the synergistic use of deep learning combined with an effective pre-processing step. Our classification framework has also enabled the discovery that the knowledge of a small subset of transcription factors (which are histone-PTM and cell-type-specific) can provide almost the same prediction accuracy that can be obtained using all the transcription factor data. AVAILABILITY AND IMPLEMENTATION: https://github.com/dDipankar/DeepPTM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

5.
BMC Bioinformatics ; 22(1): 502, 2021 Oct 16.
Article in English | MEDLINE | ID: mdl-34656081

ABSTRACT

BACKGROUND: The pan-genome of a species is the union of the genes and non-coding sequences present in all individuals (cultivars, accessions, or strains) within that species. RESULTS: Here we introduce PGV, a reference-agnostic representation of the pan-genome of a species based on the notion of consensus ordering. Our experimental results demonstrate that PGV enables an intuitive, effective and interactive visualization of a pan-genome by providing a genome browser that can elucidate complex structural genomic variations. CONCLUSIONS: The PGV software can be installed via conda or downloaded from https://github.com/ucrbioinfo/PGV. The companion PGV browser at http://pgv.cs.ucr.edu can be tested using example BED tracks available from the GitHub page.


Subject(s)
Genome , Genomics , Computational Biology , Humans , Software
6.
BMC Bioinformatics ; 21(Suppl 14): 367, 2020 Sep 30.
Article in English | MEDLINE | ID: mdl-32998698

ABSTRACT

BACKGROUND: Essential genes are those genes that are critical for the survival of an organism. The prediction of essential genes in bacteria can provide targets for the design of novel antibiotic compounds or antimicrobial strategies. RESULTS: We propose a deep neural network for predicting essential genes in microbes. Our architecture, called DEEPLYESSENTIAL, makes minimal assumptions about the input data (i.e., it only uses the gene primary sequence and the corresponding protein sequence) to carry out the prediction, thus maximizing its practical application compared to existing predictors that require structural or topological features which might not be readily available. We also expose and study a hidden performance bias that affected previous classifiers. Extensive results show that DEEPLYESSENTIAL outperforms existing classifiers that either employ down-sampling to balance the training set or use clustering to exclude multiple copies of orthologous genes. CONCLUSION: Deep neural network architectures can efficiently predict whether a microbial gene is essential (or not) using only its sequence information.


Subject(s)
Bacteria/genetics , Genes, Essential , Neural Networks, Computer , Area Under Curve , Cluster Analysis , Codon , Gram-Negative Bacteria/genetics , Gram-Positive Bacteria/genetics , ROC Curve
7.
Plant J ; 98(5): 767-782, 2019 06.
Article in English | MEDLINE | ID: mdl-31017340

ABSTRACT

Cowpea (Vigna unguiculata [L.] Walp.) is a major crop for worldwide food and nutritional security, especially in sub-Saharan Africa, that is resilient to hot and drought-prone environments. An assembly of the single-haplotype inbred genome of cowpea IT97K-499-35 was developed by exploiting the synergies between single-molecule real-time sequencing, optical and genetic mapping, and an assembly reconciliation algorithm. A total of 519 Mb is included in the assembled sequences. Nearly half of the assembled sequence is composed of repetitive elements, which are enriched within recombination-poor pericentromeric regions. A comparative analysis of these elements suggests that genome size differences between Vigna species are mainly attributable to changes in the amount of Gypsy retrotransposons. Conversely, genes are more abundant in more distal, high-recombination regions of the chromosomes; there appears to be more duplication of genes within the NBS-LRR and the SAUR-like auxin superfamilies compared with other warm-season legumes that have been sequenced. A surprising outcome is the identification of an inversion of 4.2 Mb among landraces and cultivars, which includes a gene that has been associated in other plants with interactions with the parasitic weed Striga gesnerioides. The genome sequence facilitated the identification of a putative syntelog for multiple organ gigantism in legumes. A revised numbering system has been adopted for cowpea chromosomes based on synteny with common bean (Phaseolus vulgaris). An estimate of nuclear genome size of 640.6 Mbp based on cytometry is presented.


Subject(s)
Chromosomes, Plant/genetics , Genes, Plant/genetics , Genome Size/genetics , Genome, Plant/genetics , Vigna/genetics , Chromosome Mapping , DNA, Plant/chemistry , DNA, Plant/genetics , Phaseolus/genetics , Retroelements/genetics , Sequence Analysis, DNA/methods , Synteny
8.
Bioinformatics ; 35(10): 1760-1762, 2019 05 15.
Article in English | MEDLINE | ID: mdl-30295726

ABSTRACT

SUMMARY: A chimeric contig is a contig that has been incorrectly assembled, i.e. a contig that contains one or more mis-joins. The detection of chimeric contigs can be carried out either by aligning assembled contigs to genome-wide maps (e.g. genetic, physical or optical maps) or by mapping sequenced reads to the assembled contigs. Here, we introduce a software tool called Chimericognizer that takes advantage of one or more Bionano Genomics optical maps to accurately detect and correct chimeric contigs. Experimental results show that Chimericognizer is very accurate, and significantly better than the chimeric detection method offered by the Bionano Hybrid Scaffold pipeline. Chimericognizer can also detect and correct chimeric optical molecules. AVAILABILITY AND IMPLEMENTATION: https://github.com/ucrbioinfo/Chimericognizer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genomics , Restriction Mapping , Software , Chromosome Mapping , Contig Mapping , Sequence Analysis, DNA
9.
Bioinformatics ; 35(14): i145-i153, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510653

ABSTRACT

MOTIVATION: High-throughput conformation capture experiments, such as Hi-C, provide genome-wide maps of chromatin interactions, enabling life scientists to investigate the role of the three-dimensional structure of genomes in gene regulation and other essential cellular functions. A fundamental problem in the analysis of Hi-C data is how to compare two contact maps derived from Hi-C experiments. Detecting similarities and differences between contact maps is critical for evaluating the reproducibility of replicate experiments and for identifying differential genomic regions with biological significance. Due to the complexity of chromatin conformations and the presence of technology-driven and sequence-specific biases, the comparative analysis of Hi-C data is analytically and computationally challenging. RESULTS: We present a novel method called Selfish for the comparative analysis of Hi-C data that takes advantage of the structural self-similarity in contact maps. We define a novel self-similarity measure to design algorithms for (i) measuring reproducibility for Hi-C replicate experiments and (ii) finding differential chromatin interactions between two contact maps. Extensive experimental results on simulated and real data show that Selfish is more accurate and robust than state-of-the-art methods. AVAILABILITY AND IMPLEMENTATION: https://github.com/ucrbioinfo/Selfish.


Subject(s)
Chromatin , Genome , Chromosome Mapping , Genomics , Reproducibility of Results
10.
Plant J ; 93(6): 1129-1142, 2018 03.
Article in English | MEDLINE | ID: mdl-29356213

ABSTRACT

Multi-parent advanced generation inter-cross (MAGIC) populations are an emerging type of resource for dissecting the genetic structure of traits and improving breeding populations. We developed a MAGIC population for cowpea (Vigna unguiculata L. Walp.) from eight founder parents. These founders were genetically diverse and carried many abiotic and biotic stress resistance, seed quality and agronomic traits relevant to cowpea improvement in the United States and sub-Saharan Africa, where cowpea is vitally important in the human diet and local economies. The eight parents were inter-crossed using structured matings to ensure that the population would have balanced representation from each parent, followed by single-seed descent, resulting in 305 F8 recombinant inbred lines each carrying a mosaic of genome blocks contributed by all founders. This was confirmed by single nucleotide polymorphism genotyping with the Illumina Cowpea Consortium Array. These lines were on average 99.74% homozygous but also diverse in agronomic traits across environments. Quantitative trait loci (QTLs) were identified for several parental traits. Loci with major effects on photoperiod sensitivity and seed size were also verified by biparental genetic mapping. The recombination events were concentrated in telomeric regions. Due to its broad genetic base, this cowpea MAGIC population promises breakthroughs in genetic gain, QTL and gene discovery, enhancement of breeding populations and, for some lines, direct releases as new varieties.


Subject(s)
Genes, Plant/genetics , Plant Breeding/methods , Quantitative Trait Loci/genetics , Vigna/genetics , Chromosome Mapping , Chromosomes, Plant/genetics , Crosses, Genetic , Genetics, Population , Genome, Plant/genetics , Genotype , Phylogeny , Polymorphism, Single Nucleotide , Seeds/genetics , Species Specificity , Vigna/classification
11.
Bioinformatics ; 34(13): i43-i51, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29949964

ABSTRACT

Motivation: De novo genome assembly is a challenging computational problem due to the high repetitive content of eukaryotic genomes and the imperfections of sequencing technologies (i.e. sequencing errors, uneven sequencing coverage and chimeric reads). Several assembly tools are currently available, each of which has strengths and weaknesses in dealing with the trade-off between maximizing contiguity and minimizing assembly errors (e.g. mis-joins). To obtain the best possible assembly, it is common practice to generate multiple assemblies from several assemblers and/or parameter settings and try to identify the highest quality assembly. Unfortunately, often there is no assembly that both maximizes contiguity and minimizes assembly errors, so one has to compromise one for the other. Results: The concept of assembly reconciliation has been proposed as a way to obtain a higher quality assembly by merging or reconciling all the available assemblies. While several reconciliation methods have been introduced in the literature, we have shown in one of our recent papers that none of them can consistently produce assemblies that are better than the assemblies provided in input. Here we introduce Novo&Stitch, a novel method that takes advantage of optical maps to accurately carry out assembly reconciliation (assuming that the assembled contigs are sufficiently long to be reliably aligned to the optical maps, e.g. 50 Kbp or longer). Experimental results demonstrate that Novo&Stitch can double the contiguity (N50) of the input assemblies without introducing mis-joins or reducing genome completeness. Availability and implementation: Novo&Stitch can be obtained from https://github.com/ucrbioinfo/Novo_Stitch.
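
The abstract above reports contiguity as N50. For readers unfamiliar with the metric, here is a minimal sketch of the standard definition: the length L such that contigs of length at least L cover at least half of the total assembled bases. The contig lengths below are invented toy values, not data from the paper, and the function name `n50` is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (lengths in Kbp): merging contigs raises N50
# while the total assembled length stays the same.
before = [100, 80, 60, 40, 20]
after = [180, 100, 20]  # 100+80 and 60+40 joined
print(n50(before), n50(after))
```

N50 alone rewards aggressive joining, which is why the abstract pairs the contiguity claim with the absence of mis-joins and unchanged genome completeness.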


Subject(s)
Contig Mapping/methods , Eukaryota/genetics , Genome , Sequence Analysis, DNA/methods , Software , Phytophthora infestans/genetics , Vigna/genetics
12.
Metab Eng ; 55: 102-110, 2019 09.
Article in English | MEDLINE | ID: mdl-31216436

ABSTRACT

Genome-wide mutational screens are central to understanding the genetic underpinnings of evolved and engineered phenotypes. The widespread adoption of CRISPR-Cas9 genome editing has enabled such screens in many organisms, but identifying functional sgRNAs still remains a challenge. Here, we developed a methodology to quantify the cutting efficiency of each sgRNA in a genome-scale library, and in doing so improve screens in the biotechnologically important yeast Yarrowia lipolytica. Screening in the presence and absence of native DNA repair enabled high-throughput quantification of sgRNA function leading to the identification of high efficiency sgRNAs that cover 94% of genes. Library validation enhanced the classification of essential genes by identifying inactive guides that create false negatives and mask the effects of successful disruptions. Quantification of guide effectiveness also creates a dataset from which determinants of CRISPR-Cas9 can be identified. Finally, application of the library identified novel mutations for metabolic engineering of high lipid accumulation.


Subject(s)
CRISPR-Cas Systems , Gene Editing , Gene Library , Genes, Fungal , Yarrowia/genetics
13.
Plant J ; 89(5): 1042-1054, 2017 Mar.
Article in English | MEDLINE | ID: mdl-27775877

ABSTRACT

Cowpea (Vigna unguiculata L. Walp.) is a legume crop that is resilient to hot and drought-prone climates, and a primary source of protein in sub-Saharan Africa and other parts of the developing world. However, genome resources for cowpea have lagged behind most other major crops. Here we describe foundational genome resources and their application to the analysis of germplasm currently in use in West African breeding programs. Resources developed from the African cultivar IT97K-499-35 include a whole-genome shotgun (WGS) assembly, a bacterial artificial chromosome (BAC) physical map, and assembled sequences from 4355 BACs. These resources and WGS sequences of an additional 36 diverse cowpea accessions supported the development of a genotyping assay for 51 128 SNPs, which was then applied to five bi-parental RIL populations to produce a consensus genetic map containing 37 372 SNPs. This genetic map enabled the anchoring of 100 Mb of WGS and 420 Mb of BAC sequences, an exploration of genetic diversity along each linkage group, and clarification of macrosynteny between cowpea and common bean. The SNP assay enabled a diversity analysis of materials from West African breeding programs. Two major subpopulations exist within those materials, one of which has significant parentage from South and East Africa and more diversity. There are genomic regions of high differentiation between subpopulations, one of which coincides with a cluster of nodulin genes. The new resources and knowledge help to define goals and accelerate the breeding of improved varieties to address food security issues related to limited-input small-holder farming and climate stress.


Subject(s)
Crops, Agricultural/genetics , Crops, Agricultural/physiology , Vigna/genetics , Vigna/physiology , Chromosomes, Artificial, Bacterial , Chromosomes, Plant/genetics , Climate , Food Supply , Genome, Plant/genetics , Genotype
14.
Bioinformatics ; 32(24): 3823-3825, 2016 12 15.
Article in English | MEDLINE | ID: mdl-27540266

ABSTRACT

The growing number of metagenomic studies in medicine and environmental sciences is creating increasing demands on the computational infrastructure designed to analyze these very large datasets. Often, the construction of ultra-fast and precise taxonomic classifiers can compromise on their sensitivity (i.e. the number of reads correctly classified). Here we introduce CLARK-S, a new software tool that can classify short reads with high precision, high sensitivity and high speed. AVAILABILITY AND IMPLEMENTATION: CLARK-S is freely available at http://clark.cs.ucr.edu/ CONTACT: stelo@cs.ucr.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Metagenomics , Software , Algorithms , Ecosystem , Humans
15.
Bioinformatics ; 32(17): 2696-8, 2016 09 01.
Article in English | MEDLINE | ID: mdl-27153660

ABSTRACT

In response to increasing amounts of sequencing data, ever-faster aligners are needed. Here, we introduce BRAT-nova, a completely rewritten and improved implementation of the mapping tool BRAT-BW for bisulfite-treated reads (BS-Seq). BRAT-nova is very fast and accurate. On the human genome, BRAT-nova is 2-7 times faster than state-of-the-art aligners, while maintaining the same percentage of uniquely mapped reads and space usage. On synthetic reads, BRAT-nova is 2-8 times faster than state-of-the-art aligners while maintaining similar mapping accuracy, methylation call accuracy, methylation level accuracy and space efficiency. AVAILABILITY AND IMPLEMENTATION: The software is available in the public domain at http://compbio.cs.ucr.edu/brat/ CONTACT: elenah@cs.ucr.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Sequence Alignment , Sequence Analysis, DNA , Software , Chromosome Mapping , DNA Methylation , Genome, Human , High-Throughput Nucleotide Sequencing , Humans
16.
PLoS Comput Biol ; 12(10): e1005107, 2016 Oct.
Article in English | MEDLINE | ID: mdl-27760124

ABSTRACT

Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don't-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de/.
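
The abstract above describes binary patterns of match and don't-care positions used as a filter on words. A minimal sketch of how one such pattern can be applied to a sequence follows; it assumes the common convention that '1' marks a match position and '0' a don't-care position. The function name `spaced_words` and the toy sequences are hypothetical, and the sketch does not reproduce rasbhari's pattern-set optimization, only the way a given pattern is used.

```python
def spaced_words(sequence, pattern):
    """List the spaced words of `sequence` under a binary `pattern`.

    Characters under '1' (match) positions are kept; characters under
    '0' (don't-care) positions are skipped.
    """
    if not pattern or set(pattern) - {"0", "1"}:
        raise ValueError("pattern must be a non-empty string of 0s and 1s")
    keep = [i for i, flag in enumerate(pattern) if flag == "1"]
    span = len(pattern)
    return [
        "".join(sequence[start + i] for i in keep)
        for start in range(len(sequence) - span + 1)
    ]


# Two toy sequences that differ by a single substitution (G vs C).
a, b = "ACGTAC", "ACCTAC"

# With a don't-care at the third position, one window still matches.
shared_spaced = set(spaced_words(a, "1101")) & set(spaced_words(b, "1101"))
# With a contiguous word of the same span, the substitution breaks every window.
shared_contiguous = set(spaced_words(a, "1111")) & set(spaced_words(b, "1111"))
print(shared_spaced, shared_contiguous)
```

In this toy case the spaced pattern tolerates the mismatch where the contiguous word of the same span does not, which gives some intuition for why the choice of patterns affects sensitivity.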


Subject(s)
Algorithms , DNA/genetics , Database Management Systems , Databases, Genetic , Sequence Analysis, DNA/methods , Software , DNA/chemistry , DNA Mutational Analysis/methods , Data Mining/methods , Machine Learning , Pattern Recognition, Automated/methods , Sequence Alignment/methods
17.
Plant J ; 84(1): 216-27, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26252423

ABSTRACT

Barley (Hordeum vulgare L.) possesses a large and highly repetitive genome of 5.1 Gb that has hindered the development of a complete sequence. In 2012, the International Barley Sequencing Consortium released a resource integrating whole-genome shotgun sequences with a physical and genetic framework. However, because only 6278 bacterial artificial chromosomes (BACs) in the physical map were sequenced, fine structure was limited. To gain access to the gene-containing portion of the barley genome at high resolution, we identified and sequenced 15 622 BACs representing the minimal tiling path of 72 052 physical-mapped gene-bearing BACs. This generated ~1.7 Gb of genomic sequence containing an estimated 2/3 of all Morex barley genes. Exploration of these sequenced BACs revealed that although distal ends of chromosomes contain most of the gene-enriched BACs and are characterized by high recombination rates, there are also gene-dense regions with suppressed recombination. We made use of published map-anchored sequence data from Aegilops tauschii to develop a synteny viewer between barley and the ancestor of the wheat D-genome. Except for some notable inversions, there is a high level of collinearity between the two species. The software HarvEST:Barley provides facile access to BAC sequences and their annotations, along with the barley-Ae. tauschii synteny viewer. These BAC sequences constitute a resource to improve the efficiency of marker development, map-based cloning, and comparative genomics in barley and related crops. Additional knowledge about regions of the barley genome that are gene-dense but low in recombination is particularly relevant.


Subject(s)
Chromosomes, Artificial, Bacterial/genetics , Genome, Plant/genetics , Hordeum/genetics , Molecular Sequence Data
18.
Bioinformatics ; 31(12): i9-16, 2015 Jun 15.
Article in English | MEDLINE | ID: mdl-26072514

ABSTRACT

We introduce a new divide-and-conquer approach to deal with the problem of de novo genome assembly in the presence of ultra-deep sequencing data (i.e. coverage of 1000x or higher). Our proposed meta-assembler Slicembler partitions the input data into optimal-sized 'slices' and uses a standard assembly tool (e.g. Velvet, SPAdes, IDBA_UD and Ray) to assemble each slice individually. Slicembler uses majority voting among the individual assemblies to identify long contigs that can be merged into the consensus assembly. To improve its efficiency, Slicembler uses a generalized suffix tree to identify these frequent contigs (or fractions thereof). Extensive experimental results on real ultra-deep sequencing data (8000x coverage) and simulated data show that Slicembler significantly improves the quality of the assembly compared with the performance of the base assembler. In fact, most of the time, Slicembler generates error-free assemblies. We also show that Slicembler is much more resistant to high sequencing error rates than the base assembler. AVAILABILITY AND IMPLEMENTATION: Slicembler can be accessed at http://slicembler.cs.ucr.edu/.
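
The slice-then-vote idea in the abstract above can be illustrated with a deliberately simplified sketch. Everything here is an assumption made for illustration: the names `slice_reads` and `majority_contigs`, the `min_fraction` parameter, the caller-supplied `assemble` function standing in for a real assembler, and the use of exact whole-contig matching in place of the generalized suffix tree the abstract mentions. It is not Slicembler's actual implementation.

```python
from collections import Counter


def slice_reads(reads, slice_size):
    """Partition reads into consecutive slices of at most slice_size reads."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    return [reads[i:i + slice_size] for i in range(0, len(reads), slice_size)]


def majority_contigs(reads, slice_size, assemble, min_fraction=0.5):
    """Keep contigs produced by more than min_fraction of the slices.

    `assemble` is a hypothetical callable mapping a list of reads to an
    iterable of contig strings; a real pipeline would call an assembler.
    """
    slices = slice_reads(reads, slice_size)
    votes = Counter()
    for chunk in slices:
        # Each slice votes at most once for a given contig.
        votes.update(set(assemble(chunk)))
    threshold = min_fraction * len(slices)
    return sorted(contig for contig, n in votes.items() if n > threshold)


def toy_assemble(chunk):
    """Stand-in assembler: treats every distinct read as a finished contig."""
    return set(chunk)


toy_reads = ["AAA", "CCC", "AAA", "CCC", "AAA", "GGG"]
consensus = majority_contigs(toy_reads, slice_size=2, assemble=toy_assemble)
print(consensus)
```

In this toy run, "GGG" appears in only one of three slices and is voted out. The intuition is that a contig arising from slice-specific sequencing errors is unlikely to recur across independent slices.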


Subject(s)
Algorithms , Genes, Plant , High-Throughput Nucleotide Sequencing/methods , Hordeum/genetics , Sequence Analysis, DNA/methods
19.
Bioinformatics ; 31(18): 2972-80, 2015 Sep 15.
Article in English | MEDLINE | ID: mdl-25995232

ABSTRACT

MOTIVATION: Since the invention of DNA sequencing in the 1970s, computational biologists have had to deal with the problem of de novo genome assembly with limited (or insufficient) depth of sequencing. In this work, we investigate the opposite problem, that is, the challenge of dealing with excessive depth of sequencing. RESULTS: We explore the effect of ultra-deep sequencing data in two domains: (i) the problem of decoding reads to bacterial artificial chromosome (BAC) clones (in the context of the combinatorial pooling design we have recently proposed), and (ii) the problem of de novo assembly of BAC clones. Using real ultra-deep sequencing data, we show that when the depth of sequencing increases over a certain threshold, sequencing errors make these two problems harder and harder (instead of easier, as one would expect with error-free data), and as a consequence the quality of the solution degrades with more and more data. For the first problem, we propose an effective solution based on 'divide and conquer': we 'slice' a large dataset into smaller samples of optimal size, decode each slice independently, and then merge the results. Experimental results on over 15 000 barley BACs and over 4000 cowpea BACs demonstrate a significant improvement in the quality of the decoding and the final assembly. For the second problem, we show for the first time that modern de novo assemblers cannot take advantage of ultra-deep sequencing data. AVAILABILITY AND IMPLEMENTATION: Python scripts to process slices and resolve decoding conflicts are available from http://goo.gl/YXgdHT; software Hashfilter can be downloaded from http://goo.gl/MIyZHs CONTACT: stelo@cs.ucr.edu or timothy.close@ucr.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Computational Biology/methods , Fabaceae/genetics , High-Throughput Nucleotide Sequencing/methods , Hordeum/genetics , Sequence Analysis, DNA/methods , Software , Chromosomes, Artificial, Bacterial , Sequence Alignment
20.
BMC Genomics ; 16: 236, 2015 Mar 25.
Article in English | MEDLINE | ID: mdl-25879410

ABSTRACT

BACKGROUND: The problem of supervised DNA sequence classification arises in several fields of computational molecular biology. Although this problem has been extensively studied, it is still computationally challenging due to the size of the datasets that modern sequencing technologies can produce. RESULTS: We introduce CLARK, a novel approach to classify metagenomic reads at the species or genus level with high accuracy and high speed. Extensive experimental results on various metagenomic samples show that the classification accuracy of CLARK is better than or comparable to the best state-of-the-art tools and that it is significantly faster than any of its competitors. In its fastest single-threaded mode, CLARK classifies, with high accuracy, about 32 million metagenomic short reads per minute. CLARK can also classify BAC clones or transcripts to chromosome arms and centromeric regions. CONCLUSIONS: CLARK is a versatile, fast and accurate sequence classification method, especially useful for metagenomics and genomics applications. It is freely available at http://clark.cs.ucr.edu/.


Subject(s)
Algorithms , Metagenomics , Genomics , Internet , User-Computer Interface