Search | BVS Research Portal

SpaRC: scalable sequence clustering using Apache Spark.

Shi, Lizhen; Meng, Xiandong; Tseng, Elizabeth; Mascagni, Michael; Wang, Zhong.

Bioinformatics ; 35(5): 760-768, 2019 03 01.

Article in English | MEDLINE | ID: mdl-30816928

ABSTRACT

MOTIVATION: Whole genome shotgun based next-generation transcriptomics and metagenomics studies often generate 100-1000 GB sequence data derived from tens of thousands of different genes or microbial species. Assembly of these data sets requires tradeoffs between scalability and accuracy. Current assembly methods optimized for scalability often sacrifice accuracy and vice versa. An ideal solution would both scale and produce optimal accuracy for individual genes or genomes.

RESULTS: Here we describe an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomes and metagenomes from both short and long read sequencing technologies. It achieves near-linear scalability with input data size and number of compute nodes. SpaRC can run on both cloud computing and HPC environments without modification while delivering similar performance. Our results demonstrate that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar large-scale sequence data analysis problems.

AVAILABILITY AND IMPLEMENTATION: https://bitbucket.org/berkeleylab/jgi-sparc.
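The abstract does not say how reads are partitioned, so the following is only a generic illustration of read clustering, not SpaRC's actual method. It is a single-machine Python sketch that links reads sharing enough k-mers and takes connected components with a union-find structure. The function names, `k`, and `min_shared` are all hypothetical choices for illustration.

```python
from collections import defaultdict


def kmers(read, k):
    """Return the set of all length-k substrings of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def cluster_reads(reads, k=5, min_shared=2):
    """Group reads that share at least `min_shared` k-mers, transitively.

    Assumption: k-mer sharing is used here as a stand-in for
    "same molecule of origin"; real tools use more careful criteria.
    """
    parent = list(range(len(reads)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Inverted index: k-mer -> ids of the reads that contain it.
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for km in kmers(read, k):
            index[km].append(rid)

    # Count how many distinct k-mers each pair of reads has in common.
    shared = defaultdict(int)
    for rids in index.values():
        for i in range(len(rids)):
            for j in range(i + 1, len(rids)):
                shared[(rids[i], rids[j])] += 1

    # Merge the clusters of every pair that passes the threshold.
    for (a, b), count in shared.items():
        if count >= min_shared:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for rid in range(len(reads)):
        clusters[find(rid)].append(rid)
    return sorted(clusters.values())
```

Counting pairs is quadratic in the number of reads that share a k-mer, so this sketch suits only toy inputs. Distributing that step is the kind of work a framework such as Apache Spark would take over.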

Subjects

Algorithms , Software , Cluster Analysis , High-Throughput Nucleotide Sequencing , Metagenomics , Sequence Analysis, DNA

Construction and Optimization of a Large Gene Coexpression Network in Maize Using RNA-Seq Data.

Huang, Ji; Vendramin, Stefania; Shi, Lizhen; McGinnis, Karen M.

Plant Physiol ; 175(1): 568-583, 2017 Sep.

Article in English | MEDLINE | ID: mdl-28768814

ABSTRACT

With the emergence of massively parallel sequencing, genome-wide expression data production has reached an unprecedented level. This abundance of data has greatly facilitated maize research, but may not be amenable to traditional analysis techniques that were optimized for other data types. Using publicly available data, a gene coexpression network (GCN) can be constructed and used for gene function prediction, candidate gene selection, and improving understanding of regulatory pathways. Several GCN studies have been done in maize (Zea mays), mostly using microarray datasets. To build an optimal GCN from plant RNA-Seq data, parameters for expression data normalization and network inference were evaluated. A comprehensive evaluation of the effect of these two parameters and a ranked aggregation strategy on network performance was conducted, using libraries from 1266 maize samples. Three normalization methods and 10 inference methods, including six correlation and four mutual information methods, were tested. The three normalization methods had very similar performance. For network inference, correlation methods performed better than mutual information methods for some genes. Increasing sample size also had a positive effect on the GCN. Aggregating single networks together resulted in improved performance compared to single networks.
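As a rough illustration of correlation-based network inference and rank aggregation, the sketch below scores gene pairs by absolute Pearson correlation, ranks them, and averages ranks across several networks. The abstract does not name the specific correlation measures or the aggregation rule, so Pearson correlation, mean-rank aggregation, and every function name here are assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors.

    Assumes neither vector is constant (otherwise the denominator is 0).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def edge_ranks(expr):
    """Rank gene pairs by |r| across samples (rank 1 = strongest edge).

    `expr` maps a gene name to its expression values, one per sample.
    """
    scores = {
        (g1, g2): abs(pearson(expr[g1], expr[g2]))
        for g1, g2 in combinations(sorted(expr), 2)
    }
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {pair: rank for rank, pair in enumerate(ordered, start=1)}


def aggregate(rank_tables):
    """Average each edge's rank over several single networks.

    Mean rank is a hypothetical aggregation rule chosen for simplicity.
    """
    pairs = rank_tables[0].keys()
    return {p: sum(t[p] for t in rank_tables) / len(rank_tables) for p in pairs}
```

Working with ranks rather than raw scores lets networks built with different inference methods be combined even when their scores are on different scales.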

Subjects

Gene Expression Profiling/methods , Gene Regulatory Networks , Sequence Analysis, RNA/methods , Zea mays/genetics , Algorithms , Datasets as Topic , Oligonucleotide Array Sequence Analysis , RNA, Plant/chemistry , RNA, Plant/genetics

DNABERT-S: LEARNING SPECIES-AWARE DNA EMBEDDING WITH GENOME FOUNDATION MODELS.

Zhou, Zhihan; Wu, Weimin; Ho, Harrison; Wang, Jiayi; Shi, Lizhen; Davuluri, Ramana V; Wang, Zhong; Liu, Han.

ArXiv ; 2024 Feb 15.

Article in English | MEDLINE | ID: mdl-38410647

ABSTRACT

Effective DNA embedding remains crucial in genomic analysis, particularly in scenarios lacking labeled data for model fine-tuning, despite the significant advancements in genome foundation models. A prime example is metagenomics binning, a critical process in microbiome research that aims to group DNA sequences by their species from a complex mixture of DNA sequences derived from potentially thousands of distinct, often uncharacterized species. To fill the lack of effective DNA embedding models, we introduce DNABERT-S, a genome foundation model that specializes in creating species-aware DNA embeddings. To encourage effective embeddings for error-prone long-read DNA sequences, we introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer. We further enhance it with the proposed Curriculum Contrastive Learning (C2LR) strategy. Empirical results on 18 diverse datasets showed DNABERT-S's remarkable performance. It outperforms the top baseline's performance in 10-shot species classification with just 2-shot training, while doubling the Adjusted Rand Index (ARI) in species clustering and substantially increasing the number of correctly identified species in metagenomics binning. The code, data, and pre-trained model are publicly available at https://github.com/Zhihan1996/DNABERT_S.
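For readers unfamiliar with the clustering metric cited above, here is a minimal pure-Python sketch of the Adjusted Rand Index using its standard contingency-table formula. It is not taken from the DNABERT-S code base, and the function name is illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare

    # Pairs of items placed together in both partitions.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())

    # Pairs placed together within each partition separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)
```

The index is 1 for identical partitions regardless of how clusters are named, about 0 for chance-level agreement, and can be negative when agreement is worse than chance.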

Transcriptome analysis of Actinoplanes utahensis reveals molecular signature of saccharide impact on acarbose biosynthesis.

Weng, Chun-Yue; Shi, Li-Zhen; Wang, Ya-Jun; Zheng, Yu-Guo.

3 Biotech ; 10(11): 473, 2020 Nov.

Article in English | MEDLINE | ID: mdl-33088668

ABSTRACT

Different carbon sources lead to differential acarbose production in Actinoplanes. To uncover the genes and pathways underlying this difference, we performed transcriptome sequencing of Actinoplanes utahensis ZJB-03852 grown on different saccharides: glucose, maltose, or a saccharide complex consisting of glucose plus maltose. The differentially expressed genes were classified into GO (gene ontology) terms and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways for functional annotation. Key enriched modules were uncovered. Our data revealed that both maltose and its complex with glucose gave improved acarbose titer. Sugar transportation, cytochrome oxidase, protein synthesis and amino acid metabolism modules were enriched under the saccharide complex condition, while ferritin metabolism genes were enriched in the glucose medium. Our results provide a foundation for uncovering the mechanism by which the carbon source affects acarbose production in A. utahensis.

Deconvolute individual genomes from metagenome sequences through short read clustering.

Li, Kexue; Lu, Yakang; Deng, Li; Wang, Lili; Shi, Lizhen; Wang, Zhong.

PeerJ ; 8: e8966, 2020.

Article in English | MEDLINE | ID: mdl-32296615

ABSTRACT

Metagenome assembly from short next-generation sequencing data is a challenging process due to its large scale and computational complexity. Clustering short reads by species before assembly offers a unique opportunity for parallel downstream assembly of genomes with individualized optimization. However, current read clustering methods suffer from either false-negative (under-clustering) or false-positive (over-clustering) problems. Here we extended our previous read clustering software, SpaRC, by exploiting statistics derived from multiple samples in a dataset to reduce the under-clustering problem. Using synthetic and real-world datasets, we demonstrated that this method has the potential to cluster almost all of the short reads from genomes with sufficient sequencing coverage. The improved read clustering in turn leads to improved downstream genome assembly quality.

Computational Strategies for Scalable Genomics Analysis.

Shi, Lizhen; Wang, Zhong.

Genes (Basel) ; 10(12), 2019 12 06.

Article in English | MEDLINE | ID: mdl-31817630

ABSTRACT

The revolution in next-generation DNA sequencing technologies is leading to explosive data growth in genomics, posing a significant challenge to the computing infrastructure and software algorithms for genomics analysis. Various big data technologies have been explored to scale up/out current bioinformatics solutions to mine the big genomics data. In this review, we survey some of these exciting developments in the applications of parallel distributed computing and special hardware to genomics. We comment on the pros and cons of each strategy in the context of ease of development, robustness, scalability, and efficiency. Although this review is written for an audience from the genomics and bioinformatics fields, it may also be informative for the audience of computer science with interests in genomics applications.

Subjects

Algorithms , Computational Biology , Genomics , High-Throughput Nucleotide Sequencing , Software
