Search | BVS Research Portal

A comprehensive investigation of statistical and machine learning approaches for predicting complex human diseases on genomic variants.

Wang, Chonghao; Zhang, Jing; Veldsman, Werner Pieter; Zhou, Xin; Zhang, Lu.

Brief Bioinform; 24(1), 2023 Jan 19.

Article in English | MEDLINE | ID: mdl-36585786

ABSTRACT

Quantifying an individual's risk for common diseases is an important goal of precision health. The polygenic risk score (PRS), which aggregates multiple risk alleles of candidate diseases, has emerged as a standard approach for identifying high-risk individuals. Although several studies have benchmarked PRS calculation tools and assessed their potential to guide future clinical applications, some gaps remain, namely the lack of (i) simulated data covering different genetic effects; (ii) evaluation of machine learning models; and (iii) evaluation across multiple ancestries. In this study, we systematically validated and compared 13 statistical methods, 5 machine learning models and 2 ensemble models using simulated data with additive and genetic interaction models, 22 common diseases with internal training sets, 4 common diseases with external summary statistics and 3 common diseases for trans-ancestry studies in UK Biobank. The statistical methods performed better on simulated data from additive models, whereas machine learning models had an edge on data that include genetic interactions. Ensemble models, which integrate various statistical methods, were generally the best choice. LDpred2 outperformed the other standalone tools, whereas PRS-CS, lassosum and DBSLMM showed comparable performance. We also found that disease heritability strongly affected the predictive performance of all methods. Both the number and the effect sizes of risk SNPs are important, and sample size strongly influences the performance of all methods. In the trans-ancestry studies, the performance of most methods worsened when the training and testing sets came from different populations.

Subjects

Machine Learning; Multifactorial Inheritance; Humans; Risk Factors; Genomics; Genetic Predisposition to Disease; Genome-Wide Association Study/methods
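
The abstract above describes the PRS as an aggregate of risk alleles. A minimal sketch of that idea follows, assuming the simplest additive form: a sum of per-SNP effect sizes multiplied by risk-allele dosages (0, 1 or 2). The function name, SNP identifiers and weights are hypothetical, and the sketch does not reproduce any of the benchmarked tools such as LDpred2 or PRS-CS.

```python
from typing import Mapping


def polygenic_risk_score(
    dosages: Mapping[str, float],
    weights: Mapping[str, float],
) -> float:
    """Additive PRS: sum of effect size x risk-allele dosage.

    SNPs that have a weight but no genotype are skipped (an assumption;
    real pipelines often impute or mean-fill missing genotypes instead).
    """
    return sum(
        weight * dosages[snp]
        for snp, weight in weights.items()
        if snp in dosages
    )


# Hypothetical effect sizes (e.g. log odds ratios from summary statistics).
weights = {"snp1": 0.5, "snp2": -0.25, "snp3": 0.125}
# Hypothetical risk-allele counts for one individual.
dosages = {"snp1": 2, "snp2": 1, "snp3": 0}

score = polygenic_risk_score(dosages, weights)
print(score)
```

The methods compared in the study differ mainly in how the weights are estimated (for example by shrinkage or by modelling linkage disequilibrium), not in this final weighted sum.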

Benchmarking genome assembly methods on metagenomic sequencing data.

Zhang, Zhenmiao; Yang, Chao; Veldsman, Werner Pieter; Fang, Xiaodong; Zhang, Lu.

Brief Bioinform; 24(2), 2023 Mar 19.

Article in English | MEDLINE | ID: mdl-36917471

ABSTRACT

Metagenome assembly is an efficient approach to reconstruct microbial genomes from metagenomic sequencing data. Although short-read sequencing has been widely used for metagenome assembly, linked- and long-read sequencing have shown advantages in assembly by providing long-range DNA connectedness. Many metagenome assembly tools have been developed to simplify the assembly graphs and resolve the repeats in microbial genomes. However, there remains no comprehensive evaluation of metagenomic sequencing technologies, and there is a lack of practical guidance on selecting the appropriate metagenome assembly tools. This paper presents a comprehensive benchmark of 19 commonly used assembly tools applied to metagenomic sequencing datasets obtained from simulation, mock communities or human gut microbiomes. These datasets were generated using mainstream sequencing platforms, such as Illumina and BGISEQ short-read sequencing, 10x Genomics linked-read sequencing, and PacBio and Oxford Nanopore long-read sequencing. The assembly tools were extensively evaluated against many criteria, which revealed that long-read assemblers generated high contig contiguity but failed to reveal some medium- and high-quality metagenome-assembled genomes (MAGs). Linked-read assemblers obtained the highest overall number of near-complete MAGs from the human gut microbiomes. Hybrid assemblers using both short- and long-read sequencing were promising methods to improve both total assembly length and the number of near-complete MAGs. This paper also discusses the running time and peak memory consumption of these assembly tools and provides practical guidance on selecting them.

Subjects

Metagenome; Microbiota; Humans; Benchmarking; Microbiota/genetics; Metagenomics/methods; Genomics/methods; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods
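
The abstract above reports "contig contiguity" without naming a metric. A common way to summarize contiguity is N50, and the sketch below assumes that metric purely for illustration; the function name and the contig lengths are hypothetical and are not taken from the paper.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    # Unreachable for non-empty input of positive lengths.
    raise ValueError("contig lengths must be positive")


# Hypothetical contig lengths in base pairs.
example_lengths = [100, 80, 60, 40, 20]
example_n50 = n50(example_lengths)
print(example_n50)
```

A higher N50 means fewer, longer contigs, but it says nothing about correctness, which is why benchmarks such as this one also count medium-, high-quality and near-complete MAGs.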

Benchmarking multi-platform sequencing technologies for human genome assembly.

Wang, Jingjing; Veldsman, Werner Pieter; Fang, Xiaodong; Huang, Yufen; Xie, Xuefeng; Lyu, Aiping; Zhang, Lu.

Brief Bioinform; 24(5), 2023 Sep 20.

Article in English | MEDLINE | ID: mdl-37594299

ABSTRACT

Genome assembly is a computational technique that involves piecing together deoxyribonucleic acid (DNA) fragments generated by sequencing technologies to create a comprehensive and precise representation of the entire genome. Generating a high-quality human reference genome is a crucial prerequisite for comprehending human biology, and it is also vital for downstream genomic variation analysis. Many efforts have been made over the past few decades to create a complete and gapless reference genome for humans by using a diverse range of advanced sequencing technologies. Several available tools aim to enhance the quality of haploid and diploid human genome assemblies through contig assembly, polishing of contig errors, scaffolding and variant phasing. Selecting the appropriate tools and technologies remains a daunting task, even though several studies have investigated the pros and cons of different assembly strategies. The goal of this paper was to benchmark various strategies for human genome assembly by combining sequencing technologies and tools on two publicly available samples (NA12878 and NA24385) from Genome in a Bottle. We then compared their performance in terms of continuity, accuracy, completeness, variant calling and phasing. We observed that PacBio HiFi long reads are the optimal choice for generating an assembly with low base errors. On the other hand, we were able to produce the most continuous contigs with Oxford Nanopore long reads, but they may require further polishing to improve quality. We recommend using short reads, rather than the long reads themselves, to improve the base accuracy of contigs from Oxford Nanopore long reads. Hi-C is the best choice for chromosome-level scaffolding because it can capture longer-range DNA connectedness than 10× linked reads and Bionano optical maps. However, a combination of multiple technologies can be used to further improve the quality and completeness of genome assembly. For human diploid genome assembly, hifiasm is the best tool when PacBio HiFi and Hi-C data are used. Looking to the future, we expect that further advancements in human diploid assemblers will leverage the power of PacBio HiFi reads and other technologies with long-range DNA connectedness to enable the generation of high-quality, chromosome-level and haplotype-resolved human genome assemblies.

Subjects

Benchmarking; Genome, Human; Humans; Sequence Analysis, DNA/methods; High-Throughput Nucleotide Sequencing/methods; DNA/genetics

Comparative genomics of the coconut crab and other decapod crustaceans: exploring the molecular basis of terrestrial adaptation.

Veldsman, Werner Pieter; Ma, Ka Yan; Hui, Jerome Ho Lam; Chan, Ting Fung; Baeza, J Antonio; Qin, Jing; Chu, Ka Hou.

BMC Genomics; 22(1): 313, 2021 Apr 30.

Article in English | MEDLINE | ID: mdl-33931033

ABSTRACT

BACKGROUND: The complex life cycle of the coconut crab, Birgus latro, begins when an obligate terrestrial adult female visits the intertidal zone to hatch zoea larvae into the surf. After drifting for several weeks in the ocean, the post-larval glaucothoes settle in the shallow subtidal zone and undergo metamorphosis; the early juveniles then make their way to land, where they undergo further physiological changes that prevent them from ever entering the sea again. Here, we sequenced, assembled and analyzed the coconut crab genome to shed light on its adaptation to terrestrial life. For comparison, we also assembled the genomes of the long-tailed marine-living ornate spiny lobster, Panulirus ornatus, and the short-tailed marine-living red king crab, Paralithodes camtschaticus. Our selection of the latter two organisms furthermore allowed us to explore parallel evolution of the crab-like form in anomurans. RESULTS: All three assembled genomes are large, repeat-rich and AT-rich. Functional analysis reveals that the coconut crab has undergone proliferation of genes involved in the visual, respiratory, olfactory and cytoskeletal systems. Given that the coconut crab has atypical mitochondrial DNA compared to other anomurans, we argue that an abundance of kif22 and other significantly proliferated genes annotated with mitochondrial and microtubule functions points to unique mechanisms for providing cellular energy, in which nuclear protein-coding genes supplement mitochondrial and microtubule function. We furthermore detected in the coconut crab a significantly proliferated HOX gene, caudal, that has been associated with posterior development in Drosophila, but we could not definitively associate this gene with carcinization in the Anomura since it is also significantly proliferated in the ornate spiny lobster. However, a cuticle-associated coatomer gene, gammacop, that is significantly proliferated in the coconut crab may play a role in hardening of the adult coconut crab abdomen in order to mitigate desiccation in terrestrial environments. CONCLUSION: The abundance of genomic features in the three assembled genomes serves as a source of hypotheses for future studies of anomuran environmental adaptations such as shell utilization, perception of visual and olfactory cues in terrestrial environments, and cuticle sclerotization. We hypothesize that the coconut crab exhibits gene proliferation in lieu of alternative splicing as a terrestrial adaptation mechanism and propose life-stage transcriptomic assays to test this hypothesis.

Subjects

Anomura; Brachyura; Palinuridae; Animals; Brachyura/genetics; Cocos; Female; Genomics

LRTK: a platform agnostic toolkit for linked-read analysis of both human genome and metagenome.

Yang, Chao; Zhang, Zhenmiao; Huang, Yufen; Xie, Xuefeng; Liao, Herui; Xiao, Jin; Veldsman, Werner Pieter; Yin, Kejing; Fang, Xiaodong; Zhang, Lu.

Gigascience; 13, 2024 Jan 02.

Article in English | MEDLINE | ID: mdl-38869148

ABSTRACT

BACKGROUND: Linked-read sequencing technologies generate short reads with high base quality that contain extrapolative information on long-range DNA connectedness. These advantages of linked-read technologies are well known and have been demonstrated in many human genomic and metagenomic studies. However, existing linked-read analysis pipelines (e.g., Long Ranger) were primarily developed to process sequencing data from the human genome and are not suited for analyzing metagenomic sequencing data. Moreover, linked-read analysis pipelines are typically limited to one specific sequencing platform. FINDINGS: To address these limitations, we present the Linked-Read ToolKit (LRTK), a unified and versatile toolkit for platform-agnostic processing of linked-read sequencing data from both human genome and metagenome. LRTK provides functions to perform linked-read simulation, barcode sequencing error correction, barcode-aware read alignment and metagenome assembly, reconstruction of long DNA fragments, taxonomic classification and quantification, and barcode-assisted genomic variant calling and phasing. LRTK can process multiple samples automatically and gives users the option to generate reproducible reports during processing of raw sequencing data and at multiple checkpoints throughout downstream analysis. We applied LRTK to linked reads from simulation, mock community, and real datasets for both human genome and metagenome. We showcased LRTK's ability to generate comparative performance results from preceding benchmark studies and to report these results in publication-ready HTML document plots. CONCLUSIONS: LRTK provides comprehensive and flexible modules along with an easy-to-use Python-based workflow for processing linked-read sequencing datasets, thereby filling the current gap in the field caused by platform-centric, genome-specific linked-read data analysis tools.

Subjects

Genome, Human; Metagenome; Metagenomics; Software; Humans; Metagenomics/methods; Sequence Analysis, DNA/methods; High-Throughput Nucleotide Sequencing/methods; Computational Biology/methods
