Search | VHL Regional Portal

McAN: an ultrafast haplotype network construction algorithm

Lun Li; Bo Xu; Dongmei Tian; Cuiping Li; Na Li; Anke Wang; Junwei Zhu; Yongbiao Xue; Zhang Zhang; Yiming Bao; Wenming Zhao; Shuhui Song.

Preprint in English | bioRxiv | ID: ppbiorxiv-501111

ABSTRACT

Summary: Haplotype networks are becoming popular due to their increasing use in analyzing genealogical relationships of closely related genomes. We propose McAN, a minimum-cost arborescence based haplotype network construction algorithm that considers mutation spectrum history (mutations in an ancestral haplotype should be contained in its descendant haplotypes), node size (the sample count for a given node), and sampling time. McAN is two orders of magnitude faster than state-of-the-art algorithms, making it suitable for the analysis of massive numbers of sequences. Availability: Source code is written in C/C++ and available at https://github.com/Theory-Lun/McAN and https://ngdc.cncb.ac.cn/biocode/tools/BT007301 under the MIT license. The online web service of McAN is available at https://ngdc.cncb.ac.cn/ncov/online/tool/haplotype. SARS-CoV-2 datasets are available at https://ngdc.cncb.ac.cn/ncov/.
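
The three criteria named in the abstract can be illustrated with a small Python sketch. It is not the authors' C/C++ implementation: the `Haplotype` class, the `pick_parents` function, the cost ordering (fewest extra mutations, then larger node size, then earlier sampling date), and the example mutations are all assumptions made for illustration. Because a parent must carry a proper subset of the child's mutations, the candidate edges cannot form a cycle, so picking the cheapest incoming edge for each node already yields a minimum-cost forest under this assumed cost.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Haplotype:
    """Hypothetical record for one network node (not McAN's data model)."""
    name: str
    mutations: frozenset  # mutations relative to a reference genome
    size: int             # number of samples sharing this haplotype
    first_seen: date      # earliest sampling date


def pick_parents(haplotypes):
    """Map each haplotype name to its chosen parent name (None for roots).

    A candidate parent must carry a proper subset of the child's mutations.
    Assumed cost: fewest additional mutations, then larger node size,
    then earlier sampling date.
    """
    parents = {}
    for child in haplotypes:
        candidates = [p for p in haplotypes if p.mutations < child.mutations]
        if not candidates:
            parents[child.name] = None
            continue
        best = min(
            candidates,
            key=lambda p: (len(child.mutations - p.mutations), -p.size, p.first_seen),
        )
        parents[child.name] = best.name
    return parents


# Made-up example data; mutation labels are placeholders.
example = [
    Haplotype("ref", frozenset(), 10, date(2020, 1, 1)),
    Haplotype("A", frozenset({"m1"}), 5, date(2020, 2, 1)),
    Haplotype("B", frozenset({"m1", "m2"}), 3, date(2020, 3, 1)),
    Haplotype("C", frozenset({"m1", "m2", "m3"}), 1, date(2020, 4, 1)),
    Haplotype("D", frozenset({"m4"}), 2, date(2020, 2, 15)),
]
network = pick_parents(example)
```

In this toy input, `network` links each haplotype to the closest haplotype whose mutations it fully contains; the real tool may weigh node size and sampling time differently.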

Genomic epidemiology of SARS-CoV-2 in Pakistan

Shuhui Song; Cuiping Li; Lu Kang; Dongmei Tian; Nazish Badar; Wentai Ma; Shilei Zhao; Xuan Jiang; Chun Wang; Yongqiao Sun; Wenjie Li; Meng Lei; Shuangli Li; Qiuhui Qi; Aamer Ikram; Muhammad Salman; Massab Umair; Huma Shireen; Fatima Batool; Bing Zhang; Hua Chen; Yungui Yang; Amir Ali Abbasi; Mingkun Li; Yongbiao Xue; Yiming Bao.

Preprint in English | medRxiv | ID: ppmedrxiv-21255875

ABSTRACT

Pakistan has been severely affected by the COVID-19 pandemic. To investigate the initial introductions and transmission of SARS-CoV-2 in the country, we performed the largest genomic epidemiology study of COVID-19 in Pakistan and generated 150 complete SARS-CoV-2 genome sequences from samples collected before June 1, 2020. We identified a total of 347 variants, 29 of which were over-represented in Pakistan. We also found over one thousand intra-host single-nucleotide variants. Several of them occurred concurrently, indicating possible interactions among them. Some of the hypermutable positions were not observed in the polymorphism data, suggesting strong purifying selection. The genomic epidemiology revealed five distinct spreading clusters. The largest cluster consisted of 74 viruses that were derived from different geographic locations and formed a deep hierarchical structure, indicating an extensive and persistent nationwide transmission of the virus, to which a signature mutation of this cluster probably contributed. Twenty-eight putative international introductions were identified, several of which were consistent with the epidemiological investigations. No progeny of any of these 150 viruses has been found outside Pakistan, most likely due to the nonpharmacological interventions used to control the virus. This study has inferred the introductions and transmission of SARS-CoV-2 in Pakistan, which could provide guidance for an effective disease control strategy.

The global landscape of SARS-CoV-2 genomes, variants, and haplotypes in 2019nCoVR

Shuhui Song; Lina Ma; Dong Zou; Dongmei Tian; Cuiping Li; Junwei Zhu; Meili Chen; Anke Wang; Yingke Ma; Mengwei Li; Xufei Teng; Ying Cui; Guangya Duan; Mochen Zhang; Tong Jin; Chengmin Shi; Zhenglin Du; Yadong Zhang; Chuandong Liu; Rujiao Li; Jingyao Zeng; Lili Hao; Shuai Jiang; Hua Chen; Dali Han; Jingfa Xiao; Zhang Zhang; Wenming Zhao; Yongbiao Xue; Yiming Bao.

Preprint in English | bioRxiv | ID: ppbiorxiv-273235

ABSTRACT

On 22 January 2020, the National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB), created the 2019 Novel Coronavirus Resource (2019nCoVR), an open-access SARS-CoV-2 information resource. 2019nCoVR features a comprehensive integration of sequence and clinical information for all publicly available SARS-CoV-2 isolates, which are manually curated with value-added annotations and quality-evaluated by our in-house automated pipeline. Of particular note, 2019nCoVR performs systematic analyses to generate a dynamic landscape of SARS-CoV-2 genomic variations at a global scale. It provides all identified variants and detailed statistics for each virus isolate, and aggregates the quality score, functional annotation, and population frequency for each variant. It also generates visualizations of the spatiotemporal change for each variant and yields historical viral haplotype network maps for the course of the outbreak from all complete and high-quality genomes. Moreover, 2019nCoVR provides a full collection of SARS-CoV-2 relevant literature on COVID-19 (Coronavirus Disease 2019), including published papers from PubMed as well as preprints from services such as bioRxiv and medRxiv through Europe PMC. Furthermore, by linking with relevant databases in CNCB-NGDC, 2019nCoVR offers data submission services for raw sequence reads and assembled genomes, and data sharing with the National Center for Biotechnology Information. Collectively, all SARS-CoV-2 genome sequences, variants, haplotypes, and literature are updated daily to provide timely information, making 2019nCoVR a valuable resource for the global research community. 2019nCoVR is accessible at https://bigd.big.ac.cn/ncov/.

Compositional Variability and Mutation Spectra of Monophyletic SARS-CoV-2 Clades

Xufei Teng; Qianpeng Li; Zhao Li; Yuansheng Zhang; Guangyi Niu; Jingfa Xiao; Jun Yu; Zhang Zhang; Shuhui Song.

Preprint in English | bioRxiv | ID: ppbiorxiv-267781

ABSTRACT

COVID-19 and its causative pathogen SARS-CoV-2 have rushed the world into a staggering pandemic in a few months and a global fight against both is still going on. Here, we describe an analysis procedure where genome composition and its variables are related, through the genetic code, to molecular mechanisms based on understanding of RNA replication and its feedback loop from mutation to viral proteome sequence fraternity including effective sites on replicase-transcriptase complex. Our analysis starts with primary sequence information and identity-based phylogeny based on 22,051 SARS-CoV-2 genome sequences and evaluation of sequence variation patterns as mutation spectrum and its 12 permutations among organized clades tailored to two key mechanisms: strand-biased and function-associated mutations. Our findings include: (1) The most dominant mutation is C-to-U permutation whose abundant second-codon-position counts alter amino acid composition toward higher molecular weight and lower hydrophobicity albeit assumed most slightly deleterious. (2) The second abundance group includes: three negative-strand mutations U-to-C, A-to-G, G-to-A and a positive-strand mutation G-to-U generated through an identical mechanism as C-to-U. (3) A clade-associated and biased mutation trend is found attributable to elevated level of the negative-sense strand synthesis. (4) Within-clade permutation variation is very informative for associating non-synonymous mutations and viral proteome changes. These findings demand a bioinformatics platform where emerging mutations are mapped on to mostly subtle but fast-adjusting viral proteomes and transcriptomes to provide biological and clinical information after logical convergence for effective pharmaceutical and diagnostic applications. Such thoughts and actions are in desperate need, especially in the middle of the War against COVID-19.
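
The "mutation spectrum and its 12 permutations" mentioned above refers to the 12 possible ordered substitutions among the four RNA bases. As a minimal sketch, assuming two already aligned sequences of equal length, the counts could be tallied as follows. The function name `mutation_spectrum` and the toy sequences are illustrative and are not taken from the preprint.

```python
from collections import Counter
from itertools import permutations

BASES = "ACGU"


def mutation_spectrum(reference, sample):
    """Count each of the 12 substitution types between two aligned sequences.

    DNA-style input is accepted: T is treated as U. Positions with gaps or
    ambiguous bases are skipped.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    # Start every one of the 12 ordered base pairs at zero.
    counts = Counter({f"{a}>{b}": 0 for a, b in permutations(BASES, 2)})
    ref_rna = reference.upper().replace("T", "U")
    alt_rna = sample.upper().replace("T", "U")
    for ref, alt in zip(ref_rna, alt_rna):
        if ref != alt and ref in BASES and alt in BASES:
            counts[f"{ref}>{alt}"] += 1
    return counts


# Toy sequences: two C-to-U changes and one A-to-G change.
spectrum = mutation_spectrum("ACGTCA", "ATGTTG")
```

A strand-bias analysis such as the one described in the abstract would compare these counts across clades; that step is not shown here.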

Whole Genome Analyses of Chinese Population and De Novo Assembly of A Northern Han Genome / 基因组蛋白质组与生物信息学报·英文版

Zhenglin DU; Liang MA; Hongzhu QU; Wei CHEN; Bing ZHANG; Xi LU; Weibo ZHAI; Xin SHENG; Yongqiao SUN; Wenjie LI; Meng LEI; Qiuhui QI; Na YUAN; Shuo SHI; Jingyao ZENG; Jinyue WANG; Yadong YANG; Qi LIU; Yaqiang HONG; Lili DONG; Zhewen ZHANG; Dong ZOU; Yanqing WANG; Shuhui SONG; Fan LIU; Xiangdong FANG; Hua CHEN; Xin LIU; Jingfa XIAO; Changqing ZENG.

Genomics, Proteomics & Bioinformatics ; (4): 229-247, 2019.

Article in English | WPRIM (Western Pacific) | ID: wpr-772932

ABSTRACT

Unraveling the genetic mechanisms of disease and physiological traits requires comprehensive sequencing analysis of large samples from Chinese populations. Here, we report the primary results of the Chinese Academy of Sciences Precision Medicine Initiative (CASPMI) project launched by the Chinese Academy of Sciences, including the de novo assembly of a northern Han reference genome (NH1.0) and whole-genome analyses of 597 healthy people from most areas of China. Given that the two existing reference genomes for Han Chinese (YH and HX1) were both from the south, we constructed NH1.0, a new reference genome from a northern individual, by combining PacBio sequencing, 10× Genomics sequencing, and Bionano mapping. Using this integrated approach, we obtained an N50 scaffold size of 46.63 Mb for the NH1.0 genome and performed a comparative genome analysis of NH1.0 with YH and HX1. To generate a genomic variation map of Chinese populations, we performed whole-genome sequencing of 597 participants and identified 24.85 million (M) single nucleotide variants (SNVs), 3.85 M small indels, and 106,382 structural variations. In the association analysis with collected phenotypes, we found that the T allele of rs1549293 in KAT8 correlated significantly with waist circumference in northern Han males. Moreover, significant genetic diversity in MTHFR, TCN2, FADS1, and FADS2, which are associated with circulating folate, vitamin B12, or lipid metabolism, was observed between northerners and southerners. In particular, for the homocysteine-increasing allele of rs1801133 (MTHFR 677T), we hypothesize that there exists a "comfort" zone for a high frequency of 677T between latitudes of 35 and 45 degrees North. Taken together, our results provide a high-quality northern Han reference genome and novel population-specific data sets of genetic variants for use in personalized and precision medicine.
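
For readers unfamiliar with the N50 statistic quoted above: it is the length of the scaffold at which the running total, taken from the longest scaffold downward, first reaches half of the total assembly length. A minimal Python sketch follows. The function name `n50` and the example lengths are illustrative and are not data from the article.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that pushes the running sum to half the total.
        if running >= half_total:
            return length
    raise ValueError("at least one length is required")


# Made-up lengths with a total of 300; 80 + 70 reaches the halfway mark of 150.
example_n50 = n50([80, 70, 50, 40, 30, 20, 10])
```

Here `example_n50` is 70, meaning that scaffolds of length 70 or more cover at least half of the toy assembly.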

Rice Genomics: over the Past Two Decades and into the Future / 基因组蛋白质组与生物信息学报·英文版

Shuhui SONG; Dongmei TIAN; Zhang ZHANG; Songnian HU; Jun YU.

Genomics, Proteomics & Bioinformatics ; (4): 397-404, 2018.

Article in English | WPRIM (Western Pacific) | ID: wpr-772958

ABSTRACT

Domestic rice (Oryza sativa L.) is one of the most important cereal crops, feeding a large share of the world's population. Along with various high-throughput genome sequencing projects, rice genomics has been making great headway toward direct field applications of basic research advances in understanding the molecular mechanisms of agronomical traits and utilizing diverse germplasm resources. Here, we briefly review its achievements over the past two decades and present the potential for its bright future.

Subject(s)

Crops, Agricultural/genetics; Genome, Plant/genetics; Genomics; High-Throughput Nucleotide Sequencing; Oryza/genetics; Phenotype
