1.
Heredity (Edinb) ; 122(5): 660-671, 2019 05.
Article in English | MEDLINE | ID: mdl-30443009

ABSTRACT

Association studies have been successful at identifying genomic regions associated with important traits, but they routinely employ models that consider only the additive contribution of an individual marker. Because quantitative trait variability typically arises from multiple additive and non-additive sources, statistical approaches that include main and two-way interaction effects of several markers in one model could lead to unprecedented characterization of these sources. Here we examine the ability of one such approach, the Stepwise Procedure for constructing an Additive and Epistatic Multi-Locus model (SPAEML), to detect additive and epistatic signals simulated using maize and human marker data. Our results revealed that SPAEML was capable of detecting quantitative trait nucleotides (QTNs) at sample sizes as low as n = 300 and consistently specified signals as additive or epistatic at larger sizes. Sample size and minor allele frequency had a major influence on SPAEML's ability to distinguish between additive and epistatic signals, while the number of markers tested did not. We conclude that SPAEML is a useful approach for further elucidating the additive and epistatic sources contributing to trait variability when applied to a small subset of genome-wide markers located within specific genomic regions identified by a priori analyses.
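The stepwise search described above can be illustrated with a minimal sketch: greedy forward selection over additive marker terms and pairwise (epistatic) interaction terms, scored by BIC. This is an illustration of the general technique, not the authors' SPAEML implementation; the function names, the multiplicative interaction coding, and the BIC stopping rule are assumptions.

```python
import numpy as np

def bic(y, yhat, k):
    """Bayesian information criterion for a least-squares fit with k parameters."""
    n = len(y)
    rss = float(np.sum((y - yhat) ** 2))
    return n * np.log(rss / n) + k * np.log(n)

def forward_select(genos, y, max_terms=5):
    """Greedy forward selection over additive and pairwise (epistatic) terms.

    genos: (n_samples, n_markers) array of 0/1/2 minor-allele counts.
    Returns the chosen terms, e.g. ('add', 2) or ('epi', 0, 4).
    """
    n, m = genos.shape
    candidates = [("add", j) for j in range(m)]
    candidates += [("epi", i, j) for i in range(m) for j in range(i + 1, m)]

    def column(term):
        if term[0] == "add":
            return genos[:, term[1]]
        # Multiplicative coding of the two-way epistatic interaction.
        return genos[:, term[1]] * genos[:, term[2]]

    chosen, X = [], [np.ones(n)]           # start from an intercept-only model
    best = bic(y, np.full(n, y.mean()), 1)
    while len(chosen) < max_terms:
        scores = []
        for t in candidates:
            if t in chosen:
                continue
            A = np.column_stack(X + [column(t)])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            scores.append((bic(y, A @ beta, A.shape[1]), t))
        score, t = min(scores)
        if score >= best:                  # no candidate improves BIC: stop
            break
        best = score
        chosen.append(t)
        X.append(column(t))
    return chosen
```

On simulated data with one additive QTN and one epistatic pair, the procedure recovers both kinds of signal, mirroring the simulation design the abstract describes.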


Subject(s)
Epistasis, Genetic , Genome-Wide Association Study/methods , Models, Genetic , Quantitative Trait Loci/genetics , Chromosome Mapping , Gene Frequency , Genetic Markers/genetics , Genetic Variation , Humans , Phenotype , Sample Size , Zea mays/genetics
2.
BMC Bioinformatics ; 19(1): 139, 2018 04 16.
Article in English | MEDLINE | ID: mdl-29661148

ABSTRACT

BACKGROUND: After decades of identifying risk factors using array-based genome-wide association studies (GWAS), genetic research of complex diseases has shifted to sequencing-based rare variant discovery. This requires large sample sizes for statistical power and has raised questions about whether current variant calling practices are adequate for large cohorts. It is well known that variants called by different pipelines are discordant, and that using a single pipeline always misses true variants identifiable exclusively by other pipelines. Nonetheless, it is common practice today to call variants with a single pipeline because of computational cost, on the assumption that false negative calls are a small percentage of the total. RESULTS: We analyzed 10,000 exomes from the Alzheimer's Disease Sequencing Project (ADSP) using multiple analytic pipelines consisting of different read aligners and variant calling strategies. We compared variants identified by the two aligners in 50, 100, 200, 500, 1000, and 1952 samples, and variants identified by adding single-sample genotyping to the default multi-sample joint genotyping in 50, 100, 500, 2000, 5000, and 10,000 samples. We found that the number of high-quality variants missed by a single pipeline increased with sample size. By combining two read aligners and two variant calling strategies, we rescued 30% of pass-QC variants at a sample size of 2000, and 56% at 10,000 samples. The rescued variants had higher proportions of low-frequency (minor allele frequency [MAF] 1-5%) and rare (MAF < 1%) variants, precisely the variants of greatest interest. In 660 Alzheimer's disease cases with earlier onset ages (≤65), 4 of 13 (31%) previously published rare pathogenic and protective mutations in the APP, PSEN1, and PSEN2 genes were undetected by the default one-pipeline approach but recovered by the multi-pipeline approach.
CONCLUSIONS: Identification of the complete variant set from sequencing data is a prerequisite for genetic association analyses. The current practice of calling genetic variants from sequencing data with a single bioinformatics pipeline is no longer adequate for increasingly large projects: the number and percentage of variants that pass quality filters but are missed by the one-pipeline approach increases rapidly with sample size.
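The rescue effect described here is, at bottom, set arithmetic over pass-QC call sets: variants identifiable only by a non-default aligner or calling strategy are lost to a single-pipeline analysis. A minimal sketch with hypothetical call sets (the tuples below are illustrative, not ADSP data):

```python
def rescued_variants(default_calls, other_calls):
    """Variants missed by the default pipeline but found by other pipelines.

    Each call set is a set of (chrom, pos, ref, alt) tuples that passed QC.
    Returns (rescued_set, rescued_fraction_of_union).
    """
    union = set(default_calls)
    for calls in other_calls:
        union |= set(calls)
    rescued = union - set(default_calls)
    return rescued, len(rescued) / len(union)

# Hypothetical call sets from a default pipeline, an alternative aligner,
# and an added single-sample genotyping strategy.
default = {("1", 100, "A", "G"), ("1", 200, "C", "T")}
alt_aligner = {("1", 100, "A", "G"), ("2", 50, "G", "A")}
single_sample = {("1", 200, "C", "T"), ("2", 75, "T", "C")}
rescued, frac = rescued_variants(default, [alt_aligner, single_sample])
```

In this toy example, half of the union of pass-QC variants would be missed by the default pipeline alone, which is the quantity the study tracks as a function of sample size.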


Subject(s)
Computational Biology/methods , Genetic Variation , Alzheimer Disease/genetics , Base Composition/genetics , Drug Discovery , Genome , Genotype , Genotyping Techniques , Humans , Sample Size , Sequence Alignment
3.
BMC Bioinformatics ; 19(1): 457, 2018 Nov 29.
Article in English | MEDLINE | ID: mdl-30486782

ABSTRACT

BACKGROUND: The Pan-African bioinformatics network H3ABioNet comprises 27 research institutions in 17 African countries. H3ABioNet is part of the Human Heredity and Health in Africa (H3Africa) program, an African-led research consortium funded by the US National Institutes of Health and the UK Wellcome Trust, which aims to use genomics to study and improve the health of Africans. A key role of H3ABioNet is to support H3Africa projects by building bioinformatics infrastructure, such as portable and reproducible bioinformatics workflows for use on heterogeneous African computing environments. Processing and analysis of genomic data is an example of a big-data application requiring complex, interdependent data analysis workflows. Such workflows take the primary and secondary input data through several computationally intensive processing steps using different software packages, where some of the outputs form inputs for other steps. Implementing scalable, reproducible, portable, and easy-to-use workflows is particularly challenging. RESULTS: H3ABioNet has built four workflows to support (1) the calling of variants from high-throughput sequencing data; (2) the analysis of microbial populations from 16S rDNA sequence data; (3) genotyping and genome-wide association studies; and (4) single nucleotide polymorphism imputation. A week-long hackathon was organized in August 2016 with participants from six African bioinformatics groups and US and European collaborators. Two of the workflows are built with the Common Workflow Language (CWL) framework and two with Nextflow. All four workflows are containerized with Docker for improved portability and reproducibility, and are publicly available for use by members of the H3Africa consortium and the international research community.
CONCLUSION: The H3ABioNet workflows were implemented with ease of use, reproducibility, and portability in mind, following state-of-the-art bioinformatics data processing practices. They currently serve H3Africa consortium projects, and all four are publicly available for research scientists worldwide to use and adapt to their needs. The workflows will help develop bioinformatics capacity, assist genomics research within Africa, and increase the scientific output of H3Africa and its Pan-African Bioinformatics Network.
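The "some outputs form inputs for other steps" structure that makes these workflows hard to implement by hand is a dependency graph, which engines such as CWL and Nextflow resolve automatically. The idea can be sketched with Python's standard-library topological sorter (step names are hypothetical, not the actual H3ABioNet workflow steps):

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Each step maps to the set of steps whose outputs it consumes.
steps = {
    "align_reads": set(),
    "mark_duplicates": {"align_reads"},
    "qc_report": {"align_reads"},
    "call_variants": {"mark_duplicates"},
    "joint_genotype": {"call_variants"},
    "filter_variants": {"joint_genotype"},
}

# static_order() yields the steps so every predecessor runs first.
order = list(TopologicalSorter(steps).static_order())
```

A workflow engine does the same resolution, but additionally schedules independent steps (here, `qc_report` and `mark_duplicates`) in parallel and dispatches each into its container.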


Subject(s)
Computational Biology/methods , Genomics/methods , Africa , Humans , Reproducibility of Results
4.
PLoS Comput Biol ; 13(6): e1005419, 2017 Jun.
Article in English | MEDLINE | ID: mdl-28570565

ABSTRACT

The H3ABioNet pan-African bioinformatics network, which is funded to support the Human Heredity and Health in Africa (H3Africa) program, has developed node-assessment exercises to gauge the ability of its participating research and service groups to analyze the typical genome-wide datasets being generated by H3Africa research groups. We describe a framework for assessing computational genomics analysis skills, which includes standard operating procedures, training and test datasets, and a process for administering the exercise. We present the experiences of three research groups that have taken the exercise and its impact on their ability to manage complex projects. Finally, we discuss the reasons why many H3ABioNet nodes have so far declined to participate and potential strategies to encourage them to do so.


Subject(s)
Black People/genetics , Databases, Genetic , Genomics/methods , Database Management Systems , Developing Countries , Humans , Nigeria , South Africa
5.
PLoS One ; 19(1): e0296283, 2024.
Article in English | MEDLINE | ID: mdl-38181002

ABSTRACT

West Nile virus (WNV), a flavivirus transmitted by mosquito bites, primarily causes mild symptoms but can also be fatal. Predicting and controlling the spread of WNV is therefore essential for public health in endemic areas. We hypothesized that socioeconomic factors may influence human risk from WNV. We analyzed weather, land use, mosquito surveillance, and socioeconomic variables for predicting WNV cases in 1-km hexagonal grids across the Chicago metropolitan area. Using a two-stage LightGBM approach, we found that hexagons with incomes above and below the median are influenced by the same top predictors: weather factors and mosquito infection rates were the strongest common factors, while land use and socioeconomic variables made relatively small contributions to predicting WNV cases. LightGBM handles unbalanced datasets well and provides meaningful predictions of the risk of epidemic disease outbreaks.


Subject(s)
West Nile Fever , West Nile virus , Humans , West Nile Fever/epidemiology , Chicago/epidemiology , Risk Factors , Disease Outbreaks
6.
PLoS One ; 16(4): e0249305, 2021.
Article in English | MEDLINE | ID: mdl-33861770

ABSTRACT

Genetic studies have shifted to sequencing-based rare variant discovery after decades of success in identifying common disease variants through genome-wide association studies using single nucleotide polymorphism chips. Sequencing-based studies require large sample sizes for statistical power and therefore often inadvertently introduce batch effects, because samples are typically collected, processed, and sequenced at multiple centers. Conventionally, batch effects are first detected and visualized using principal component analysis and then controlled for by including batch covariates in the disease association models. Because all variants included in sequencing-based association analyses have passed sequencing-related quality control, this conventional approach treats every variant as equal and ignores the substantial remaining differences in variant quality and characteristics, such as genotype quality scores, alternative allele fractions (the fraction of reads supporting the alternative allele at a variant position), and sequencing depths. In the Alzheimer's Disease Sequencing Project (ADSP) exome dataset of 9,904 cases and controls, we discovered hidden variant-level differences between sample batches from three sequencing centers and two exome capture kits. Although sequencing center was included as a covariate in our association models, we observed variant-level differences in genotype quality and alternative allele fraction between samples processed with the two exome capture kits that significantly impacted both the confidence of variant detection and the identification of disease-associated variants. Furthermore, we found that a subset of top disease-risk variants came exclusively from samples processed with the one exome capture kit that was more effective at capturing the alternative alleles.
Our findings highlight the importance of additional variant-level quality control for large sequencing-based genetic studies. More importantly, we demonstrate that automatically filtering out variants with batch differences may lead to false negatives when the batch discordances stem largely from quality differences and the batch-specific variants are in fact of higher quality.
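The variant-level check the study argues for can be sketched as a comparison of alternative allele fractions between capture-kit batches. A minimal screen is shown below; the 0.1 gap threshold and the function name are illustrative assumptions, not the criteria used in the ADSP analysis:

```python
from statistics import mean

def batch_discordant(aaf_kit1, aaf_kit2, min_gap=0.1):
    """Flag a variant whose mean alternative allele fraction (AAF) differs
    between two capture-kit batches by more than min_gap.

    aaf_kit1, aaf_kit2: per-sample AAFs for heterozygous carriers in each
    batch (an ideal heterozygote has AAF near 0.5).
    Returns (flagged, observed_gap).
    """
    gap = abs(mean(aaf_kit1) - mean(aaf_kit2))
    return gap > min_gap, gap

# Illustrative variant: one kit captures alternative alleles poorly,
# pulling its heterozygote AAFs well below 0.5.
flagged, gap = batch_discordant([0.48, 0.52, 0.50], [0.30, 0.28, 0.32])
```

A production screen would use a proper two-sample test and multiple-testing control, but the point the study makes survives even in this form: a flagged variant may reflect one batch calling it badly rather than the variant being false, so discarding all flagged variants can discard true signals.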


Subject(s)
Genome-Wide Association Study , High-Throughput Nucleotide Sequencing/methods , Alleles , Alzheimer Disease/genetics , Alzheimer Disease/pathology , Apolipoproteins E/genetics , Databases, Genetic , Exome , Female , Gene Frequency , Genetic Predisposition to Disease , Genotype , Humans , Male , Membrane Transport Proteins/genetics , Mitochondrial Precursor Protein Import Complex Proteins , Polymorphism, Single Nucleotide , Principal Component Analysis , Sequence Analysis, DNA
7.
AAS Open Res ; 1: 9, 2018.
Article in English | MEDLINE | ID: mdl-32382696

ABSTRACT

The need for portable and reproducible genomics analysis pipelines is growing globally as well as in Africa, especially with the growth of collaborative projects like the Human Heredity and Health in Africa consortium (H3Africa). The Pan-African H3Africa Bioinformatics Network (H3ABioNet) recognized the need for portable, reproducible pipelines adapted to heterogeneous compute environments, and for nurturing technical expertise in workflow languages and containerization technologies. To address this need, in 2016 H3ABioNet arranged its first Cloud Computing and Reproducible Workflows Hackathon, with the purpose of building key genomics analysis pipelines able to run on heterogeneous computing environments and to meet the needs of H3Africa research projects. This paper describes the preparations for this hackathon and reflects on the lessons learned about its impact on building the technical and scientific expertise of African researchers. The workflows developed were made publicly available in GitHub repositories and deposited as container images on quay.io.
