RESUMO
BACKGROUND: Polygenic risk scores (PRS) derived from European individuals have reduced portability across global populations, limiting their clinical implementation at worldwide scale. Here, we investigate the performance of a wide range of PRS models across four ancestry groups (Africans, Europeans, East Asians, and South Asians) for 14 conditions of high-medical interest. METHODS: To select the best-performing model per trait, we first compared PRS performances for publicly available scores, and constructed new models using different methods (LDpred2, PRS-CSx and SNPnet). We used 285 K European individuals from the UK Biobank (UKBB) for training and 18 K, including diverse ancestries, for testing. We then evaluated PRS portability for the best models in Europeans and compared their accuracies with respect to the best PRS per ancestry. Finally, we validated the selected PRS models using an independent set of 8,417 individuals from Biobank of the Americas-Genomelink (BbofA-GL); and performed a PRS-Phewas. RESULTS: We confirmed a decay in PRS performances relative to Europeans when the evaluation was conducted using the best-PRS model for Europeans (51.3% for South Asians, 46.6% for East Asians and 39.4% for Africans). We observed an improvement in the PRS performances when specifically selecting ancestry specific PRS models (phenotype variance increase: 1.62 for Africans, 1.40 for South Asians and 0.96 for East Asians). Additionally, when we selected the optimal model conditional on ancestry for CAD, HDL-C and LDL-C, hypertension, hypothyroidism and T2D, PRS performance for studied populations was more comparable to what was observed in Europeans. Finally, we were able to independently validate tested models for Europeans, and conducted a PRS-Phewas, identifying cross-trait interplay between cardiometabolic conditions, and between immune-mediated components. CONCLUSION: Our work comprehensively evaluated PRS accuracy across a wide range of phenotypes, reducing the uncertainty with respect to which PRS model to choose and in which ancestry group. This evaluation has let us identify specific conditions where implementing risk-prioritization strategies could have practical utility across diverse ancestral groups, contributing to democratizing the implementation of PRS.
Assuntos
Predisposição Genética para Doença , Estratificação de Risco Genético , Feminino , Humanos , Povo Asiático/genética , Estudo de Associação Genômica Ampla , Modelos Genéticos , Polimorfismo de Nucleotídeo Único , População Branca/genética , População Negra/genéticaRESUMO
MOTIVATION: Local ancestry inference (LAI) is the high resolution prediction of ancestry labels along a DNA sequence. LAI is important in the study of human history and migrations, and it is beginning to play a role in precision medicine applications including ancestry-adjusted genome-wide association studies (GWASs) and polygenic risk scores (PRSs). Existing LAI models do not generalize well between species, chromosomes or even ancestry groups, requiring re-training for each different setting. Furthermore, such methods can lack interpretability, which is an important element in each of these applications. RESULTS: We present SALAI-Net, a portable statistical LAI method that can be applied on any set of species and ancestries (species-agnostic), requiring only haplotype data and no other biological parameters. Inspired by identity by descent methods, SALAI-Net estimates population labels for each segment of DNA by performing a reference matching approach, which leads to an interpretable and fast technique. We benchmark our models on whole-genome data of humans and we test these models' ability to generalize to dog breeds when trained on human data. SALAI-Net outperforms previous methods in terms of balanced accuracy, while generalizing between different settings, species and datasets. Moreover, it is up to two orders of magnitude faster and uses considerably less RAM memory than competing methods. AVAILABILITY AND IMPLEMENTATION: We provide an open source implementation and links to publicly available data at github.com/AI-sandbox/SALAI-Net. Data is publicly available as follows: https://www.internationalgenome.org (1000 Genomes), https://www.simonsfoundation.org/simons-genome-diversity-project (Simons Genome Diversity Project), https://www.sanger.ac.uk/resources/downloads/human/hapmap3.html (HapMap), ftp://ngs.sanger.ac.uk/production/hgdp/hgdp_wgs.20190516 (Human Genome Diversity Project) and https://www.ncbi.nlm.nih.gov/bioproject/PRJNA448733 (Canid genomes). SUPPLEMENTARY INFORMATION: Supplementary data are available from Bioinformatics online.
Assuntos
Estudo de Associação Genômica Ampla , Animais , Cães , Haplótipos , HumanosRESUMO
The estimation of genetic clusters using genomic data has application from genome-wide association studies (GWAS) to demographic history to polygenic risk scores (PRS) and is expected to play an important role in the analyses of increasingly diverse, large-scale cohorts. However, existing methods are computationally-intensive, prohibitively so in the case of nationwide biobanks. Here we explore Archetypal Analysis as an efficient, unsupervised approach for identifying genetic clusters and for associating individuals with them. Such unsupervised approaches help avoid conflating socially constructed ethnic labels with genetic clusters by eliminating the need for exogenous training labels. We show that Archetypal Analysis yields similar cluster structure to existing unsupervised methods such as ADMIXTURE and provides interpretative advantages. More importantly, we show that since Archetypal Analysis can be used with lower-dimensional representations of genetic data, significant reductions in computational time and memory requirements are possible. When Archetypal Analysis is run in such a fashion, it takes several orders of magnitude less compute time than the current standard, ADMIXTURE. Finally, we demonstrate uses ranging across datasets from humans to canids.
Assuntos
Estudo de Associação Genômica Ampla , Polimorfismo de Nucleotídeo Único , Predisposição Genética para Doença , Genética Populacional , Genoma , Genômica/métodos , Humanos , Polimorfismo de Nucleotídeo Único/genéticaRESUMO
The lack of diversity in genomic datasets, currently skewed towards individuals of European ancestry, presents a challenge in developing inclusive biomedical models. The scarcity of such data is particularly evident in labeled datasets that include genomic data linked to electronic health records. To address this gap, this paper presents PopGenAdapt, a genotype-to-phenotype prediction model which adopts semi-supervised domain adaptation (SSDA) techniques originally proposed for computer vision. PopGenAdapt is designed to leverage the substantial labeled data available from individuals of European ancestry, as well as the limited labeled and the larger amount of unlabeled data from currently underrepresented populations. The method is evaluated in underrepresented populations from Nigeria, Sri Lanka, and Hawaii for the prediction of several disease outcomes. The results suggest a significant improvement in the performance of genotype-to-phenotype models for these populations over state-of-the-art supervised learning methods, setting SSDA as a promising strategy for creating more inclusive machine learning models in biomedical research.Our code is available at https://github.com/AI-sandbox/PopGenAdapt.
Assuntos
Pesquisa Biomédica , Biologia Computacional , Humanos , Registros Eletrônicos de Saúde , Fenótipo , Genótipo , Aprendizado de Máquina SupervisionadoRESUMO
The SARS-CoV-2 pandemic has differentially impacted populations across race and ethnicity. A multi-omic approach represents a powerful tool to examine risk across multi-ancestry genomes. We leverage a pandemic tracking strategy in which we sequence viral and host genomes and transcriptomes from nasopharyngeal swabs of 1049 individuals (736 SARS-CoV-2 positive and 313 SARS-CoV-2 negative) and integrate them with digital phenotypes from electronic health records from a diverse catchment area in Northern California. Genome-wide association disaggregated by admixture mapping reveals novel COVID-19-severity-associated regions containing previously reported markers of neurologic, pulmonary and viral disease susceptibility. Phylodynamic tracking of consensus viral genomes reveals no association with disease severity or inferred ancestry. Summary data from multiomic investigation reveals metagenomic and HLA associations with severe COVID-19. The wealth of data available from residual nasopharyngeal swabs in combination with clinical data abstracted automatically at scale highlights a powerful strategy for pandemic tracking, and reveals distinct epidemiologic, genetic, and biological associations for those at the highest risk.
Assuntos
COVID-19 , Pandemias , COVID-19/epidemiologia , Genoma Viral , Estudo de Associação Genômica Ampla , Humanos , SARS-CoV-2/genéticaRESUMO
Purpose: Fluorescence microscopy visualizes three-dimensional subcellular structures in tissue with two-photon microscopy achieving deeper penetration into tissue. Nuclei detection, which is essential for analyzing tissue for clinical and research purposes, remains a challenging problem due to the spatial variability of nuclei. Recent advancements in deep learning techniques have enabled the analysis of fluorescence microscopy data to localize and segment nuclei. However, these localization or segmentation techniques would require additional steps to extract characteristics of nuclei. We develop a 3D convolutional neural network, called Sphere Estimation Network (SphEsNet), to extract characteristics of nuclei without any postprocessing steps. Approach: To simultaneously estimate the center locations of nuclei and their sizes, SphEsNet is composed of two branches to localize nuclei center coordinates and to estimate their radii. Synthetic microscopy volumes automatically generated using a spatially constrained cycle-consistent adversarial network are used for training the network because manually generating 3D real ground truth volumes would be extremely tedious. Results: Three SphEsNet models based on the size of nuclei were trained and tested on five real fluorescence microscopy data sets from rat kidney and mouse intestine. Our method can successfully detect nuclei in multiple locations with various sizes. In addition, our method was compared with other techniques and outperformed them based on object-level precision, recall, and F 1 score. Our model achieved 89.90% for F 1 score. Conclusions: SphEsNet can simultaneously localize nuclei and estimate their size without additional steps. SphEsNet can be potentially used to extract more information from nuclei in fluorescence microscopy images.