Search | VHL Search Portal

Bounding measures of genetic similarity and diversity using majorization.

Aw, Alan J; Rosenberg, Noah A.

J Math Biol ; 77(3): 711-737, 2018 Sep.

Article in English | MEDLINE | ID: mdl-29569105

ABSTRACT

The homozygosity and the frequency of the most frequent allele at a polymorphic genetic locus have a close mathematical relationship, so that each quantity places a tight constraint on the other. We use the theory of majorization to provide a simplified derivation of the bounds on homozygosity J in terms of the frequency M of the most frequent allele. The method not only enables simpler derivations of known bounds on J in terms of M, it also produces analogous bounds on entropy statistics for genetic diversity and on homozygosity-like statistics that range in their emphasis on the most frequent allele in relation to other alleles. We illustrate the constraints on the statistics using data from human populations. The approach suggests the potential of the majorization method as a tool for deriving inequalities that characterize mathematical relationships between statistics in population genetics.
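As a rough illustration of the constraint described in this abstract, the sketch below computes homozygosity J (the sum of squared allele frequencies) and the frequency M of the most frequent allele, then checks J against bounds in M. The bounds are the ones that follow from Schur-convexity of J when the number of alleles is left unrestricted; they are assumed here for illustration, since the abstract does not state them, and the function names are hypothetical.

```python
import math


def homozygosity(freqs):
    """Homozygosity J: the sum of squared allele frequencies."""
    return sum(p * p for p in freqs)


def homozygosity_bounds(M):
    """Bounds on J given the largest allele frequency M (assumed form).

    Upper bound: put frequency M on as many alleles as fit, and the
    remainder on one more allele. This vector majorizes every other
    frequency vector whose largest entry is M.
    Lower bound: M**2, approached (not attained) as the remaining mass
    1 - M is spread over ever more alleles.
    """
    if not 0 < M <= 1:
        raise ValueError("M must lie in (0, 1]")
    k = math.floor(1 / M)          # number of alleles that can carry frequency M
    remainder = 1 - k * M          # mass left for one further allele
    upper = k * M * M + remainder * remainder
    lower = M * M
    return lower, upper


# Hypothetical allele frequencies at one locus.
freqs = [0.5, 0.3, 0.2]
M = max(freqs)
J = homozygosity(freqs)
lower, upper = homozygosity_bounds(M)
print(f"M = {M}, J = {J:.2f}, bounds = ({lower:.2f}, {upper:.2f})")
```

For these frequencies J is about 0.38, which falls between the assumed bounds of 0.25 and 0.5 for M = 0.5.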

Subject(s)

Genetic Variation; Genetics, Population/statistics & numerical data; Models, Genetic; Computer Simulation; Gene Frequency; Homozygote; Humans; Mathematical Concepts; Microsatellite Repeats

A simple and flexible test of sample exchangeability with applications to statistical genomics.

Aw, Alan J; Spence, Jeffrey P; Song, Yun S.

Ann Appl Stat ; 18(1): 858-881, 2024 Mar.

Article in English | MEDLINE | ID: mdl-38784669

ABSTRACT

In scientific studies involving analyses of multivariate data, basic but important questions often arise for the researcher: Is the sample exchangeable, meaning that the joint distribution of the sample is invariant to the ordering of the units? Are the features independent of one another, or perhaps the features can be grouped so that the groups are mutually independent? In statistical genomics, these considerations are fundamental to downstream tasks such as demographic inference and the construction of polygenic risk scores. We propose a non-parametric approach, which we call the V test, to address these two questions, namely, a test of sample exchangeability given dependency structure of features, and a test of feature independence given sample exchangeability. Our test is conceptually simple, yet fast and flexible. It controls the Type I error across realistic scenarios, and handles data of arbitrary dimensions by leveraging large-sample asymptotics. Through extensive simulations and a comparison against unsupervised tests of stratification based on random matrix theory, we find that our test compares favorably in various scenarios of interest. We apply the test to data from the 1000 Genomes Project, demonstrating how it can be employed to assess exchangeability of the genetic sample, or find optimal linkage disequilibrium (LD) splits for downstream analysis. For exchangeability assessment, we find that removing rare variants can substantially increase the p-value of the test statistic. For optimal LD splitting, the V test reports different optimal splits than previous approaches not relying on hypothesis testing. Software for our methods is available in R (CRAN: flintyR) and Python (PyPI: flintyPy).
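The kind of question this abstract poses can be illustrated with a generic permutation test. The sketch below is not the V test and does not use the flintyR or flintyPy API. It assumes, purely for illustration, that the statistic is the variance of pairwise Hamming distances between samples and that the null hypothesis is exchangeable samples with mutually independent features, under which shuffling each feature column independently leaves the joint distribution unchanged. All function names are hypothetical.

```python
import random
from itertools import combinations
from statistics import pvariance


def pairwise_distance_variance(X):
    """Variance of Hamming distances over all pairs of rows (samples)."""
    dists = [sum(a != b for a, b in zip(r1, r2)) for r1, r2 in combinations(X, 2)]
    return pvariance(dists)


def permute_columns(X, rng):
    """Shuffle each feature column independently across samples."""
    cols = [list(c) for c in zip(*X)]
    for c in cols:
        rng.shuffle(c)
    return [list(r) for r in zip(*cols)]


def permutation_p_value(X, n_perm=200, seed=0):
    """One-sided permutation p-value for unusually large distance variance."""
    rng = random.Random(seed)
    observed = pairwise_distance_variance(X)
    exceed = sum(
        pairwise_distance_variance(permute_columns(X, rng)) >= observed
        for _ in range(n_perm)
    )
    return (1 + exceed) / (1 + n_perm)


# Toy stratified sample (made-up data): two groups that differ at every feature.
group_a = [[0] * 10 for _ in range(6)]
group_b = [[1] * 10 for _ in range(6)]
p_value = permutation_p_value(group_a + group_b)
print(f"permutation p-value: {p_value:.4f}")
```

With strongly stratified toy data such as this, pairwise distances are either very small or very large, so the observed variance far exceeds what column shuffling produces and the p-value is small. A resampling scheme like this one becomes expensive as the data grow; the abstract states that the authors' method instead uses large-sample asymptotics to handle data of arbitrary dimensions.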

GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction.

Benegas, Gonzalo; Albors, Carlos; Aw, Alan J; Ye, Chengzhong; Song, Yun S.

bioRxiv ; 2024 Apr 06.

Article in English | MEDLINE | ID: mdl-37873118

ABSTRACT

Whereas protein language models have demonstrated remarkable efficacy in predicting the effects of missense variants, DNA counterparts have not yet achieved a similar competitive edge for genome-wide variant effect predictions, especially in complex genomes such as that of humans. To address this challenge, we here introduce GPN-MSA, a novel framework for DNA language models that leverages whole-genome sequence alignments across multiple species and takes only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC, OMIM), experimental functional assays (DMS, DepMap), and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and non-coding variants.

Highly parameterized polygenic scores tend to overfit to population stratification via random effects.

Aw, Alan J; McRae, Jeremy; Rahmani, Elior; Song, Yun S.

bioRxiv ; 2024 Jan 29.

Article in English | MEDLINE | ID: mdl-38352303

ABSTRACT

Polygenic scores (PGSs), increasingly used in clinical settings, frequently include many genetic variants, with performance typically peaking at thousands of variants. Such highly parameterized PGSs often include variants that do not pass a genome-wide significance threshold. We propose a mathematical perspective that renders the effects of many of these non-significant variants random rather than causal, with the randomness capturing population structure. We devise methods to assess variant effect randomness and population stratification bias. Applying these methods to 141 traits from the UK Biobank, we find that, for many PGSs, the effects of non-significant variants are considerably random, with the extent of randomness associated with the degree of overfitting to population structure of the discovery cohort. Our findings explain why highly parameterized PGSs simultaneously have superior cohort-specific performance and limited generalizability, suggesting the critical need for variant randomness tests in PGS evaluation. Supporting code and a dashboard are available at https://github.com/songlab-cal/StratPGS.

Cultural hitchhiking and competition between patrilineal kin groups explain the post-Neolithic Y-chromosome bottleneck.

Zeng, Tian Chen; Aw, Alan J; Feldman, Marcus W.

Nat Commun ; 9(1): 2077, 2018 May 25.

Article in English | MEDLINE | ID: mdl-29802241

ABSTRACT

In human populations, changes in genetic variation are driven not only by genetic processes, but can also arise from cultural or social changes. An abrupt population bottleneck specific to human males has been inferred across several Old World (Africa, Europe, Asia) populations 5000-7000 BP. Here, bringing together anthropological theory, recent population genomic studies and mathematical models, we propose a sociocultural hypothesis, involving the formation of patrilineal kin groups and intergroup competition among these groups. Our analysis shows that this sociocultural hypothesis can explain the inference of a population bottleneck. We also show that our hypothesis is consistent with current findings from the archaeogenetics of Old World Eurasia, and is important for conceptions of cultural and social evolution in prehistory.

Subject(s)

Chromosomes, Human, Y/genetics; Cultural Characteristics; Genetic Variation/genetics; Hierarchy, Social; Models, Genetic; Africa; Asia; Computer Simulation; DNA, Mitochondrial/genetics; Europe; Haplotypes/genetics; Humans; Male; Population Dynamics
