1.
Ann Appl Stat ; 18(1): 858-881, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38784669

ABSTRACT

In scientific studies involving analyses of multivariate data, basic but important questions often arise for the researcher: Is the sample exchangeable, meaning that the joint distribution of the sample is invariant to the ordering of the units? Are the features independent of one another, or perhaps the features can be grouped so that the groups are mutually independent? In statistical genomics, these considerations are fundamental to downstream tasks such as demographic inference and the construction of polygenic risk scores. We propose a non-parametric approach, which we call the V test, to address these two questions, namely, a test of sample exchangeability given dependency structure of features, and a test of feature independence given sample exchangeability. Our test is conceptually simple, yet fast and flexible. It controls the Type I error across realistic scenarios, and handles data of arbitrary dimensions by leveraging large-sample asymptotics. Through extensive simulations and a comparison against unsupervised tests of stratification based on random matrix theory, we find that our test compares favorably in various scenarios of interest. We apply the test to data from the 1000 Genomes Project, demonstrating how it can be employed to assess exchangeability of the genetic sample, or find optimal linkage disequilibrium (LD) splits for downstream analysis. For exchangeability assessment, we find that removing rare variants can substantially increase the p-value of the test statistic. For optimal LD splitting, the V test reports different optimal splits than previous approaches not relying on hypothesis testing. Software for our methods is available in R (CRAN: flintyR) and Python (PyPI: flintyPy).

2.
bioRxiv ; 2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38352303

ABSTRACT

Polygenic scores (PGSs), increasingly used in clinical settings, frequently include many genetic variants, with performance typically peaking at thousands of variants. Such highly parameterized PGSs often include variants that do not pass a genome-wide significance threshold. We propose a mathematical perspective that renders the effects of many of these non-significant variants random rather than causal, with the randomness capturing population structure. We devise methods to assess variant effect randomness and population stratification bias. Applying these methods to 141 traits from the UK Biobank, we find that, for many PGSs, the effects of non-significant variants are considerably random, with the extent of randomness associated with the degree of overfitting to population structure of the discovery cohort. Our findings explain why highly parameterized PGSs simultaneously have superior cohort-specific performance and limited generalizability, suggesting the critical need for variant randomness tests in PGS evaluation. Supporting code and a dashboard are available at https://github.com/songlab-cal/StratPGS.

3.
bioRxiv ; 2024 Apr 06.
Article in English | MEDLINE | ID: mdl-37873118

ABSTRACT

Whereas protein language models have demonstrated remarkable efficacy in predicting the effects of missense variants, DNA counterparts have not yet achieved a similar competitive edge for genome-wide variant effect predictions, especially in complex genomes such as that of humans. To address this challenge, we here introduce GPN-MSA, a novel framework for DNA language models that leverages whole-genome sequence alignments across multiple species and takes only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC, OMIM), experimental functional assays (DMS, DepMap), and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and non-coding variants.

4.
Nat Commun ; 9(1): 2077, 2018 05 25.
Article in English | MEDLINE | ID: mdl-29802241

ABSTRACT

In human populations, changes in genetic variation are driven not only by genetic processes, but can also arise from cultural or social changes. An abrupt population bottleneck specific to human males has been inferred across several Old World (Africa, Europe, Asia) populations 5000-7000 BP. Here, bringing together anthropological theory, recent population genomic studies and mathematical models, we propose a sociocultural hypothesis, involving the formation of patrilineal kin groups and intergroup competition among these groups. Our analysis shows that this sociocultural hypothesis can explain the inference of a population bottleneck. We also show that our hypothesis is consistent with current findings from the archaeogenetics of Old World Eurasia, and is important for conceptions of cultural and social evolution in prehistory.


Subjects
Chromosomes, Human, Y/genetics , Cultural Characteristics , Genetic Variation/genetics , Hierarchy, Social , Models, Genetic , Africa , Asia , Computer Simulation , DNA, Mitochondrial/genetics , Europe , Haplotypes/genetics , Humans , Male , Population Dynamics
5.
J Math Biol ; 77(3): 711-737, 2018 09.
Article in English | MEDLINE | ID: mdl-29569105

ABSTRACT

The homozygosity and the frequency of the most frequent allele at a polymorphic genetic locus have a close mathematical relationship, so that each quantity places a tight constraint on the other. We use the theory of majorization to provide a simplified derivation of the bounds on homozygosity J in terms of the frequency M of the most frequent allele. The method not only enables simpler derivations of known bounds on J in terms of M, it also produces analogous bounds on entropy statistics for genetic diversity and on homozygosity-like statistics that range in their emphasis on the most frequent allele in relation to other alleles. We illustrate the constraints on the statistics using data from human populations. The approach suggests the potential of the majorization method as a tool for deriving inequalities that characterize mathematical relationships between statistics in population genetics.
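The constraint described in this abstract can be illustrated with a minimal sketch. It assumes the standard bounds for an unrestricted number of alleles: J is strictly greater than M^2 whenever M < 1, and J is at most the value obtained by placing as many alleles as possible at frequency M and giving the remaining mass to one further allele. The function names are hypothetical and are not taken from the paper or from any published software; this is not the authors' majorization derivation, only a numerical check of the resulting inequality.

```python
import math


def homozygosity(freqs):
    """Return J, the sum of squared allele frequencies at one locus."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return sum(p * p for p in freqs)


def homozygosity_bounds(m):
    """Return (lower, upper) bounds on J given the largest allele frequency m.

    Assumes the number of alleles is not fixed in advance.
    lower: m**2, approached but not reached (for m < 1) as the remaining
           mass 1 - m is spread over more and more alleles.
    upper: attained with ceil(1/m) - 1 alleles at frequency m and one
           allele carrying the remaining mass.
    """
    if not 0.0 < m <= 1.0:
        raise ValueError("m must lie in (0, 1]")
    k = math.ceil(1.0 / m)           # smallest allele count compatible with m
    remainder = 1.0 - (k - 1) * m    # frequency of the last allele
    upper = (k - 1) * m * m + remainder * remainder
    return m * m, upper


# Example: three alleles with frequencies 0.5, 0.3, 0.2
freqs = [0.5, 0.3, 0.2]
j = homozygosity(freqs)
lower, upper = homozygosity_bounds(max(freqs))
print(f"J = {j:.4f}, bounds given M: ({lower:.4f}, {upper:.4f}]")
```

The upper-bound configuration is the frequency vector that majorizes every other vector with the same largest entry, which is why a Schur-convex statistic such as the sum of squares is maximized there.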


Subjects
Genetic Variation , Genetics, Population/statistics & numerical data , Models, Genetic , Computer Simulation , Gene Frequency , Homozygote , Humans , Mathematical Concepts , Microsatellite Repeats