Results 1 - 17 of 17
1.
bioRxiv ; 2024 Apr 28.
Article in English | MEDLINE | ID: mdl-38712040

ABSTRACT

Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. Current practice is to store large-scale genetic polymorphism data using tabular data structures and file formats, where rows and columns represent samples and genetic variants. However, encoding genetic data in such formats has become unsustainable. For example, the UK Biobank polymorphism data of 200,000 phased whole genomes has exceeded 350 terabytes (TB) in Variant Call Format (VCF), too large to fit into hard drives in uncompressed form. To mitigate the computational burden, we introduce the Genotype Representation Graph (GRG), an extremely compact data structure to losslessly represent phased whole-genome polymorphisms. A GRG is a fully connected hierarchical graph that exploits variant-sharing across samples, drawing on ideas inspired by Ancestral Recombination Graphs. Capturing variant-sharing in a graph format compresses biobank-scale data to the point where it can fit in a typical server's RAM (5-26GB per chromosome), and enables graph-traversal algorithms to trivially reuse computed values, both of which can significantly reduce computation time. We have developed a command-line tool and a library usable via both C++ and Python for constructing and processing GRG files which scales to a million whole genomes. It takes 160GB disk space to encode the information in 200,000 UK Biobank phased whole genomes as a GRG, more than 2000 times smaller than the size of the VCF. Moreover, the size of GRG increases sublinearly with the number of samples stored, making it a sustainable solution to the increasing number of samples in large datasets. We show that summaries of genetic variants can be computed on GRG via graph traversal that runs 230 times faster than on VCF. We anticipate that GRG-based algorithms will improve the scalability of various types of computation and generally lower the cost of analyzing large genomic datasets.
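
The idea of reusing computed values during graph traversal can be illustrated with a toy example. This is only a sketch and not the GRG library's API: the graph, the node names, and the functions `samples_beneath` and `allele_counts` are all hypothetical, and the code assumes that each sample is reachable from any node by at most one path, so that counts from child nodes can simply be summed.

```python
from functools import lru_cache

# Hypothetical toy graph: each node lists its children; nodes with no
# children are sample haplotypes. Here the graph is a tree, so the
# one-path-per-sample assumption holds trivially.
CHILDREN = {
    "root": ["a", "b"],
    "a": ["s0", "s1"],
    "b": ["c", "s4"],
    "c": ["s2", "s3"],
    "s0": [], "s1": [], "s2": [], "s3": [], "s4": [],
}

# Hypothetical mapping from each mutation to the node that carries it.
MUTATION_NODE = {"m1": "a", "m2": "c", "m3": "b"}


@lru_cache(maxsize=None)
def samples_beneath(node):
    """Count the samples below a node, caching each node's result."""
    children = CHILDREN[node]
    if not children:
        return 1  # a leaf is a single sample haplotype
    return sum(samples_beneath(child) for child in children)


def allele_counts():
    """Return the number of samples carrying each mutation."""
    return {m: samples_beneath(n) for m, n in MUTATION_NODE.items()}


print(allele_counts())
```

The count for node `c` is computed once and reused when node `b` is visited, which is the kind of sharing the abstract attributes to graph-traversal algorithms.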

2.
bioRxiv ; 2023 May 18.
Article in English | MEDLINE | ID: mdl-37292742

ABSTRACT

Population genetic studies often rely on artificial genomes (AGs) simulated by generative models of genetic data. In recent years, unsupervised learning models, based on hidden Markov models, deep generative adversarial networks, restricted Boltzmann machines, and variational autoencoders, have gained popularity due to their ability to generate AGs closely resembling empirical data. These models, however, present a tradeoff between expressivity and tractability. Here, we propose to use hidden Chow-Liu trees (HCLTs) and their representation as probabilistic circuits (PCs) as a solution to this tradeoff. We first learn an HCLT structure that captures the long-range dependencies among SNPs in the training data set. We then convert the HCLT to its equivalent PC as a means of supporting tractable and efficient probabilistic inference. The parameters in these PCs are inferred with an expectation-maximization algorithm using the training data. Compared to other models for generating AGs, HCLT obtains the largest log-likelihood on test genomes across SNPs chosen across the genome and from a contiguous genomic region. Moreover, the AGs generated by HCLT more accurately resemble the source data set in their patterns of allele frequencies, linkage disequilibrium, pairwise haplotype distances, and population structure. This work not only presents a new and robust AG simulator but also manifests the potential of PCs in population genetics.

3.
Elife ; 12, 2023 03 20.
Article in English | MEDLINE | ID: mdl-36939312

ABSTRACT

The genetic variants introduced into the ancestors of modern humans from interbreeding with Neanderthals have been suggested to contribute to complex human traits to an unexpected extent. However, testing this hypothesis has been challenging due to the idiosyncratic population genetic properties of introgressed variants. We developed rigorous methods to assess the contribution of introgressed Neanderthal variants to heritable trait variation and applied these methods to analyze 235,592 introgressed Neanderthal variants and 96 distinct phenotypes measured in about 300,000 unrelated white British individuals in the UK Biobank. Introgressed Neanderthal variants make a significant contribution to trait variation (explaining 0.12% of trait variation on average). However, the contribution of introgressed variants tends to be significantly depleted relative to modern human variants matched for allele frequency and linkage disequilibrium (about 59% depletion on average), consistent with purifying selection on introgressed variants. In contrast to previous studies (McArthur et al., 2021), we find no evidence for elevated heritability across the phenotypes examined. We identified 348 independent significant associations of introgressed Neanderthal variants with 64 phenotypes. Previous work (Skov et al., 2020) has suggested that a majority of such associations are likely driven by statistical association with nearby modern human variants that are the true causal variants. Applying a customized fine-mapping approach led us to identify 112 regions across 47 phenotypes containing 4303 unique genetic variants where introgressed variants are highly likely to have a phenotypic effect. Examination of these variants reveals their substantial impact on genes that are important for the immune system, development, and metabolism.


Subject(s)
Hominidae , Neanderthals , Animals , Humans , Neanderthals/genetics , Multifactorial Inheritance , Hominidae/genetics , Gene Frequency , Genetics, Population , Genome, Human
4.
Genetics ; 221(1), 2022 05 05.
Article in English | MEDLINE | ID: mdl-35333304

ABSTRACT

The ancestral recombination graph is a structure that describes the joint genealogies of sampled DNA sequences along the genome. Recent computational methods have made impressive progress toward scalably estimating whole-genome genealogies. In addition to inferring the ancestral recombination graph, some of these methods can also provide ancestral recombination graphs sampled from a defined posterior distribution. Obtaining good samples of ancestral recombination graphs is crucial for quantifying statistical uncertainty and for estimating population genetic parameters such as effective population size, mutation rate, and allele age. Here, we use standard neutral coalescent simulations to benchmark the estimates of pairwise coalescence times from 3 popular ancestral recombination graph inference programs: ARGweaver, Relate, and tsinfer+tsdate. We compare (1) the true coalescence times to the inferred times at each locus; (2) the distribution of coalescence times across all loci to the expected exponential distribution; (3) whether the sampled coalescence times have the properties expected of a valid posterior distribution. We find that inferred coalescence times at each locus are most accurate in ARGweaver, and often more accurate in Relate than in tsinfer+tsdate. However, all 3 methods tend to overestimate small coalescence times and underestimate large ones. Lastly, the posterior distribution of ARGweaver is closer to the expected posterior distribution than Relate's, but this higher accuracy comes at a substantial trade-off in scalability. The best choice of method will depend on the number and length of input sequences and on the goal of downstream analyses, and we provide guidelines for the best practices.


Subject(s)
Models, Genetic , Recombination, Genetic , Algorithms , Alleles , Genetics, Population , Population Density
5.
Elife ; 9, 2020 10 01.
Article in English | MEDLINE | ID: mdl-33001029

ABSTRACT

Understanding the emergence of novel viruses requires an accurate and comprehensive annotation of their genomes. Overlapping genes (OLGs) are common in viruses and have been associated with pandemics but are still widely overlooked. We identify and characterize ORF3d, a novel OLG in SARS-CoV-2 that is also present in Guangxi pangolin-CoVs but not other closely related pangolin-CoVs or bat-CoVs. We then document evidence of ORF3d translation, characterize its protein sequence, and conduct an evolutionary analysis at three levels: between taxa (21 members of Severe acute respiratory syndrome-related coronavirus), between human hosts (3978 SARS-CoV-2 consensus sequences), and within human hosts (401 deeply sequenced SARS-CoV-2 samples). ORF3d has been independently identified and shown to elicit a strong antibody response in COVID-19 patients. However, it has been misclassified as the unrelated gene ORF3b, leading to confusion. Our results liken ORF3d to other accessory genes in emerging viruses and highlight the importance of OLGs.


Subject(s)
Betacoronavirus/genetics , Coronavirus Infections/virology , Evolution, Molecular , Genes, Overlapping , Genes, Viral , Host Specificity/genetics , Open Reading Frames/genetics , Pandemics , Pneumonia, Viral/virology , Viral Proteins/genetics , Amino Acid Sequence , Animals , Antibodies, Viral/immunology , Antibody Specificity , Antigens, Viral/biosynthesis , Antigens, Viral/genetics , Antigens, Viral/immunology , Betacoronavirus/pathogenicity , Betacoronavirus/physiology , COVID-19 , China/epidemiology , Chiroptera/virology , Coronavirus/genetics , Coronavirus Infections/epidemiology , Epitopes/genetics , Epitopes/immunology , Europe/epidemiology , Eutheria/virology , Gene Expression Regulation, Viral , Genetic Variation , Haplotypes/genetics , Humans , Models, Molecular , Mutation , Phylogeny , Pneumonia, Viral/epidemiology , Protein Biosynthesis , Protein Conformation , RNA, Viral/genetics , SARS-CoV-2 , Sequence Alignment , Sequence Homology, Nucleic Acid , Viral Proteins/immunology
6.
Mol Biol Evol ; 37(8): 2440-2449, 2020 08 01.
Article in English | MEDLINE | ID: mdl-32243542

ABSTRACT

Purifying (negative) natural selection is a hallmark of functional biological sequences, and can be detected in protein-coding genes using the ratio of nonsynonymous to synonymous substitutions per site (dN/dS). However, when two genes overlap the same nucleotide sites in different frames, synonymous changes in one gene may be nonsynonymous in the other, perturbing dN/dS. Thus, scalable methods are needed to estimate functional constraint specifically for overlapping genes (OLGs). We propose OLGenie, which implements a modification of the Wei-Zhang method. Assessment with simulations and controls from viral genomes (58 OLGs and 176 non-OLGs) demonstrates low false-positive rates and good discriminatory ability in differentiating true OLGs from non-OLGs. We also apply OLGenie to the unresolved case of HIV-1's putative antisense protein gene, showing significant purifying selection. OLGenie can be used to study known OLGs and to predict new OLGs in genome annotation. Software and example data are freely available at https://github.com/chasewnelson/OLGenie (last accessed April 10, 2020).


Subject(s)
Genes, Overlapping , Genetic Techniques , Selection, Genetic , Silent Mutation , Software , HIV-1/genetics
8.
Nat Med ; 25(11): 1796, 2019 11.
Article in English | MEDLINE | ID: mdl-31595084

ABSTRACT

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

9.
Nat Med ; 25(6): 909-910, 2019 06.
Article in English | MEDLINE | ID: mdl-31160814

ABSTRACT

We use the genotyping and death register information of 409,693 individuals of British ancestry to investigate fitness effects of the CCR5-∆32 mutation. We estimate a 21% increase in the all-cause mortality rate in individuals who are homozygous for the ∆32 allele. A deleterious effect of the ∆32/∆32 mutation is also independently supported by a significant deviation from the Hardy-Weinberg equilibrium (HWE) due to a deficiency of ∆32/∆32 individuals at the time of recruitment.
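
A deviation from Hardy-Weinberg equilibrium of the kind mentioned above can be checked with a standard chi-square comparison of observed and expected genotype counts. The sketch below is a generic textbook calculation, not the authors' actual test; the function name `hwe_chi_square` and the genotype counts in the usage lines are purely illustrative and are not taken from the study.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Return (expected_counts, chi_square) for one biallelic locus.

    n_aa, n_ab, n_bb are observed genotype counts (illustrative inputs).
    Both alleles must be present, otherwise an expected count is zero.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, chi2


# Illustrative counts only: the second case has fewer BB homozygotes
# than Hardy-Weinberg proportions predict.
_, chi2_balanced = hwe_chi_square(25, 50, 25)
_, chi2_deficient = hwe_chi_square(800, 200, 0)
print(chi2_balanced, chi2_deficient)
```

A larger statistic indicates a stronger departure from the expected proportions; with one degree of freedom it would normally be compared against a chi-square distribution.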


Subject(s)
Homozygote , Mutation , Receptors, CCR5/genetics , Adult , Aged , Databases, Genetic/statistics & numerical data , Female , Genetic Fitness , HIV Infections/prevention & control , Humans , Life Expectancy , Male , Middle Aged , Mortality , Registries/statistics & numerical data , Survival Rate , United Kingdom/epidemiology
10.
Mol Biol Evol ; 36(5): 1008-1021, 2019 05 01.
Article in English | MEDLINE | ID: mdl-30903691

ABSTRACT

Diminishing returns epistasis makes the benefit of the same advantageous mutation smaller in fitter genotypes and is frequently observed in experimental evolution. However, its occurrence in other contexts, environment dependence, and mechanistic basis are unclear. Here, we address these questions using 1,005 sequenced segregants generated from a yeast cross. Under each of 47 examined environments, 66-92% of tested polymorphisms exhibit diminishing returns epistasis. Surprisingly, improving environment quality also reduces the benefits of advantageous mutations even when fitness is controlled for, indicating the necessity to revise the global epistasis hypothesis. We propose that diminishing returns originates from the modular organization of life where the contribution of each functional module to fitness is determined jointly by the genotype and environment and has an upper limit, and demonstrate that our model predictions match empirical observations. These findings broaden the concept of diminishing returns epistasis, reveal its generality and potential cause, and have important evolutionary implications.


Subject(s)
Biological Evolution , Epistasis, Genetic , Mutation , Environment , Gene-Environment Interaction , Saccharomyces cerevisiae
11.
PLoS Biol ; 17(1): e3000121, 2019 01.
Article in English | MEDLINE | ID: mdl-30682014

ABSTRACT

Maximum growth rate per individual (r) and carrying capacity (K) are key life-history traits that together characterize the density-dependent population growth and therefore are crucial parameters of many ecological and evolutionary theories such as r/K selection. Although r and K are generally thought to correlate inversely, both r/K tradeoffs and trade-ups have been observed. Nonetheless, neither the conditions under which each of these relationships occurs nor the causes of these relationships are fully understood. Here, we address these questions using yeast as a model system. We estimated r and K using the growth curves of over 7,000 yeast recombinants in nine environments and found that the r-K correlation among genotypes changes from 0.53 to -0.52 with the rise of environment quality, measured by the mean r of all genotypes in the environment. We separately mapped quantitative trait loci (QTLs) for r and K in each environment. Many QTLs simultaneously influence r and K, but the directions of their effects are environment dependent such that QTLs tend to show concordant effects on the two traits in poor environments but antagonistic effects in rich environments. We propose that these contrasting trends are generated by the relative impacts of two factors (the tradeoff between the speed and efficiency of ATP production, and the energetic cost of cell maintenance relative to reproduction) and demonstrate an agreement between model predictions and empirical observations. These results reveal and explain the complex environment dependency of the r-K relationship, which bears on many ecological and evolutionary phenomena and has biomedical implications.
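
The roles of r and K in density-dependent growth can be made concrete with the classic logistic model, dN/dt = rN(1 - N/K). The abstract does not state which growth model was fitted, so the logistic form is an assumption here, and the function name `logistic_size` and the parameter values are illustrative only.

```python
import math


def logistic_size(t, r, K, n0):
    """Population size at time t under logistic growth (assumed model).

    r  : maximum per-individual growth rate
    K  : carrying capacity
    n0 : initial population size (must be positive)
    """
    return K / (1.0 + (K - n0) / n0 * math.exp(-r * t))


# Illustrative values: growth starts at n0 and saturates at K.
for t in (0, 5, 10, 20, 40):
    print(t, round(logistic_size(t, r=0.5, K=100.0, n0=1.0), 3))
```

Under this model r governs the early, nearly exponential phase and K sets the plateau, which is why the two parameters can be estimated from a single growth curve.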


Subject(s)
Population Density , Yeasts/growth & development , Biological Evolution , Conservation of Natural Resources/methods , Gene-Environment Interaction , Genetic Pleiotropy/genetics , Genotype , Models, Biological , Mutation/genetics , Phenotype , Population Growth , Quantitative Trait Loci , Reproduction/genetics , Saccharomyces cerevisiae/growth & development , Saccharomyces cerevisiae/metabolism , Yeasts/genetics
12.
Sci Adv ; 4(11): eaau5518, 2018 11.
Article in English | MEDLINE | ID: mdl-30417098

ABSTRACT

Theory predicts that the fitness of an individual is maximized when the genetic distance between its parents (i.e., mating distance) is neither too small nor too large. However, decades of research have generally failed to validate this prediction or identify the optimal mating distance (OMD). Separately analyzing large numbers of crosses of fungal, plant, and animal model organisms, we indeed find the hybrid phenotypic value to be a humped quadratic polynomial function of the mating distance for the vast majority of fitness-related traits examined, with different traits of the same species exhibiting similar OMDs. OMDs are generally slightly greater than the nucleotide diversities of the species concerned but smaller than the observed maximal intraspecific genetic distances. Hence, the benefit of heterosis is at least partially offset by the harm of genetic incompatibility even within species. These results have multiple theoretical and practical implications for speciation, conservation, and agriculture.


Subject(s)
Arabidopsis/genetics , Genetic Speciation , Hybrid Vigor , Models, Theoretical , Saccharomyces cerevisiae/genetics , Animals , Arabidopsis/growth & development , Female , Genetics, Population , Mice , Phenotype , Reproduction , Saccharomyces cerevisiae/growth & development , Species Specificity
13.
Genome Biol Evol ; 10(8): 2010-2016, 2018 08 01.
Article in English | MEDLINE | ID: mdl-30059996

ABSTRACT

Ribosomes are highly abundant in cells and comprise, besides RNAs of varying lengths, 55-80 similarly sized, short proteins. This seemingly unusual composition is thought to have resulted from selection for rapid autocatalytic ribosome production. Here, we demonstrate that ribosomal protein-splitting mutations cannot accelerate ribosome production. The autocatalytic explanation is also unnecessary, because protein lengths generally decline with expression levels. Although ribosomal proteins are shorter than expected from their expression levels, they are not outliers among members of large protein complexes in mean protein length or coefficient of variation. These observations are explainable because 1) shortening proteins lowers their synthetic cost and reduces the waste from mistranslation-induced protein dysfunction and degradation, 2) such benefits rise with expression levels, and 3) members of large complexes participate in more protein-protein interactions so are less tolerant to mistranslation. These and other considerations suggest that the compositional features of ribosomes originate from cellular energy economics.


Subject(s)
Ribosomes/metabolism , Catalysis , Gene Expression Regulation , RNA, Messenger/genetics , RNA, Messenger/metabolism , RNA, Ribosomal/genetics , Ribosomal Proteins/chemistry , Ribosomal Proteins/genetics , Ribosomal Proteins/metabolism
14.
Genome Biol Evol ; 9(12): 3509-3515, 2017 12 01.
Article in English | MEDLINE | ID: mdl-29228219

ABSTRACT

Robustness and evolvability are fundamental characteristics of life whose relationship has intrigued generations of biologists. Studies of several genotype-phenotype maps (GPMs) such as the map between short DNA sequences and their bindings to transcription factors showed that phenotype robustness (PR) promotes phenotype evolvability (PE), but the underlying reason is unclear. Here, we show mathematically that the expected PE is a monotonically increasing function of the expected PR in random GPMs. Population genetic simulations confirm that increasing PR raises the probability that a target phenotype appears in a population within a given time, under empirical as well as randomly rewired GPMs. These and other results demonstrate that the positive correlation between PR and PE is mathematical rather than biological. Hence, it is unsurprising to observe this correlation in every empirical GPM investigated, although the magnitude of the correlation may vary due to influences of various biological factors.


Subject(s)
Evolution, Molecular , Genetics, Population , Models, Genetic , Phenotype , Genotype , Humans , Mutation , Transcription Factors/genetics , Transcription Factors/metabolism
15.
Genetics ; 205(2): 925-937, 2017 02.
Article in English | MEDLINE | ID: mdl-27903611

ABSTRACT

Gene-environment interaction (G×E) refers to the phenomenon that the same mutation has different phenotypic effects in different environments. Although quantitative trait loci (QTLs) exhibiting G×E have been reported, little is known about the general properties of G×E, and those of its underlying QTLs. Here, we use the genotypes of 1005 segregants from a cross between two Saccharomyces cerevisiae strains, and the growth rates of these segregants in 47 environments, to identify growth rate QTLs (gQTLs) in each environment, and QTLs that have different growth effects in each pair of environments (g×eQTLs). The average number of g×eQTLs identified between two environments is 0.58 times the number of unique gQTLs identified in these environments, revealing a high abundance of G×E. Eighty-seven percent of g×eQTLs belong to gQTLs, supporting the practice of identifying g×eQTLs from gQTLs. Most g×eQTLs identified from gQTLs have concordant effects between environments, but, as the effect size of a mutation in one environment grows, the probability of antagonism in the other environment increases. Antagonistic g×eQTLs are enriched in dissimilar environments. Relative to gQTLs, g×eQTLs tend to occur at intronic and synonymous sites. The gene ontology (GO) distributions of gQTLs and g×eQTLs are significantly different, as are those of antagonistic and concordant g×eQTLs. Simulations based on the yeast data showed that ignoring G×E causes substantial missing heritability. Together, our findings reveal the genomic architecture of G×E in yeast growth, and demonstrate the importance of G×E in explaining phenotypic variation and missing heritability.


Subject(s)
Gene-Environment Interaction , Genome, Fungal , Polymorphism, Genetic , Saccharomyces cerevisiae/genetics , Quantitative Trait Loci , Saccharomyces cerevisiae/growth & development
16.
J Theor Biol ; 365: 486-95, 2015 Jan 21.
Article in English | MEDLINE | ID: mdl-25451534

ABSTRACT

The way population size, population structure (with migration), and spatially dependent selection (where there is no globally optimal allele) combine to affect the substitution rate is poorly understood. Here, we consider a two patch model where mutant alleles are beneficial in one patch and deleterious in the other patch. We assume that the spatial average of selection on mutant alleles is zero. We take each patch to maintain a finite number of N adults each generation, hence random genetic drift can independently occur in each patch. We show that the principal way the population size, N, when large, affects the substitution rate, R∞, is through its dependence on two composite parameters. These are the scaled migration rate M (∝ population size × migration rate), and the scaled selection intensity S (∝ population size × beneficial effect of a mutant). Any relation between S and M that arises for ecological/evolutionary reasons can strongly influence the way the substitution rate, R∞, depends on the population size, N. In the simplest situation, both M and S are proportional to N, and this is shown to lead to R∞ increasing with N when S is not large. The behaviour, that R∞ increases with N, is not inevitable; a more complex relation between S and M can lead to the opposite or other behaviours. In particular, let us assume that dM/dN is positive, as would occur if the migration rate were constant, S is not large, and S depends on M (i.e., S=S(M)). We then find that if S(M) > ((1+M)/(1+2M)) S(0) then the substitution rate, R∞, increases with N, but if S(M) < ((1+M)/(1+2M)) S(0) then R∞ decreases with N.
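
The closing condition of this abstract can be turned into a small worked check. The sketch below reads the flattened fraction in the record as (1+M)/(1+2M), which is an assumption about the original typesetting; the function names `threshold_ratio` and `substitution_rate_trend` are hypothetical and the input values are illustrative.

```python
def threshold_ratio(M):
    """Ratio (1 + M) / (1 + 2M) that multiplies S(0) in the condition."""
    return (1.0 + M) / (1.0 + 2.0 * M)


def substitution_rate_trend(S_M, S_0, M):
    """Say how the substitution rate changes with N under the condition.

    S_M : scaled selection intensity at scaled migration rate M
    S_0 : scaled selection intensity at M = 0
    """
    bound = threshold_ratio(M) * S_0
    if S_M > bound:
        return "increases"
    if S_M < bound:
        return "decreases"
    return "boundary"


# Illustrative values: the ratio falls from 1 at M = 0 toward 1/2.
for M in (0.0, 1.0, 10.0):
    print(M, round(threshold_ratio(M), 4))
print(substitution_rate_trend(S_M=1.0, S_0=1.0, M=1.0))
```

Under this reading, a selection intensity that stays constant as migration rises always lies above the bound for M > 0, so the substitution rate would increase with N.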


Subject(s)
Biological Evolution , Population Dynamics , Models, Biological , Selection, Genetic
17.
Genome Biol Evol ; 7(1): 381-90, 2014 Dec 31.
Article in English | MEDLINE | ID: mdl-25552532

ABSTRACT

Overlapping genes, where one DNA sequence codes for two proteins with different reading frames, are not uncommon in viruses and cellular organisms. Estimating the direction and strength of natural selection acting on overlapping genes is important for understanding their functionality, origin, evolution, maintenance, and potential interaction. However, the standard methods for estimating synonymous (dS) and nonsynonymous (dN) nucleotide substitution rates are inapplicable here because a nucleotide change can be simultaneously synonymous and nonsynonymous when both reading frames involved are considered. We have developed a simple method that can estimate dN/dS and test for the action of natural selection in each relevant reading frame of the overlapping genes. Our method is an extension of the modified Nei-Gojobori method previously developed for nonoverlapping genes. We confirmed the reliability of our method using extensive computer simulation. Applying this method, we studied the longest human sense-antisense overlapping gene pair, LRRC8E and ENSG00000214248. Although LRRC8E (leucine-rich repeat containing eight family, member E) is known to regulate cell size, the function of ENSG00000214248 is unknown. Our analysis revealed purifying selection on ENSG00000214248 and suggested that it originated in the common ancestor of bony vertebrates.


Subject(s)
Base Sequence/genetics , Evolution, Molecular , Genes, Overlapping , Selection, Genetic/genetics , Amino Acid Substitution/genetics , Animals , Computer Simulation , Humans , Open Reading Frames , Phylogeny