ABSTRACT
Understanding the contribution of gene-environment interactions (GxE) to complex trait variation can provide insights into disease mechanisms, explain sources of heritability, and improve genetic risk prediction. While large biobanks with genetic and deep phenotypic data hold promise for obtaining novel insights into GxE, our understanding of GxE architecture in complex traits remains limited. We introduce a method to estimate the proportion of trait variance explained by GxE (GxE heritability) and additive genetic effects (additive heritability) across the genome and within specific genomic annotations. We show that our method is accurate in simulations and computationally efficient for biobank-scale datasets. We applied our method to common array SNPs (MAF ≥1%), fifty quantitative traits, and four environmental variables (smoking, sex, age, and statin usage) in unrelated white British individuals in the UK Biobank. We found 68 trait-E pairs with significant genome-wide GxE heritability (p < 0.05/200), with a ratio of GxE to additive heritability of ≈6.8% on average. Analyzing ≈8 million imputed SNPs (MAF ≥0.1%), we documented an approximate 28% increase in genome-wide GxE heritability compared to array SNPs. We partitioned GxE heritability across minor allele frequency (MAF) and local linkage disequilibrium (LD) values, revealing that, like additive allelic effects, GxE allelic effects tend to increase with decreasing MAF and LD. Analyzing GxE heritability near genes highly expressed in specific tissues, we find significant brain-specific enrichment for body mass index (BMI) and basal metabolic rate in the context of smoking and adipose-specific enrichment for waist-hip ratio (WHR) in the context of sex.
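The variance decomposition this abstract estimates can be sketched in a toy simulation. Everything below is illustrative: Gaussian stand-ins for genotypes, a hypothetical binary environment, and arbitrarily chosen target fractions; it is not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5000, 300
G = rng.standard_normal((n, m))        # standardized stand-ins for genotypes
E = rng.choice([0.0, 1.0], size=n)     # binary environment (e.g. smoking)
E = (E - E.mean()) / E.std()

h2_add, h2_gxe = 0.30, 0.03            # target variance fractions (toy values)
beta = rng.standard_normal(m) * np.sqrt(h2_add / m)   # additive effects
gamma = rng.standard_normal(m) * np.sqrt(h2_gxe / m)  # GxE effects

g = G @ beta                 # additive genetic value
gxe = E * (G @ gamma)        # genetic effect modulated by the environment
y = g + gxe + rng.standard_normal(n) * np.sqrt(1 - h2_add - h2_gxe)

frac_add = g.var() / y.var()    # empirical additive fraction, near 0.30
frac_gxe = gxe.var() / y.var()  # empirical GxE fraction, near 0.03
```

With these sample sizes, the empirical fractions recover the targets up to sampling noise, illustrating what "GxE heritability" measures.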
Subjects
Gene-Environment Interaction, Genome-Wide Association Study, Multifactorial Inheritance, Single Nucleotide Polymorphism, Humans, Multifactorial Inheritance/genetics, Male, Female, Heritable Quantitative Trait, Phenotype, Genetic Models, Quantitative Trait Loci
ABSTRACT
BACKGROUND: Polygenic risk score (PRS), calculated based on genome-wide association studies (GWASs), can improve breast cancer (BC) risk assessment. To date, most BC GWASs have been performed in individuals of European (EUR) ancestry, and the generalisation of EUR-based PRS to other populations is a major challenge. In this study, we examined the performance of EUR-based BC PRS models in Ashkenazi Jewish (AJ) women. METHODS: We generated PRSs based on data on EUR women from the Breast Cancer Association Consortium (BCAC). We tested the performance of the PRSs in a cohort of 2161 AJ women from Israel (1437 cases and 724 controls) from BCAC (BCAC cohort from Israel (BCAC-IL)). In addition, we tested the performance of these EUR-based BC PRSs, as well as the established 313-SNP EUR BC PRS, in an independent cohort of 181 AJ women from Hadassah Medical Center (HMC) in Israel. RESULTS: In the BCAC-IL cohort, the highest OR per 1 SD was 1.56 (±0.09). The OR for AJ women at the top 10% of the PRS distribution compared with the middle quintile was 2.10 (±0.24). In the HMC cohort, the OR per 1 SD of the EUR-based PRS that performed best in the BCAC-IL cohort was 1.58 (±0.27). The OR per 1 SD of the commonly used 313-SNP BC PRS was 1.64 (±0.28). CONCLUSIONS: Extant EUR GWAS data can be used for generating PRSs that identify AJ women with markedly elevated risk of BC and therefore hold promise for improving BC risk assessment in AJ women.
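As background for the reported odds ratios, a PRS is a weighted sum of risk-allele counts, standardized so that effects are quoted per 1 SD of the score. A minimal sketch with hypothetical SNP weights and toy genotypes (not the BCAC-derived scores):

```python
from statistics import mean, pstdev

# per-SNP weights (log odds ratios from a GWAS; values are hypothetical)
weights = [0.05, -0.03, 0.10, 0.02]

# genotypes: one row per individual, entries are risk-allele counts (0/1/2)
genotypes = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 2, 0, 0],
    [1, 1, 1, 1],
]

def prs(g, w):
    """Raw polygenic risk score: weighted sum of risk-allele counts."""
    return sum(gi * wi for gi, wi in zip(g, w))

raw = [prs(g, weights) for g in genotypes]
mu, sd = mean(raw), pstdev(raw)
standardized = [(s - mu) / sd for s in raw]
# an OR "per 1 SD" of 1.56 then means: each unit increase of the
# standardized score multiplies the odds of disease by 1.56
```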
Subjects
Breast Neoplasms, Humans, Female, Breast Neoplasms/epidemiology, Breast Neoplasms/genetics, Genome-Wide Association Study, Jews/genetics, Israel/epidemiology, Genetic Predisposition to Disease, Risk Factors, Multifactorial Inheritance/genetics, Transcription Factors
ABSTRACT
A central goal in designing clinical trials is to find the test that maximizes power (or equivalently minimizes required sample size) for finding a false null hypothesis subject to the constraint of type I error. When there is more than one test, such as in clinical trials with multiple endpoints, the issues of optimal design and optimal procedures become more complex. In this paper, we address the question of how such optimal tests should be defined and how they can be found. We review different notions of power and how they relate to study goals, and also consider the requirements of type I error control and the nature of the procedures. This leads us to an explicit optimization problem with objective and constraints that describe its specific desiderata. We present a complete solution for deriving optimal procedures for two hypotheses; the resulting procedures have desirable monotonicity properties and are computationally simple. For some of the optimization formulations this yields optimal procedures that are identical to existing procedures, such as Hommel's procedure or the procedure of Bittman et al. (2009), while for other cases it yields completely novel and more powerful procedures than existing ones. We demonstrate the nature of our novel procedures and their improved power extensively in simulations and on the APEX study (Cohen et al., 2016).
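For two hypotheses, one of the existing procedures mentioned here, Hommel's, coincides with Hochberg's step-up rule, which can be stated in a few lines. A sketch of that generic step-up logic (not the paper's optimal procedures):

```python
def hochberg_two(p1, p2, alpha=0.05):
    """Step-up test of two hypotheses; returns the set of rejected indices."""
    lo, hi = sorted([(p1, 1), (p2, 2)])
    if hi[0] <= alpha:        # larger p-value already small: reject both
        return {1, 2}
    if lo[0] <= alpha / 2:    # else Bonferroni-level test of the smaller one
        return {lo[1]}
    return set()
```

For example, p-values (0.03, 0.04) reject both hypotheses at alpha = 0.05, while (0.04, 0.2) rejects neither, since 0.04 exceeds alpha/2.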
Subjects
Research Design, Computer Simulation, Sample Size, Clinical Trials as Topic
ABSTRACT
We discuss three issues. In the first part, we discuss the criterion emphasized by Maurer, Bretz, and Xun, warning that it amounts to a modified per-comparison error rate and therefore does not address the concerns raised by multiple testing. In the second part, we strengthen the optimality results developed in the paper, based on our recent results. In the third part, we highlight the potentially important role that the use of weights may have in practice and discuss the difficulties in assigning weights that convey importance in the gain and loss functions, especially as it pertains to multiple endpoints.
Subjects
Research Design, Statistical Data Interpretation
ABSTRACT
Interpolators (estimators that achieve zero training error) have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum ℓ2-norm ("ridgeless") interpolation least squares regression, focusing on the high-dimensional regime in which the number of unknown parameters p is of the same order as the number of samples n. We consider two different models for the feature distribution: a linear model, where the feature vectors x_i ∈ ℝ^p are obtained by applying a linear transform to a vector of i.i.d. entries, x_i = Σ^{1/2} z_i (with z_i ∈ ℝ^p); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, x_i = φ(W z_i) (with z_i ∈ ℝ^d, W ∈ ℝ^{p×d} a matrix of i.i.d. entries, and φ an activation function acting componentwise on W z_i). We recover, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk and the potential benefits of overparametrization.
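The minimum ℓ2-norm interpolator studied here has a closed form via the Moore-Penrose pseudoinverse, and any other interpolating solution differs from it by a null-space component of strictly larger norm. A small numerical sketch (toy dimensions, random data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                    # overparametrized regime: p > n
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

Xp = np.linalg.pinv(X)
beta = Xp @ y                    # minimum ell_2-norm least-squares solution

# any other interpolator adds a null-space component, which only adds norm
v = (np.eye(p) - Xp @ X) @ rng.standard_normal(p)
beta_other = beta + v
```

Both solutions fit the training data exactly (zero training error), but the pseudoinverse solution has the smaller Euclidean norm.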
ABSTRACT
Statistical criteria have long been the standard for selecting the best model for phylogenetic reconstruction and downstream statistical inference. Although model selection is regarded as a fundamental step in phylogenetics, existing methods for this task are computationally expensive, are not always feasible, and sometimes depend on preliminary assumptions that do not hold for sequence data. Moreover, although these methods are dedicated to revealing the processes that underlie the sequence data, they do not always produce the most accurate trees. Notably, phylogeny reconstruction consists of two related tasks, topology reconstruction and branch-length estimation. It was previously shown that in many cases the most complex model, GTR+I+G, leads to topologies that are as accurate as those chosen by existing model selection criteria, but overestimates branch lengths. Here, we present ModelTeller, a computational methodology for phylogenetic model selection, devised within the machine-learning framework, optimized to predict the most accurate nucleotide substitution model for branch-length estimation. We demonstrate that ModelTeller leads to more accurate branch-length inference than current model selection criteria on data sets simulated under realistic processes. ModelTeller relies on a readily implemented machine-learning model, and thus prediction based on features extracted from the sequence data results in a substantial decrease in running time compared with existing strategies. By harnessing the machine-learning framework, we distinguish between features that mostly contribute to branch-length optimization, concerning the extent of sequence divergence, and features related to estimates of the model parameters that are important for the selection made by current criteria.
Subjects
Machine Learning, Genetic Models, Phylogeny
ABSTRACT
Methods that estimate SNP-based heritability and genetic correlations from genome-wide association studies have proven to be powerful tools for investigating the genetic architecture of common diseases and exposing unexpected relationships between disorders. Many relevant studies employ a case-control design, yet most methods are primarily geared toward analyzing quantitative traits. Here we investigate the validity of three common methods for estimating SNP-based heritability and genetic correlation between diseases. We find that the phenotype-correlation-genotype-correlation (PCGC) approach is the only method that can estimate both quantities accurately in the presence of important non-genetic risk factors, such as age and sex. We extend PCGC to work with arbitrary genetic architectures and with summary statistics that take the case-control sampling into account, and we demonstrate that our new method, PCGC-s, accurately estimates both SNP-based heritability and genetic correlations and can be applied to large datasets without requiring individual-level genotypic or phenotypic information. Finally, we use PCGC-s to estimate the genetic correlation between schizophrenia and bipolar disorder and demonstrate that previous estimates are biased, partially due to incorrect handling of sex as a strong risk factor.
Subjects
Disease/genetics, Single Nucleotide Polymorphism/genetics, Case-Control Studies, Genetic Association Studies/methods, Genome-Wide Association Study/methods, Genotype, Humans, Genetic Models, Phenotype
ABSTRACT
Estimation of heritability is fundamental in genetic studies. Recently, heritability estimation using linear mixed models (LMMs) has gained popularity because these estimates can be obtained from unrelated individuals collected in genome-wide association studies. Typically, heritability estimation under LMMs uses the restricted maximum likelihood (REML) approach. Existing methods for the construction of confidence intervals and estimators of SEs for REML rely on asymptotic properties. However, these assumptions are often violated because of the bounded parameter space, statistical dependencies, and limited sample size, leading to biased estimates and inflated or deflated confidence intervals. Here, we show that the estimation of confidence intervals by state-of-the-art methods is inaccurate, especially when the true heritability is relatively low or relatively high. We further show that these inaccuracies occur in datasets including thousands of individuals. Such biases are present, for example, in estimates of heritability of gene expression in the Genotype-Tissue Expression project and of lipid profiles in the Ludwigshafen Risk and Cardiovascular Health study. We also show that often the probability that the genetic component is estimated as 0 is high even when the true heritability is bounded away from 0, emphasizing the need for accurate confidence intervals. We propose a computationally efficient method, ALBI (accurate LMM-based heritability bootstrap confidence intervals), for estimating the distribution of the heritability estimator and for constructing accurate confidence intervals. Our method can be used as an add-on to existing methods for estimating heritability and variance components, such as GCTA, FaST-LMM, GEMMA, or EMMAX.
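The bootstrap idea behind ALBI can be illustrated generically: given a point estimate, resimulate data under the fitted model, re-estimate, and take quantiles of the re-estimates. The sketch below replaces the REML heritability estimator with a toy estimator truncated to [0, 1], mimicking the boundary pile-up described above; all numbers are hypothetical and this is not ALBI itself.

```python
import numpy as np

rng = np.random.default_rng(2)

def estimate(sample):
    # toy bounded estimator standing in for REML heritability:
    # a sample mean truncated to [0, 1], so estimates pile up at 0
    return min(1.0, max(0.0, float(sample.mean())))

def bootstrap_ci(theta_hat, n, sd, n_boot=2000, alpha=0.05):
    """Parametric-bootstrap percentile CI for the truncated estimator."""
    est = np.array([estimate(rng.normal(theta_hat, sd, size=n))
                    for _ in range(n_boot)])
    return np.quantile(est, [alpha / 2, 1 - alpha / 2])

obs = rng.normal(0.05, 0.5, size=50)   # true parameter near the 0 boundary
theta_hat = estimate(obs)
lo, hi = bootstrap_ci(theta_hat, n=50, sd=0.5)
```

The bootstrap distribution respects the parameter bounds, which is exactly where asymptotic (Gaussian) intervals break down.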
Subjects
Cardiovascular Diseases/genetics, Confidence Intervals, Gene-Environment Interaction, Multifactorial Inheritance/genetics, Single Nucleotide Polymorphism/genetics, Heritable Quantitative Trait, Computer Simulation, Genome-Wide Association Study, Genotype, Humans, Genetic Models, Statistical Models
ABSTRACT
Linear mixed models (LMMs) and their extensions have recently become the method of choice in phenotype prediction for complex traits. However, LMM use to date has typically been limited by assuming simple genetic architectures. Here, we present multikernel linear mixed model (MKLMM), a predictive modeling framework that extends the standard LMM using multiple-kernel machine learning approaches. MKLMM can model genetic interactions and is particularly suitable for modeling complex local interactions between nearby variants. We additionally present MKLMM-Adapt, which automatically infers interaction types across multiple genomic regions. In an analysis of eight case-control data sets from the Wellcome Trust Case Control Consortium and more than a hundred mouse phenotypes, MKLMM-Adapt consistently outperforms competing methods in phenotype prediction. MKLMM is as computationally efficient as standard LMMs and does not require storage of genotypes, thus achieving state-of-the-art predictive power without compromising computational feasibility or genomic privacy.
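The multiple-kernel idea can be sketched as kernel ridge regression on a fixed mixture of a linear and an RBF kernel. MKLMM itself learns the combination within an LMM; the mixture weights, penalty, and data below are toy choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 20
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * (X @ rng.standard_normal(d))

K_lin = X @ X.T / d                                  # linear kernel
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
K_rbf = np.exp(-sq / (2.0 * d))                      # RBF kernel

K = 0.5 * K_lin + 0.5 * K_rbf       # fixed mixture weights for the sketch
alpha = np.linalg.solve(K + np.eye(n), y)   # kernel ridge fit, lambda = 1
y_hat = K @ alpha
```

Because the combined kernel is positive definite, the fit strictly reduces the in-sample squared error relative to predicting zero; the nonlinear kernel is what lets the model capture local interaction structure.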
Subjects
Genetic Models, Algorithms, Animals, Case-Control Studies, Ulcerative Colitis/genetics, Computer Simulation, Humans, Linear Models, Mice, Phenotype, Software
ABSTRACT
BACKGROUND: The metabolic syndrome (MetS) is associated with overweight and abdominal obesity. Our aim was to use longitudinal measurements to provide clinically relevant information on the relative influence of changes in body mass index (BMI), waist circumference (WC), and weekly physical exercise duration on the development of each of the MetS components. METHODS: We analyzed data collected at the Tel-Aviv Medical Center Inflammation Survey (TAMCIS). Apparently healthy individuals with two consecutive visits who were not treated for any metabolic criteria were included in this study. We analyzed the influence of changes in BMI, WC, and time engaged in physical exercise on the change in each of the components of the metabolic syndrome using linear regressions. RESULTS: Included were 7532 individuals (5431 men, 2101 women) with 2 years of follow-up. Among participants who gained two BMI points, the mean number of MetS criteria increased from 1.07 to 1.52, while among participants who lost two BMI points it decreased from 1.64 to 1.16. A long-term analysis over 5 years showed similar results. Furthermore, an increase in WC was independently associated with increased severity of each of the other components when controlling for increase in BMI. An increase in weekly exercise duration had a small but statistically significant favorable effect on blood triglycerides and HDL levels, but not on blood pressure or HbA1C. CONCLUSIONS: Changes in BMI and WC are strongly associated with the likelihood and severity of the MetS independently of the baseline levels, suggesting that obese individuals can substantially improve their MetS prognosis by losing both body weight and abdominal fat.
Subjects
Inflammation/complications, Metabolic Syndrome/etiology, Abdominal Obesity/complications, Weight Gain/physiology, Adult, Body Mass Index, Female, Follow-Up Studies, Health Surveys, Humans, Inflammation/epidemiology, Inflammation/physiopathology, Israel/epidemiology, Male, Metabolic Syndrome/epidemiology, Metabolic Syndrome/physiopathology, Middle Aged, Abdominal Obesity/epidemiology, Abdominal Obesity/physiopathology, Risk Factors, Waist Circumference/physiology
ABSTRACT
BACKGROUND: We study Phylotree, a comprehensive representation of the phylogeny of global human mitochondrial DNA (mtDNA) variations, to better understand the mtDNA substitution mechanism and its most influential factors. We consider a substitution model, where a set of genetic features may predict the rate at which mtDNA substitutions occur. To find an appropriate model, an exhaustive analysis of the effect of multiple factors on the substitution rate is performed through Negative Binomial and Poisson regressions. We examine three different inclusion options for each categorical factor: omission, inclusion as an explanatory variable, and by-value partitioning. The examined factors include genes, codon position, a CpG indicator, directionality, nucleotide, amino acid, codon, and context (neighboring nucleotides), in addition to other site-based factors. Partitioning a model by a factor's value results in several sub-models (one for each value), where the likelihoods of the sub-models can be combined to form a score for the entire model. Finally, the leading models are considered viable candidates for explaining mtDNA substitution rates. RESULTS: Initially, we introduce a novel clustering technique on genes, based on three similarity tests between pairs of genes, supporting previous results regarding gene functionalities in the mtDNA. These clusters are then used as a factor in our models. We present leading models for the protein-coding genes, rRNA and tRNA genes, and the control region, showing it is disadvantageous to separate the models of transitions/transversions, or synonymous/non-synonymous substitutions. We identify a context effect that cannot be attributed solely to protein-level constraints or CpG pairs.
For protein-coding genes, we show that the substitution model should be partitioned into sub-models according to the codon position and input codon; additionally we confirm that gene identity and cluster have no significant effect once the above factors are accounted for. CONCLUSIONS: We leverage the large, high-confidence Phylotree mtDNA phylogeny to develop a new statistical approach. We model the substitution rates using regressions, allowing consideration of many factors simultaneously. This admits the use of model selection tools helping to identify the set of factors best explaining the mutational dynamics when considered in tandem.
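The "by-value partitioning" scoring described above can be sketched for an intercept-only Poisson model, where the MLE rate of each sub-model is simply the group mean and the sub-model log-likelihoods add up. Toy substitution counts; all values are hypothetical.

```python
from math import log, lgamma

def poisson_loglik(counts, rate):
    """Poisson log-likelihood of the counts at a given rate."""
    return sum(c * log(rate) - rate - lgamma(c + 1) for c in counts)

# substitution counts at sites, grouped by codon position (toy data)
by_position = {1: [2, 3, 1, 2], 2: [0, 1, 0, 1], 3: [6, 8, 7, 5]}

# option 1: one pooled model; the intercept-only Poisson MLE is the mean
pooled = [c for grp in by_position.values() for c in grp]
ll_pooled = poisson_loglik(pooled, sum(pooled) / len(pooled))

# option 2: partition by codon position and combine sub-model likelihoods
ll_partitioned = sum(
    poisson_loglik(grp, sum(grp) / len(grp)) for grp in by_position.values()
)
# partitioning can only raise the maximized likelihood; model selection
# criteria such as AIC must penalize the extra rate parameters
```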
Subjects
Big Data, Mitochondrial DNA/genetics, Statistical Models, Algorithms, Data Mining, Humans, Poisson Distribution, Ribosomal RNA/genetics, Transfer RNA/genetics, Regression Analysis
ABSTRACT
The cortical layers are a fingerprint of brain development, function, connectivity, and pathology. The formation of the layers and their composition is essential to cognition and behavior. The layers were traditionally measured by histological means, but recent studies utilizing MRI suggested that T1 relaxation imaging provides enough contrast to separate the layers. Indeed, extreme-resolution post-mortem studies demonstrated this phenomenon. Yet one of the limiting factors of using T1 MRI to visualize the layers in neuroimaging research is the partial volume effect, which occurs when the image resolution is not high enough and two or more layers reside within the same voxel. In this paper we demonstrate that, because of the small physical thickness of the layers, it is highly unlikely that high-resolution imaging could resolve them. By contrast, we suggest that low-resolution multi-T1 mapping combined with composition analysis could provide practical means for measuring the layers. We suggest an acquisition platform that is clinically feasible and could quantify measures of the layers. The key feature of the suggested platform is that separation of the layers is better achieved in the T1 relaxation domain rather than in the spatial image domain.
Subjects
Brain Mapping/methods, Cerebral Cortex/diagnostic imaging, Computer-Assisted Image Processing/methods, Magnetic Resonance Imaging/methods, Adult, Animals, Female, Humans, Male, Rats
ABSTRACT
MOTIVATION: Epigenome-wide association studies can provide novel insights into the regulation of genes involved in traits and diseases. The rapid emergence of bisulfite-sequencing technologies enables performing such genome-wide studies at the resolution of single nucleotides. However, analysis of data produced by bisulfite sequencing poses statistical challenges owing to low and uneven sequencing depth, as well as the presence of confounding factors. The recently introduced Mixed model Association for Count data via data AUgmentation (MACAU) can address these challenges via a generalized linear mixed model when confounding can be encoded via a single variance component. However, MACAU cannot be used in the presence of multiple variance components. Additionally, MACAU uses a computationally expensive Markov Chain Monte Carlo (MCMC) procedure, which cannot directly approximate the model likelihood. RESULTS: We present a new method, Mixed model Association via a Laplace ApproXimation (MALAX), that is more computationally efficient than MACAU and allows modeling of multiple variance components. MALAX uses a Laplace approximation rather than MCMC-based approximations, which enables direct approximation of the model likelihood. Through an extensive analysis of simulated and real data, we demonstrate that MALAX successfully addresses statistical challenges introduced by bisulfite sequencing while controlling for complex sources of confounding, and can be over 50% faster than the state of the art. AVAILABILITY AND IMPLEMENTATION: The full source code of MALAX is available at https://github.com/omerwe/MALAX . CONTACT: omerw@cs.technion.ac.il or ehalperin@cs.ucla.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
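The Laplace approximation that replaces MCMC here approximates an intractable integral by a Gaussian integral around the mode of the integrand. A one-dimensional sketch of the generic textbook form (not MALAX's actual likelihood):

```python
from math import exp, sqrt, pi

def laplace_integral(f, d2f, mode):
    """Laplace approximation: ∫ exp(f(u)) du ≈ exp(f(û)) · sqrt(2π / -f''(û)),
    where û is the mode of f (so f''(û) < 0)."""
    return exp(f(mode)) * sqrt(2 * pi / -d2f(mode))

# sanity check on a case with a known answer:
# f(u) = -u^2/2 gives exactly ∫ exp(-u^2/2) du = sqrt(2π)
approx = laplace_integral(lambda u: -u * u / 2, lambda u: -1.0, 0.0)
```

For a Gaussian integrand the approximation is exact; for the non-Gaussian integrands arising in GLMMs it is deterministic and cheap, unlike MCMC, which is the computational advantage the abstract describes.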
Subjects
DNA Methylation, Epigenomics/methods, DNA Sequence Analysis/methods, Software, Humans, Markov Chains, Monte Carlo Method, Sulfites
ABSTRACT
Background: Inheritance of apolipoprotein L1 gene (APOL1) renal-risk variants in a recessive pattern strongly associates with non-diabetic end-stage kidney disease (ESKD). Further evidence supports risk modifiers in APOL1-associated nephropathy; some studies demonstrate that heterozygotes possess excess risk for ESKD or show earlier age at ESKD, relative to those with zero risk alleles. Nearby loci are also associated with ESKD in non-African Americans. Methods: We assessed the role of the APOL3 null allele rs11089781 on risk of non-diabetic ESKD. Four cohorts containing 2781 ESKD cases and 2474 controls were analyzed. Results: Stratifying by APOL1 risk genotype (recessive) and adjusting for African ancestry identified a significant additive association between rs11089781 and ESKD in each stratum and in a meta-analysis [meta-analysis P = 0.0070; odds ratio (OR) = 1.29]; ORs were consistent across APOL1 risk strata. The biological significance of this association is supported by the finding that the APOL3 gene is co-regulated with APOL1, and that APOL3 protein was able to bind to APOL1 protein. Conclusions: Taken together, the genetic and biological data support the concept that other APOL proteins besides APOL1 may also influence the risk of non-diabetic ESKD.
Subjects
Apolipoproteins L/genetics, Genetic Predisposition to Disease, Glomerulonephritis/genetics, Focal Segmental Glomerulosclerosis/genetics, Chronic Kidney Failure/genetics, Single Nucleotide Polymorphism, Case-Control Studies, Genotype, Humans, Meta-Analysis as Topic, Prognosis
ABSTRACT
For predicting genetic risk, we propose a statistical approach that is specifically adapted to dealing with the challenges imposed by disease phenotypes and case-control sampling. Our approach (termed Genetic Risk Scores Inference [GeRSI]) combines the power of fixed-effects models (which estimate and aggregate the effects of single SNPs) and random-effects models (which rely primarily on whole-genome similarities between individuals) within the framework of the widely used liability-threshold model. We demonstrate in extensive simulation that GeRSI produces predictions that are consistently superior to current state-of-the-art approaches. When applying GeRSI to seven phenotypes from the Wellcome Trust Case Control Consortium (WTCCC) study, we confirm that the use of random effects is most beneficial for diseases that are known to be highly polygenic: hypertension (HT) and bipolar disorder (BD). For HT, there are no significant associations in the WTCCC data. The fixed-effects model yields an area under the ROC curve (AUC) of 54%, whereas GeRSI improves it to 59%. For BD, using GeRSI improves the AUC from 55% to 62%. For individuals ranked at the top 10% of BD risk predictions, using GeRSI substantially increases the BD relative risk from 1.4 to 2.5.
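The AUC figures quoted above have a simple rank interpretation: the probability that a randomly chosen case receives a higher risk score than a randomly chosen control, with ties counting one half. A direct sketch of that definition:

```python
def auc(case_scores, control_scores):
    """AUC as a rank statistic: the probability that a random case
    outscores a random control, counting ties as one half."""
    pairs = [(c > d) + 0.5 * (c == d)
             for c in case_scores for d in control_scores]
    return sum(pairs) / len(pairs)
```

An AUC of 0.5 (50%) means the scores are uninformative; an improvement from 55% to 62%, as reported for BD, means cases outrank controls noticeably more often.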
Subjects
Computational Biology, Disease/genetics, Genetic Predisposition to Disease, Statistical Models, Multifactorial Inheritance/genetics, Case-Control Studies, Genome-Wide Association Study, Humans, Single Nucleotide Polymorphism/genetics, Risk Assessment
ABSTRACT
Genome-wide association studies (GWASs), also called common variant association studies (CVASs), have uncovered thousands of genetic variants associated with hundreds of diseases. However, the variants that reach statistical significance typically explain only a small fraction of the heritability. One explanation for the "missing heritability" is that there are many additional disease-associated common variants whose effects are too small to detect with current sample sizes. It therefore is useful to have methods to quantify the heritability due to common variation, without having to identify all causal variants. Recent studies applied restricted maximum likelihood (REML) estimation to case-control studies for diseases. Here, we show that REML considerably underestimates the fraction of heritability due to common variation in this setting. The degree of underestimation increases with the rarity of disease, the heritability of the disease, and the size of the sample. Instead, we develop a general framework for heritability estimation, called phenotype correlation-genotype correlation (PCGC) regression, which generalizes the well-known Haseman-Elston regression method. We show that PCGC regression yields unbiased estimates. Applying PCGC regression to six diseases, we estimate the proportion of the phenotypic variance due to common variants to range from 25% to 56% and the proportion of heritability due to common variants from 41% to 68% (mean 60%). These results suggest that common variants may explain at least half the heritability for many diseases. PCGC regression also is readily applicable to other settings, including analyzing extreme-phenotype studies and adjusting for covariates such as sex, age, and population structure.
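The core of Haseman-Elston/PCGC-type regression for a quantitative trait is regressing products of standardized phenotypes on pairwise genetic relatedness; the slope estimates heritability. A toy simulation with Gaussian stand-ins for genotypes (no covariates or case-control ascertainment, so this illustrates the Haseman-Elston special case rather than full PCGC):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, h2 = 800, 1600, 0.5
Z = rng.standard_normal((n, m))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)   # standardized "genotypes"
K = Z @ Z.T / m                            # genetic relatedness matrix

g = Z @ rng.standard_normal(m) * np.sqrt(h2 / m)   # genetic values
y = g + rng.standard_normal(n) * np.sqrt(1 - h2)   # phenotype
y = (y - y.mean()) / y.std()

iu = np.triu_indices(n, k=1)               # distinct pairs i < j
prod, kin = np.outer(y, y)[iu], K[iu]
h2_est = np.cov(prod, kin)[0, 1] / np.var(kin)     # regression slope
```

With these sizes the slope recovers the simulated heritability of 0.5 up to sampling noise; unlike REML, this moment-based estimator stays unbiased under the ascertainment settings the abstract discusses.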
Subjects
Inborn Genetic Diseases/genetics, Genetic Variation, Research Design, Alleles, Case-Control Studies, Computer Simulation, Gene Frequency, Genetic Association Studies, Genome-Wide Association Study, Genomics, Genotype, Humans, Genetic Models, Statistical Models, Phenotype, Single Nucleotide Polymorphism, Regression Analysis
ABSTRACT
BACKGROUND: The generalization of the second Chargaff rule states that counts of any string of nucleotides of length k on a single chromosomal strand equal the counts of its inverse (reverse-complement) k-mer. This Inversion Symmetry (IS) holds for many species, both eukaryotes and prokaryotes, for ranges of k which may vary from 7 to 10 as chromosomal lengths vary from 2 Mbp to 200 Mbp. The existence of IS has been demonstrated in the literature, and other pair-wise candidate symmetries (e.g. reverse or complement) have been ruled out. RESULTS: Studying IS in the human genome, we find that IS holds up to k = 10. It holds for complete chromosomes, also after applying the low complexity mask. We introduce a numerical IS criterion, and define the k-limit, KL, as the highest k for which this criterion is valid. We demonstrate that chromosomes of different species, as well as different human chromosomal sections, follow a universal logarithmic dependence of KL ~ 0.7 ln(L), where L is the length of the chromosome. We introduce a statistical IS-Poisson model that allows us to apply confidence measures to our numerical findings. We find good agreement for large k, where the variance of the Poisson distribution determines the outcome of the analysis. This model predicts the observed logarithmic increase of KL with length. The model allows us to conclude that for low k, e.g. k = 1 where IS becomes the second Chargaff rule, IS violation, although extremely small, is significant. Studying this violation we come up with an unexpected observation for human chromosomes, finding a meaningful correlation with the excess of genes on particular strands. CONCLUSIONS: Our IS-Poisson model agrees well with genomic data, and accounts for the universal behavior of k-limits. For low k we point out minute, yet significant, deviations from the model, including excess of counts of nucleotides T vs A and G vs C on positive strands of human chromosomes.
Interestingly, this correlates with a significant (but small) excess of genes on the same positive strands.
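The Inversion Symmetry test itself is easy to state: compare each k-mer's count on a strand with the count of its reverse complement. A small sketch with a toy sequence (real analyses use whole chromosomes and the Poisson model described above):

```python
from collections import Counter

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(COMP)[::-1]

def kmer_counts(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def is_deviation(seq, k):
    """Mean relative difference between each k-mer's count and its
    reverse complement's count; 0 means perfect inversion symmetry."""
    counts = kmer_counts(seq, k)
    devs = [abs(n - counts.get(revcomp(w), 0)) / (n + counts.get(revcomp(w), 0))
            for w, n in counts.items()]
    return sum(devs) / len(devs)

# a sequence followed by its own reverse complement equals its own reverse
# complement, so it satisfies inversion symmetry exactly for every k
half = "ACGTTGCAACCGGTTAGCT"
sym = half + revcomp(half)
```

For k = 1 this deviation measure reduces to a check of the second Chargaff rule (A vs T, C vs G counts on one strand).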
Subjects
Human Chromosomes/genetics, DNA/genetics, Humans, Genetic Models, Poisson Distribution
ABSTRACT
Contemporary Jews comprise an aggregate of ethno-religious communities whose worldwide members identify with each other through various shared religious, historical and cultural traditions. Historical evidence suggests common origins in the Middle East, followed by migrations leading to the establishment of communities of Jews in Europe, Africa and Asia, in what is termed the Jewish Diaspora. This complex demographic history imposes special challenges in attempting to address the genetic structure of the Jewish people. Although many genetic studies have shed light on Jewish origins and on diseases prevalent among Jewish communities, including studies focusing on uniparentally and biparentally inherited markers, genome-wide patterns of variation across the vast geographic span of Jewish Diaspora communities and their respective neighbours have yet to be addressed. Here we use high-density bead arrays to genotype individuals from 14 Jewish Diaspora communities and compare these patterns of genome-wide diversity with those from 69 Old World non-Jewish populations, of which 25 have not previously been reported. These samples were carefully chosen to provide comprehensive comparisons between Jewish and non-Jewish populations in the Diaspora, as well as with non-Jewish populations from the Middle East and north Africa. Principal component and structure-like analyses identify previously unrecognized genetic substructure within the Middle East. Most Jewish samples form a remarkably tight subcluster that overlies Druze and Cypriot samples but not samples from other Levantine populations or paired Diaspora host populations. In contrast, Ethiopian Jews (Beta Israel) and Indian Jews (Bene Israel and Cochini) cluster with neighbouring autochthonous populations in Ethiopia and western India, respectively, despite a clear paternal link between the Bene Israel and the Levant. 
These results cast light on the variegated genetic architecture of the Middle East, and trace the origins of most Jewish Diaspora communities to the Levant.
Subjects
Human Genome/genetics, Jews/genetics, North Africa/ethnology, Alleles, Asia, Human Y Chromosome/genetics, Mitochondrial DNA/genetics, Ethiopia/ethnology, Europe, Genotype, Geography, Humans, India/ethnology, Jews/classification, Middle East/ethnology, Phylogeny, Principal Component Analysis
ABSTRACT
Issues of publication bias, lack of replicability, and false discovery have long plagued the genetics community. Proper utilization of public and shared data resources presents an opportunity to ameliorate these problems. We present an approach to public database management that we term Quality Preserving Database (QPD). It enables perpetual use of the database for testing statistical hypotheses while controlling false discovery and avoiding publication bias on the one hand, and maintaining testing power on the other hand. We demonstrate it on a use case of a replication server for GWAS findings, underlining its practical utility. We argue that a shift to using QPD in managing current and future biological databases will significantly enhance the community's ability to make efficient and statistically sound use of the available data resources.
Subjects
Factual Databases/standards, Information Management/methods, Public Sector, Factual Databases/economics, Information Management/economics, Information Management/standards, Publication Bias, Quality Control, Reproducibility of Results
ABSTRACT
Mutational events along the human mtDNA phylogeny are traditionally identified relative to the revised Cambridge Reference Sequence, a contemporary European sequence published in 1981. This historical choice is a continuous source of inconsistencies, misinterpretations, and errors in medical, forensic, and population genetic studies. Here, after having refined the human mtDNA phylogeny to an unprecedented level by adding information from 8,216 modern mitogenomes, we propose switching the reference to a Reconstructed Sapiens Reference Sequence, which was identified by considering all available mitogenomes from Homo neanderthalensis. This "Copernican" reassessment of the human mtDNA tree from its deepest root should resolve previous problems and will have a substantial practical and educational influence on the scientific and public perception of human evolution by clarifying the core principles of common ancestry for extant descendants.