Results 1 - 20 of 60
1.
BMC Genom Data ; 25(1): 4, 2024 01 02.
Article in English | MEDLINE | ID: mdl-38166646

ABSTRACT

BACKGROUND: We tackle the problem of estimating species TMRCAs (Time to Most Recent Common Ancestor), given a genome sequence from each species and a large known phylogenetic tree with a known structure (typically from one of the species). The number of transitions at each site from the first sequence to the other is assumed to be Poisson distributed, and only the parity of the number of transitions is observed. The detailed phylogenetic tree contains information about the transition rates in each site. We use this formulation to develop and analyze multiple estimators of the species' TMRCA. To test our methods, we use mtDNA substitution statistics from the well-established Phylotree as a baseline for data simulation such that the substitution rate per site mimics the real-world observed rates. RESULTS: We evaluate our methods using simulated data and compare them to the Bayesian optimizing software BEAST2, showing that our proposed estimators are accurate for a wide range of TMRCAs and significantly outperform BEAST2. We then apply the proposed estimators on Neanderthal, Denisovan, and Chimpanzee mtDNA genomes to better estimate their TMRCA with modern humans and find that their TMRCA is substantially later, compared to values cited recently in the literature. CONCLUSIONS: Our methods utilize the transition statistics from the entire known human mtDNA phylogenetic tree (Phylotree), eliminating the requirement to reconstruct a tree encompassing the specific sequences of interest. Moreover, they demonstrate notable improvement in both running speed and accuracy compared to BEAST2, particularly for earlier TMRCAs like the human-Chimpanzee split. Our results date the human - Neanderthal TMRCA to be [Formula: see text] years ago, considerably later than values cited in other recent studies.


Subjects
Hominidae , Neanderthals , Animals , Humans , Neanderthals/genetics , Phylogeny , Pan troglodytes/genetics , Bayes Theorem , Hominidae/genetics , Mitochondrial DNA/genetics
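
Note on entry 1: the abstract assumes a Poisson-distributed number of transitions per site, of which only the parity is observed. The sketch below illustrates that observation model only. It is not the paper's estimator: it assumes a single shared rate for all sites (the paper uses per-site rates from Phylotree), and the function names, the `rate` parameter, and the convention that `t` is the total elapsed time separating the two sequences are all assumptions made for illustration. For N ~ Poisson(rate * t), the probability that N is odd is (1 - exp(-2 * rate * t)) / 2, which can be inverted for t.

```python
import math


def odd_parity_prob(rate, t):
    """P(N is odd) for N ~ Poisson(rate * t); a site 'differs' when N is odd."""
    return 0.5 * (1.0 - math.exp(-2.0 * rate * t))


def estimate_time_equal_rates(n_diff, n_sites, rate):
    """Invert the parity probability, assuming one shared rate for all sites.

    n_diff  -- number of sites at which the two sequences differ
    n_sites -- number of sites compared
    rate    -- hypothetical shared transition rate per site per unit time
    """
    p_hat = n_diff / n_sites
    if p_hat >= 0.5:
        # At or beyond saturation the parity signal carries no time information.
        raise ValueError("fraction of differing sites must be below 0.5")
    return -math.log(1.0 - 2.0 * p_hat) / (2.0 * rate)
```

With heterogeneous per-site rates there is no closed form, and the time would instead be found by numerically maximizing the corresponding Bernoulli likelihood over all sites.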
2.
Biometrics ; 79(4): 2794-2797, 2023 12.
Article in English | MEDLINE | ID: mdl-38115576

ABSTRACT

We discuss three issues. In the first part, we discuss the criterion emphasized by Maurer, Bretz, and Xun, warning that it modifies the per-comparison error rate, which does not address the concerns raised by multiple testing. In the second part, we strengthen the optimality results developed in the paper, based on our recent results. In the third part, we highlight the potentially important role that the use of weights may have in practice and discuss the difficulties in assigning weights that convey the importance in the gain and loss functions, especially as it pertains to multiple endpoints.


Subjects
Research Design , Statistical Data Interpretation
3.
J Med Genet ; 60(12): 1186-1197, 2023 Nov 27.
Article in English | MEDLINE | ID: mdl-37451831

ABSTRACT

BACKGROUND: Polygenic risk score (PRS), calculated based on genome-wide association studies (GWASs), can improve breast cancer (BC) risk assessment. To date, most BC GWASs have been performed in individuals of European (EUR) ancestry, and the generalisation of EUR-based PRS to other populations is a major challenge. In this study, we examined the performance of EUR-based BC PRS models in Ashkenazi Jewish (AJ) women. METHODS: We generated PRSs based on data on EUR women from the Breast Cancer Association Consortium (BCAC). We tested the performance of the PRSs in a cohort of 2161 AJ women from Israel (1437 cases and 724 controls) from BCAC (BCAC cohort from Israel (BCAC-IL)). In addition, we tested the performance of these EUR-based BC PRSs, as well as the established 313-SNP EUR BC PRS, in an independent cohort of 181 AJ women from Hadassah Medical Center (HMC) in Israel. RESULTS: In the BCAC-IL cohort, the highest OR per 1 SD was 1.56 (±0.09). The OR for AJ women at the top 10% of the PRS distribution compared with the middle quintile was 2.10 (±0.24). In the HMC cohort, the OR per 1 SD of the EUR-based PRS that performed best in the BCAC-IL cohort was 1.58 (±0.27). The OR per 1 SD of the commonly used 313-SNP BC PRS was 1.64 (±0.28). CONCLUSIONS: Extant EUR GWAS data can be used for generating PRSs that identify AJ women with markedly elevated risk of BC and therefore hold promise for improving BC risk assessment in AJ women.


Subjects
Breast Neoplasms , Humans , Female , Breast Neoplasms/epidemiology , Breast Neoplasms/genetics , Genome-Wide Association Study , Jews/genetics , Israel/epidemiology , Genetic Predisposition to Disease , Risk Factors , Multifactorial Inheritance/genetics , Transcription Factors
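
Note on entry 3: the abstract reports odds ratios "per 1 SD" of a PRS but does not spell out how the score is formed. A common definition, assumed here, is a weighted sum of risk-allele dosages (0, 1, or 2 per SNP) with GWAS effect sizes as weights, standardized against a reference group. The function names and the choice of reference group below are hypothetical and are not taken from the paper.

```python
import statistics


def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0/1/2) using per-SNP effect sizes."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * g for w, g in zip(weights, dosages))


def standardize(scores, reference_scores):
    """Express scores in SD units of a reference group (for example, controls)."""
    mu = statistics.mean(reference_scores)
    sd = statistics.stdev(reference_scores)
    return [(s - mu) / sd for s in scores]
```

Once scores are in SD units, the coefficient of a logistic regression of case status on the standardized score, exponentiated, gives an odds ratio per 1 SD.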
4.
Biometrics ; 79(3): 1908-1919, 2023 09.
Article in English | MEDLINE | ID: mdl-35899317

ABSTRACT

A central goal in designing clinical trials is to find the test that maximizes power (or equivalently minimizes required sample size) for finding a false null hypothesis subject to the constraint of type I error. When there is more than one test, such as in clinical trials with multiple endpoints, the issues of optimal design and optimal procedures become more complex. In this paper, we address the question of how such optimal tests should be defined and how they can be found. We review different notions of power and how they relate to study goals, and also consider the requirements of type I error control and the nature of the procedures. This leads us to an explicit optimization problem with objective and constraints that describe its specific desiderata. We present a complete solution for deriving optimal procedures for two hypotheses, which have desired monotonicity properties, and are computationally simple. For some of the optimization formulations this yields optimal procedures that are identical to existing procedures, such as Hommel's procedure or the procedure of Bittman et al. (2009), while for other cases it yields completely novel and more powerful procedures than existing ones. We demonstrate the nature of our novel procedures and their improved power extensively in a simulation and on the APEX study (Cohen et al., 2016).


Subjects
Research Design , Computer Simulation , Sample Size , Clinical Trials as Topic
5.
bioRxiv ; 2023 Dec 13.
Article in English | MEDLINE | ID: mdl-38168200

ABSTRACT

Understanding the contribution of gene-environment interactions (GxE) to complex trait variation can provide insights into mechanisms underlying disease risk, explain sources of heritability, and improve the accuracy of genetic risk prediction. While biobanks that collect genetic and deep phenotypic data over large numbers of individuals offer the promise of obtaining novel insights into GxE, our understanding of the architecture of GxE in complex traits remains limited. We introduce a method that can estimate the proportion of trait variance explained by GxE (GxE heritability) and additive genetic effects (additive heritability) across the genome and within specific genomic annotations. We show that our method is accurate in simulations and computationally efficient for biobank-scale datasets. We applied our method to ≈ 500,000 common array SNPs (MAF ≥ 1%), fifty quantitative traits, and four environmental variables (smoking, sex, age, and statin usage) measured across ≈ 300,000 unrelated white British individuals in the UK Biobank. We found 69 trait-environmental variable pairs with significant genome-wide GxE heritability (p < 0.05/200 correcting for the number of trait-E pairs tested) with an average ratio of GxE to additive heritability ≈ 6.8% that include BMI with smoking (ratio of GxE to additive heritability = 6.3 ± 1.1%), WHR (waist-to-hip ratio adjusted for BMI) with sex (ratio = 19.6 ± 2%), LDL cholesterol with age (ratio = 9.8 ± 3.9%), and HbA1c with statin usage (ratio = 11 ± 2%). Analyzing nearly 8 million common and low-frequency imputed SNPs (MAF ≥ 0.1%), we document an increase in genome-wide GxE heritability of about 28% on average over array SNPs.
We partitioned GxE heritability across minor allele frequency (MAF) and local linkage disequilibrium values (LD score) of each SNP to observe that analogous to the relationship that has been observed for additive allelic effects, the magnitude of GxE allelic effects tends to increase with decreasing MAF and LD. Testing whether GxE heritability is enriched around genes that are highly expressed in specific tissues, we find significant tissue-specific enrichments that include brain-specific enrichment for BMI and Basal Metabolic Rate in the context of smoking, adipose-specific enrichment for WHR in the context of sex, and cardiovascular tissue-specific enrichment for total cholesterol in the context of age. Our analyses provide detailed insights into the architecture of GxE underlying complex traits.

6.
Ann Stat ; 50(2): 949-986, 2022 Apr.
Article in English | MEDLINE | ID: mdl-36120512

ABSTRACT

Interpolators (estimators that achieve zero training error) have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum $\ell_2$ norm ("ridgeless") interpolation least squares regression, focusing on the high-dimensional regime in which the number of unknown parameters $p$ is of the same order as the number of samples $n$. We consider two different models for the feature distribution: a linear model, where the feature vectors $x_i \in \mathbb{R}^p$ are obtained by applying a linear transform to a vector of i.i.d. entries, $x_i = \Sigma^{1/2} z_i$ (with $z_i \in \mathbb{R}^p$); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, $x_i = \varphi(W z_i)$ (with $z_i \in \mathbb{R}^d$, $W \in \mathbb{R}^{p \times d}$ a matrix of i.i.d. entries, and $\varphi$ an activation function acting componentwise on $W z_i$). We recover, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization.
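
Note on entry 6: for readers unfamiliar with the term, the minimum-norm ("ridgeless") least squares estimator can be written out explicitly. The notation below is assumed for illustration and does not appear in the abstract: X is the n × p matrix whose rows are the feature vectors, y is the vector of n responses, and the superscript + denotes the Moore-Penrose pseudoinverse.

```latex
% Minimum l2-norm least squares solution, and its equivalent forms
\hat{\beta}
  = \arg\min_{b} \left\{ \|b\|_2 : b \text{ minimizes } \|y - Xb\|_2^2 \right\}
  = (X^\top X)^{+} X^\top y
  = \lim_{\lambda \to 0^{+}} (X^\top X + \lambda I)^{-1} X^\top y .

% When p > n and X has full row rank, the estimator interpolates the data:
\hat{\beta} = X^\top (X X^\top)^{-1} y , \qquad X \hat{\beta} = y .
```

The limit form is what motivates the name "ridgeless": it is ridge regression with the penalty taken to zero.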

7.
Commun Biol ; 5(1): 285, 2022 03 29.
Article in English | MEDLINE | ID: mdl-35351970

ABSTRACT

We build statistical models to describe the substitution process in SARS-CoV-2 as a function of explanatory factors describing the sequence, its function, and more. These models serve two different purposes: first, to gain knowledge about the evolutionary biology of the virus; and second, to predict future mutations in the virus, in particular, non-synonymous amino acid substitutions creating new variants. We use tens of thousands of publicly available SARS-CoV-2 sequences and consider tens of thousands of candidate models. Through a careful validation process, we confirm that our chosen models are indeed able to predict new amino acid substitutions: candidates ranked high by our model are eight times more likely to occur than random amino acid changes. We also show that named variants were highly ranked by our models before their appearance, emphasizing the value of our models for identifying likely variants and potentially utilizing this knowledge in vaccine design and other aspects of the ongoing battle against COVID-19.


Subjects
COVID-19 , SARS-CoV-2 , Amino Acid Substitution , COVID-19/genetics , Humans , Statistical Models , Missense Mutation , SARS-CoV-2/genetics
8.
Mol Biol Evol ; 37(11): 3338-3352, 2020 11 01.
Article in English | MEDLINE | ID: mdl-32585030

ABSTRACT

Statistical criteria have long been the standard for selecting the best model for phylogenetic reconstruction and downstream statistical inference. Although model selection is regarded as a fundamental step in phylogenetics, existing methods for this task are computationally demanding and slow, are not always feasible, and sometimes depend on preliminary assumptions that do not hold for sequence data. Moreover, although these methods are dedicated to revealing the processes that underlie the sequence data, they do not always produce the most accurate trees. Notably, phylogeny reconstruction consists of two related tasks, topology reconstruction and branch-length estimation. It was previously shown that in many cases the most complex model, GTR+I+G, leads to topologies that are as accurate as using existing model selection criteria, but overestimates branch lengths. Here, we present ModelTeller, a computational methodology for phylogenetic model selection, devised within the machine-learning framework, optimized to predict the most accurate nucleotide substitution model for branch-length estimation. We demonstrate that ModelTeller leads to more accurate branch-length inference than current model selection criteria on data sets simulated under realistic processes. ModelTeller relies on a readily implemented machine-learning model and thus the prediction according to features extracted from the sequence data results in a substantial decrease in running time compared with existing strategies. By harnessing the machine-learning framework, we distinguish between features that mostly contribute to branch-length optimization, concerning the extent of sequence divergence, and features that are related to estimates of the model parameters that are important for the selection made by current criteria.


Subjects
Machine Learning , Genetic Models , Phylogeny
9.
Nat Commun ; 10(1): 3417, 2019 07 31.
Article in English | MEDLINE | ID: mdl-31366909

ABSTRACT

High costs and technical limitations of cell sorting and single-cell techniques currently restrict the collection of large-scale, cell-type-specific DNA methylation data. This, in turn, impedes our ability to tackle key biological questions that pertain to variation within a population, such as identification of disease-associated genes at a cell-type-specific resolution. Here, we show mathematically and empirically that cell-type-specific methylation levels of an individual can be learned from its tissue-level bulk data, conceptually emulating the case where the individual has been profiled with a single-cell resolution and then signals were aggregated in each cell population separately. Provided with this unprecedented way to perform powerful large-scale epigenetic studies with cell-type-specific resolution, we revisit previous studies with tissue-level bulk methylation and reveal novel associations with leukocyte composition in blood and with rheumatoid arthritis. For the latter, we further show consistency with validation data collected from sorted leukocyte sub-types.


Subjects
Cell Separation/methods , Computational Biology/methods , DNA Methylation/genetics , Genetic Epigenesis/genetics , Single-Cell Analysis/methods , Rheumatoid Arthritis/blood , CpG Islands/genetics , Humans , Leukocyte Count , Leukocytes/classification , Leukocytes/cytology
10.
Int J Obes (Lond) ; 43(4): 800-807, 2019 04.
Article in English | MEDLINE | ID: mdl-30647453

ABSTRACT

BACKGROUND: The metabolic syndrome (MetS) is associated with overweight and abdominal obesity. Our aim was to use longitudinal measurements to provide clinically relevant information on the relative influence of changes in body mass index (BMI), waist circumference (WC), and weekly physical exercise duration on the development of each of the MetS components. METHODS: We analyzed data collected at the Tel-Aviv Medical Center Inflammation Survey (TAMCIS). Apparently healthy individuals with two consecutive visits who were not treated for any metabolic criteria were included in this study. We analyzed the influence of changes in BMI, WC, and time engaged in physical exercise on the change in each of the components of the metabolic syndrome using linear regressions. RESULTS: Included were 7532 individuals (5431 men, 2101 women) with 2 years of follow-up. Among participants who gained two BMI points, the mean number of criteria increased from 1.07 to 1.52, while among participants who lost two BMI points, it decreased from 1.64 to 1.16. A long-term analysis over 5 years showed similar results. Furthermore, an increase in WC was independently associated with increased severity of each of the other components, when controlling for increase in BMI. Increase in weekly exercise duration had a small but statistically significant favorable effect on blood triglycerides and HDL levels, but not on blood pressure or HbA1c. CONCLUSIONS: Changes in BMI and WC are strongly associated with the likelihood and severity of the MetS independently of the baseline levels, suggesting that obese individuals can substantially improve their MetS prognosis by losing both body weight and abdominal fat.


Subjects
Inflammation/complications , Metabolic Syndrome/etiology , Abdominal Obesity/complications , Weight Gain/physiology , Adult , Body Mass Index , Female , Follow-Up Studies , Health Surveys , Humans , Inflammation/epidemiology , Inflammation/physiopathology , Israel/epidemiology , Male , Metabolic Syndrome/epidemiology , Metabolic Syndrome/physiopathology , Middle Aged , Abdominal Obesity/epidemiology , Abdominal Obesity/physiopathology , Risk Factors , Waist Circumference/physiology
11.
Nat Commun ; 9(1): 4919, 2018 11 21.
Article in English | MEDLINE | ID: mdl-30464216

ABSTRACT

Testing for association between a set of genetic markers and a phenotype is a fundamental task in genetic studies. Standard approaches for heritability and set testing strongly rely on parametric models that make specific assumptions regarding phenotypic variability. Here, we show that resulting p-values may be inflated by up to 15 orders of magnitude, in a heritability study of methylation measurements, and in a heritability and expression quantitative trait loci analysis of gene expression profiles. We propose FEATHER, a method for fast permutation-based testing of marker sets and of heritability, which properly controls for false-positive results. FEATHER eliminated 47% of methylation sites found to be heritable by the parametric test, suggesting a substantial inflation of false-positive findings by alternative methods. Our approach can rapidly identify heritable phenotypes out of millions of phenotypes acquired via high-throughput technologies, does not suffer from model misspecification and is highly efficient.


Subjects
Genetic Techniques , Heritable Quantitative Trait , Statistics as Topic , DNA Methylation , Gene Expression , Phenotype
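
Note on entry 11: the abstract contrasts parametric tests with permutation-based testing. The sketch below is a plain Monte Carlo permutation test for one marker and one phenotype, included only to show the general idea of calibrating a statistic without distributional assumptions. It is not FEATHER and makes no attempt at FEATHER's speed; the function name, the covariance-based statistic, and the default of 999 permutations are assumptions chosen for illustration.

```python
import random


def permutation_p_value(genotypes, phenotypes, n_perm=999, seed=0):
    """Monte Carlo permutation p-value for marker-phenotype association.

    The statistic is the absolute centered cross-product of genotype and
    phenotype, which is proportional to the absolute sample covariance.
    """
    def statistic(g, y):
        n = len(y)
        mean_g = sum(g) / n
        mean_y = sum(y) / n
        return abs(sum((a - mean_g) * (b - mean_y) for a, b in zip(g, y)))

    rng = random.Random(seed)
    observed = statistic(genotypes, phenotypes)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        # Shuffling phenotypes breaks any genotype-phenotype link (the null).
        rng.shuffle(shuffled)
        if statistic(genotypes, shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement, so the p-value is never 0.
    return (hits + 1) / (n_perm + 1)
```

The smallest attainable p-value is 1 / (n_perm + 1), which is why naive permutation becomes expensive at genome-wide significance levels and why faster schemes are of interest.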
12.
BMC Genomics ; 19(1): 759, 2018 Oct 19.
Article in English | MEDLINE | ID: mdl-30340456

ABSTRACT

BACKGROUND: We study Phylotree, a comprehensive representation of the phylogeny of global human mitochondrial DNA (mtDNA) variations, to better understand the mtDNA substitution mechanism and its most influential factors. We consider a substitution model, where a set of genetic features may predict the rate at which mtDNA substitutions occur. To find an appropriate model, an exhaustive analysis on the effect of multiple factors on the substitution rate is performed through Negative Binomial and Poisson regressions. We examine three different inclusion options for each categorical factor: omission, inclusion as an explanatory variable, and by-value partitioning. The examined factors include genes, codon position, a CpG indicator, directionality, nucleotide, amino acid, codon, and context (neighboring nucleotides), in addition to other site based factors. Partitioning a model by a factor's value results in several sub-models (one for each value), where the likelihoods of the sub-models can be combined to form a score for the entire model. Eventually, the leading models are considered as viable candidates for explaining mtDNA substitution rates. RESULTS: Initially, we introduce a novel clustering technique on genes, based on three similarity tests between pairs of genes, supporting previous results regarding gene functionalities in the mtDNA. These clusters are then used as a factor in our models. We present leading models for the protein coding genes, rRNA and tRNA genes and the control region, showing it is disadvantageous to separate the models of transitions/transversions, or synonymous/non-synonymous substitutions. We identify a context effect that cannot be attributed solely to protein level constraints or CpG pairs. 
For protein-coding genes, we show that the substitution model should be partitioned into sub-models according to the codon position and input codon; additionally we confirm that gene identity and cluster have no significant effect once the above factors are accounted for. CONCLUSIONS: We leverage the large, high-confidence Phylotree mtDNA phylogeny to develop a new statistical approach. We model the substitution rates using regressions, allowing consideration of many factors simultaneously. This admits the use of model selection tools helping to identify the set of factors best explaining the mutational dynamics when considered in tandem.


Subjects
Big Data , Mitochondrial DNA/genetics , Statistical Models , Algorithms , Data Mining , Humans , Poisson Distribution , Ribosomal RNA/genetics , Transfer RNA/genetics , Regression Analysis
13.
Am J Hum Genet ; 103(1): 89-99, 2018 07 05.
Article in English | MEDLINE | ID: mdl-29979983

ABSTRACT

Methods that estimate SNP-based heritability and genetic correlations from genome-wide association studies have proven to be powerful tools for investigating the genetic architecture of common diseases and exposing unexpected relationships between disorders. Many relevant studies employ a case-control design, yet most methods are primarily geared toward analyzing quantitative traits. Here we investigate the validity of three common methods for estimating SNP-based heritability and genetic correlation between diseases. We find that the phenotype-correlation-genotype-correlation (PCGC) approach is the only method that can estimate both quantities accurately in the presence of important non-genetic risk factors, such as age and sex. We extend PCGC to work with arbitrary genetic architectures and with summary statistics that take the case-control sampling into account, and we demonstrate that our new method, PCGC-s, accurately estimates both SNP-based heritability and genetic correlations and can be applied to large datasets without requiring individual-level genotypic or phenotypic information. Finally, we use PCGC-s to estimate the genetic correlation between schizophrenia and bipolar disorder and demonstrate that previous estimates are biased, partially due to incorrect handling of sex as a strong risk factor.


Subjects
Disease/genetics , Single Nucleotide Polymorphism/genetics , Case-Control Studies , Genetic Association Studies/methods , Genome-Wide Association Study/methods , Genotype , Humans , Genetic Models , Phenotype
14.
J Comput Biol ; 25(7): 794-808, 2018 07.
Article in English | MEDLINE | ID: mdl-29932739

ABSTRACT

Estimation of heritability is an important task in genetics. The use of linear mixed models (LMMs) to determine narrow-sense single-nucleotide polymorphism (SNP)-heritability and related quantities has received much recent attention, due to its ability to account for variants with small effect sizes. Typically, heritability estimation under LMMs uses the restricted maximum likelihood (REML) approach. The common way to report the uncertainty in REML estimation uses standard errors (SEs), which rely on asymptotic properties. However, these assumptions are often violated because of the bounded parameter space, statistical dependencies, and limited sample size, leading to biased estimates and inflated or deflated confidence intervals (CIs). In addition, for larger data sets (e.g., tens of thousands of individuals), the construction of SEs itself may require considerable time, as it requires expensive matrix inversions and multiplications. Here, we present FIESTA (Fast confidence IntErvals using STochastic Approximation), a method for constructing accurate CIs. FIESTA is based on parametric bootstrap sampling, and, therefore, avoids unjustified assumptions on the distribution of the heritability estimator. FIESTA uses stochastic approximation techniques, which accelerate the construction of CIs by several orders of magnitude, compared with previous approaches as well as with the analytical approximation used by SEs. FIESTA builds accurate CIs rapidly, for example, requiring only several seconds for data sets of tens of thousands of individuals, making FIESTA a very fast solution to the problem of building accurate CIs for heritability for all data set sizes.


Subjects
Genome-Wide Association Study/statistics & numerical data , Statistical Models , Quantitative Trait Loci/genetics , Computer Simulation , Genotype , Humans , Phenotype , Single Nucleotide Polymorphism/genetics , Software
15.
Nephrol Dial Transplant ; 33(2): 323-330, 2018 02 01.
Article in English | MEDLINE | ID: mdl-28339911

ABSTRACT

Background: Inheritance of apolipoprotein L1 gene (APOL1) renal-risk variants in a recessive pattern strongly associates with non-diabetic end-stage kidney disease (ESKD). Further evidence supports risk modifiers in APOL1-associated nephropathy; some studies demonstrate that heterozygotes possess excess risk for ESKD or show earlier age at ESKD, relative to those with zero risk alleles. Nearby loci are also associated with ESKD in non-African Americans. Methods: We assessed the role of the APOL3 null allele rs11089781 on risk of non-diabetic ESKD. Four cohorts containing 2781 ESKD cases and 2474 controls were analyzed. Results: Stratifying by APOL1 risk genotype (recessive) and adjusting for African ancestry identified a significant additive association between rs11089781 and ESKD in each stratum and in a meta-analysis [meta-analysis P = 0.0070; odds ratio (OR) = 1.29]; ORs were consistent across APOL1 risk strata. The biological significance of this association is supported by the finding that the APOL3 gene is co-regulated with APOL1, and that APOL3 protein was able to bind to APOL1 protein. Conclusions: Taken together, the genetic and biological data support the concept that other APOL proteins besides APOL1 may also influence the risk of non-diabetic ESKD.


Subjects
Apolipoproteins L/genetics , Genetic Predisposition to Disease , Glomerulonephritis/genetics , Focal Segmental Glomerulosclerosis/genetics , Chronic Kidney Failure/genetics , Single Nucleotide Polymorphism , Case-Control Studies , Genotype , Humans , Meta-Analysis as Topic , Prognosis
16.
Neuroimage ; 164: 112-120, 2018 01 01.
Article in English | MEDLINE | ID: mdl-28274834

ABSTRACT

The cortical layers are a fingerprint of brain development, function, connectivity and pathology. The formation of the layers and their composition is essential to cognition and behavior. The layers were traditionally measured by histological means, but recent studies utilizing MRI suggested that T1 relaxation imaging provides enough contrast to separate the layers. Indeed, extreme-resolution post-mortem studies demonstrated this phenomenon. Yet, one of the limiting factors of using T1 MRI to visualize the layers in neuroimaging research is the partial volume effect. This happens when the image resolution is not high enough and two or more layers reside within the same voxel. In this paper we demonstrate that, owing to the small physical thickness of the layers, it is highly unlikely that high-resolution imaging could resolve the layers. By contrast, we suggest that low-resolution multi-T1 mapping in conjunction with composition analysis could provide a practical means of measuring the T1 layers. We suggest an acquisition platform that is clinically feasible and could quantify measures of the layers. The key feature of the suggested platform is that separation of the layers is better achieved in the T1 relaxation domain than in the spatial image domain.


Subjects
Brain Mapping/methods , Cerebral Cortex/diagnostic imaging , Computer-Assisted Image Processing/methods , Magnetic Resonance Imaging/methods , Adult , Animals , Female , Humans , Male , Rats
17.
Genetics ; 207(4): 1275-1283, 2017 12.
Article in English | MEDLINE | ID: mdl-29025915

ABSTRACT

Testing for the existence of variance components in linear mixed models is a fundamental task in many applied fields. In statistical genetics, the score test has recently become instrumental in the task of testing an association between a set of genetic markers and a phenotype. With few markers, this amounts to set-based variance component tests, which attempt to increase power in association studies by aggregating weak individual effects. When the entire genome is considered, it allows testing for the heritability of a phenotype, defined as the proportion of phenotypic variance explained by genetics. In the popular score-based Sequence Kernel Association Test (SKAT) method, the assumed distribution of the score test statistic is uncalibrated in small samples, with a correction being computationally expensive. This may cause severe inflation or deflation of P-values, even when the null hypothesis is true. Here, we characterize the conditions under which this discrepancy holds, and show it may occur also in large real datasets, such as a dataset from the Wellcome Trust Case Control Consortium 2 (n = 13,950) study, and, in particular, when the individuals in the sample are unrelated. In these cases, the SKAT approximation tends to be highly overconservative and therefore underpowered. To address this limitation, we suggest an efficient method to calculate exact P-values for the score test in the case of a single variance component and a continuous response vector, which can speed up the analysis by orders of magnitude. Our results enable fast and accurate application of the score test in heritability and in set-based association tests. Our method is available at http://github.com/cozygene/RL-SKAT.


Subjects
Genetic Association Studies/statistics & numerical data , Genetic Markers , Genetic Variation , Genome/genetics , Algorithms , Computer Simulation , Humans , Genetic Models , Phenotype , Single Nucleotide Polymorphism/genetics , Software
18.
Bioinformatics ; 33(14): i325-i332, 2017 Jul 15.
Article in English | MEDLINE | ID: mdl-28881982

ABSTRACT

MOTIVATION: Epigenome-wide association studies can provide novel insights into the regulation of genes involved in traits and diseases. The rapid emergence of bisulfite-sequencing technologies enables performing such genome-wide studies at the resolution of single nucleotides. However, analysis of data produced by bisulfite-sequencing poses statistical challenges owing to low and uneven sequencing depth, as well as the presence of confounding factors. The recently introduced Mixed model Association for Count data via data AUgmentation (MACAU) can address these challenges via a generalized linear mixed model when confounding can be encoded via a single variance component. However, MACAU cannot be used in the presence of multiple variance components. Additionally, MACAU uses a computationally expensive Markov Chain Monte Carlo (MCMC) procedure, which cannot directly approximate the model likelihood. RESULTS: We present a new method, Mixed model Association via a Laplace ApproXimation (MALAX), that is more computationally efficient than MACAU and allows modeling of multiple variance components. MALAX uses a Laplace approximation rather than MCMC-based approximations, which enables direct approximation of the model likelihood. Through an extensive analysis of simulated and real data, we demonstrate that MALAX successfully addresses statistical challenges introduced by bisulfite-sequencing while controlling for complex sources of confounding, and can be over 50% faster than the state of the art. AVAILABILITY AND IMPLEMENTATION: The full source code of MALAX is available at https://github.com/omerwe/MALAX. CONTACT: omerw@cs.technion.ac.il or ehalperin@cs.ucla.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
DNA Methylation , Epigenomics/methods , DNA Sequence Analysis/methods , Software , Humans , Markov Chains , Monte Carlo Method , Sulfites
19.
IEEE Trans Pattern Anal Mach Intell ; 39(11): 2142-2153, 2017 11.
Article in English | MEDLINE | ID: mdl-28114007

ABSTRACT

Recursive partitioning methods producing tree-like models are a long-standing staple of predictive modeling. However, a fundamental flaw in the partitioning (or splitting) rule of commonly used tree building methods precludes them from treating different types of variables equally. This most clearly manifests in these methods' inability to properly utilize categorical variables with a large number of categories, which are ubiquitous in the new age of big data. We propose a framework for splitting that uses leave-one-out (LOO) cross validation (CV) to select the splitting variable and then performs a regular split (in our case, following CART's approach) on the selected variable. The most important consequence of our approach is that categorical variables with many categories can be safely used in tree building and are only chosen if they contribute to predictive power. We demonstrate in extensive simulation and real data analysis that our splitting approach significantly improves the performance of both single tree models and ensemble methods that utilize trees. Importantly, we design an algorithm for LOO splitting variable selection which under reasonable assumptions does not substantially increase the overall computational complexity compared to CART for two-class classification.
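
Note on entry 19: the flaw described above can be seen in a toy setting. A categorical variable with one category per observation fits the training data perfectly in-sample, yet predicts nothing out of sample. The sketch below scores each candidate variable by leave-one-out squared error using per-category means. This is a deliberate simplification and not the paper's procedure, which evaluates CART-style splits; the function names and the fallback rule for singleton categories are assumptions made for illustration.

```python
from collections import defaultdict


def loo_squared_error(categories, y):
    """Leave-one-out squared error when predicting y by its category mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, v in zip(categories, y):
        sums[c] += v
        counts[c] += 1
    total, n = sum(y), len(y)
    err = 0.0
    for c, v in zip(categories, y):
        if counts[c] > 1:
            # Mean of the other observations in the same category.
            pred = (sums[c] - v) / (counts[c] - 1)
        else:
            # Singleton category: fall back to the mean of all other points.
            pred = (total - v) / (n - 1)
        err += (v - pred) ** 2
    return err


def pick_splitting_variable(columns, y):
    """Return the name of the column with the lowest LOO squared error."""
    return min(columns, key=lambda name: loo_squared_error(columns[name], y))
```

For example, with `y = [1, 1, 1, 5, 5, 5]`, a two-level variable aligned with the two groups has zero LOO error, while an ID-like variable with six distinct levels is penalized even though it would achieve zero in-sample error.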

20.
Mitochondrial DNA A DNA Mapp Seq Anal ; 28(2): 250-253, 2017 03.
Article in English | MEDLINE | ID: mdl-26713725

ABSTRACT

The mitochondrial DNA (mtDNA) control region is a highly variable segment that contains functional elements that control mtDNA transcription and replication. By analysis of the polymorphic nucleotide spectrum of that segment, we aimed to identify the most conserved sites that should be associated with these elements. For that aim, we analyzed 50,033 human mtDNA control region sequences (mtDNA positions 16066-16374). We identified 10 conserved tri-nucleotides, one conserved tetra-nucleotide, and one conserved penta-nucleotide, containing six repetitions of the motif CAT, and two of its complement motif ATG (p value < $2 \times 10^{-4}$). Three other appearances of the tri-nucleotide CAT were almost perfectly preserved. The positions of the preserved CAT elements are associated with the location of previously identified termination-associated sequences (TAS) which are the binding locations for proteins involved in mtDNA replication. We, therefore, hypothesize that the CAT tri-nucleotide elements within the control region may be the binding sites for TAS proteins and are directly involved in mtDNA transcription and replication.


Subjects
Mitochondrial DNA/genetics , Mitochondrial DNA/metabolism , DNA-Binding Proteins/metabolism , Base Sequence , Binding Sites , Conserved Sequence , DNA Replication , Mitochondrial DNA/chemistry , Mitochondrial Genome , Humans , Mitochondria/chemistry , Mitochondria/genetics , Mitochondria/metabolism , Protein Binding , Trinucleotide Repeats