ABSTRACT
Mendelian randomization (MR) utilizes genome-wide association study (GWAS) summary data to infer causal relationships between exposures and outcomes, offering a valuable tool for identifying disease risk factors. Multivariable MR (MVMR) estimates the direct effects of multiple exposures on an outcome. This study tackles the issue of highly correlated exposures commonly observed in metabolomic data, a situation where existing MVMR methods often face reduced statistical power due to multicollinearity. We propose a robust extension of the MVMR framework that leverages constrained maximum likelihood (cML) and employs a Bayesian approach for identifying independent clusters of exposure signals. Applying our method to the UK Biobank metabolomic data for the largest Alzheimer disease (AD) cohort through a two-sample MR approach, we identified two independent signal clusters for AD: glutamine and lipids, with posterior inclusion probabilities (PIPs) of 95.0% and 81.5%, respectively. Our findings corroborate the hypothesized roles of glutamate and lipids in AD, providing quantitative support for their potential involvement.
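For context, the standard two-sample MVMR setup that such methods extend (a schematic, not the authors' exact cML formulation) regresses SNP-outcome associations jointly on SNP-exposure associations:

\[ \hat{\Gamma}_j = \sum_{k=1}^{K} \beta_k\, \hat{\gamma}_{jk} + \epsilon_j, \qquad j = 1, \dots, J, \]

where \(\hat{\Gamma}_j\) is the estimated association of instrument \(j\) with the outcome, \(\hat{\gamma}_{jk}\) its association with exposure \(k\), and \(\beta_k\) the direct effect of exposure \(k\). When exposures are highly correlated, the columns of the exposure-association matrix are nearly collinear, inflating the variance of the estimated direct effects; this is the multicollinearity problem the Bayesian clustering step is designed to mitigate.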
Subjects
Alzheimer Disease, Bayes Theorem, Genome-Wide Association Study, Mendelian Randomization Analysis, Metabolomics, Humans, Alzheimer Disease/genetics, Metabolomics/methods, Single Nucleotide Polymorphism, Glutamine/metabolism, Glutamine/genetics, Lipids/blood, Lipids/genetics
ABSTRACT
Most genome-wide association studies are based on case-control designs, which provide abundant resources for secondary phenotype analyses. However, such studies suffer from biased sampling of primary phenotypes, and traditional statistical methods can yield seriously distorted results when applied to secondary phenotypes without accounting for the biased sampling mechanism. To our knowledge, there are no statistical methods specifically tailored for rare variant association analysis with secondary phenotypes. In this article, we propose two novel joint test statistics for identifying secondary-phenotype-associated rare variants, based on a prospective likelihood and a retrospective likelihood, respectively. We also exploit the assumption of gene-environment independence in the retrospective likelihood to improve statistical power, and adopt a two-step strategy to balance power and robustness. Simulations and a real-data application demonstrate the superior performance of our proposed methods.
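To see why a naive analysis is biased (a standard decomposition, not the authors' specific test statistics): if ascertainment \(S\) depends only on the primary phenotype \(D\), then among sampled subjects the joint distribution of a secondary phenotype \(Y\) and genotype \(G\) is

\[ P(Y, G \mid S = 1) = \frac{\sum_{d} P(S = 1 \mid D = d)\, P(D = d \mid Y, G)\, P(Y \mid G)\, P(G)}{P(S = 1)}, \]

which differs from \(P(Y \mid G)\, P(G)\) whenever \(Y\) or \(G\) is associated with \(D\); the proposed prospective and retrospective likelihoods model this sampling mechanism explicitly.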
ABSTRACT
We have recently introduced MAPLE (MAximum Parsimonious Likelihood Estimation), a new pandemic-scale phylogenetic inference method exclusively designed for genomic epidemiology. In response to the need for enhancing MAPLE's performance and scalability, here we present two key components: (i) CMAPLE software, a highly optimized C++ reimplementation of MAPLE with many new features and advancements, and (ii) CMAPLE library, a suite of application programming interfaces to facilitate the integration of the CMAPLE algorithm into existing phylogenetic inference packages. Notably, we have successfully integrated CMAPLE into the widely used IQ-TREE 2 software, enabling its rapid adoption in the scientific community. These advancements serve as a vital step toward better preparedness for future pandemics, offering researchers powerful tools for large-scale pathogen genomic analysis.
Subjects
Phylogeny, Software, Algorithms, Pandemics, Likelihood Functions, Humans
ABSTRACT
Profile mixture models capture distinct biochemical constraints on the amino acid substitution process at different sites in proteins. These models feature a mixture of time-reversible models with a common matrix of exchangeabilities and distinct sets of equilibrium amino acid frequencies known as profiles. Combining the exchangeability matrix with each profile generates the matrix of instantaneous rates of amino acid exchange for that profile. Currently, empirically estimated exchangeability matrices (e.g. the LG matrix) are widely used for phylogenetic inference under profile mixture models. However, these were estimated using a single profile and are unlikely to be optimal for profile mixture models. Here, we describe the GTRpmix model, which allows maximum likelihood estimation of a common exchangeability matrix under any profile mixture model. We show that exchangeability matrices estimated under profile mixture models differ from the LG matrix, dramatically improving model fit and topological estimation accuracy for empirical test cases. Because the GTRpmix model is computationally expensive, we provide two exchangeability matrices estimated from large concatenated phylogenomic supermatrices to be used for phylogenetic analyses. One, called Eukaryotic Linked Mixture (ELM), is designed for phylogenetic analysis of proteins encoded by nuclear genomes of eukaryotes, and the other, Eukaryotic and Archaeal Linked Mixture (EAL), for reconstructing relationships between eukaryotes and Archaea. These matrices, combined with profile mixture models, fit data better and improve topology estimation relative to the LG matrix combined with the same mixture models. Starting with version 2.3.1, IQ-TREE 2 allows users to estimate linked exchangeabilities (i.e. amino acid exchange rates) under profile mixture models.
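In the notation usually used for these models, a shared symmetric exchangeability matrix \(R = (r_{ij})\) combines with each frequency profile \(\pi^{(c)}\) to give a class-specific rate matrix, and the likelihood at site \(s\) mixes over classes:

\[ Q^{(c)}_{ij} = r_{ij}\, \pi^{(c)}_j \quad (i \neq j), \qquad L_s = \sum_{c} w_c\, L_s\big(Q^{(c)}\big), \]

with the diagonals of \(Q^{(c)}\) set so rows sum to zero and \(w_c\) the mixture weights. GTRpmix estimates the shared \(r_{ij}\) by maximum likelihood under the full mixture instead of fixing them to values, such as LG's, estimated under a single profile.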
Subjects
Genetic Models, Phylogeny, Archaea/genetics, Likelihood Functions, Amino Acid Substitution, Molecular Evolution, Eukaryota/genetics
ABSTRACT
Tree tests like the Kishino-Hasegawa (KH) test and the chi-square test suffer from a selection bias that tests like the Shimodaira-Hasegawa (SH) test and the approximately unbiased (AU) test were intended to correct. We investigate tree-testing performance in the presence of severe selection bias. The SH test is found to be very conservative and, surprisingly, its uncorrected analog, the KH test, has low Type I error even in the presence of extreme selection bias, leading to a recommendation that the SH test be abandoned. A chi-square test is found to usually behave well but to require correction in extreme cases. We show how topology-testing procedures can be used to obtain support values for splits and compare the likelihood-based support values to the approximate likelihood ratio test (aLRT) support values. We find that the aLRT support values are reasonable even in settings with severe selection bias that they were not designed for. We also show how they can be used to construct tests of topologies and, in doing so, point out a multiple-comparisons issue that should be considered when looking at support values for splits.
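For readers unfamiliar with the mechanics, a minimal sketch of a RELL-style KH test, assuming per-site log-likelihoods for the two topologies have already been computed by a phylogenetics package (names are illustrative):

```python
import numpy as np

def kh_test(site_lnl_a, site_lnl_b, n_boot=10_000, seed=0):
    """Two-sided KH test via RELL bootstrap of per-site log-likelihood
    differences, centered to impose the null of equal expected fit."""
    d = np.asarray(site_lnl_a) - np.asarray(site_lnl_b)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, d.size, size=(n_boot, d.size))  # resample sites
    boot = d[idx].sum(axis=1)
    boot -= boot.mean()                                   # center under H0
    return float(np.mean(np.abs(boot) >= abs(d.sum())))   # two-sided p-value
```

Applying such a test to every split of a tree is exactly where the multiple-comparisons issue noted above arises, since many correlated p-values are produced at once.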
Subjects
Likelihood Functions, Phylogeny, Selection Bias
ABSTRACT
Recommendations from the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP) for interpreting sequence variants specify the use of computational predictors as "supporting" level of evidence for pathogenicity or benignity using criteria PP3 and BP4, respectively. However, score intervals defined by tool developers, and ACMG/AMP recommendations that require the consensus of multiple predictors, lack quantitative support. Previously, we described a probabilistic framework that quantified the strengths of evidence (supporting, moderate, strong, very strong) within ACMG/AMP recommendations. We have extended this framework to computational predictors and introduce a new standard that converts a tool's scores to PP3 and BP4 evidence strengths. Our approach is based on estimating the local positive predictive value and can calibrate any computational tool or other continuous-scale evidence on any variant type. We estimate thresholds (score intervals) corresponding to each strength of evidence for pathogenicity and benignity for thirteen missense variant interpretation tools, using carefully assembled independent data sets. Most tools achieved supporting evidence level for both pathogenic and benign classification using newly established thresholds. Multiple tools reached score thresholds justifying moderate and several reached strong evidence levels. One tool reached very strong evidence level for benign classification on some variants. Based on these findings, we provide recommendations for evidence-based revisions of the PP3 and BP4 ACMG/AMP criteria using individual tools and future assessment of computational methods for clinical interpretation.
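A sketch of the score-to-evidence conversion, assuming the likelihood-ratio strength thresholds of the Tavtigian et al. point-based framework (the study calibrates intervals per tool from independent data; the values and names here are illustrative):

```python
# LR+ thresholds implied by anchoring "very strong" evidence at odds 350
# and halving the exponent at each weaker strength level (an assumption).
LR_THRESHOLDS = [("supporting", 350 ** 0.125), ("moderate", 350 ** 0.25),
                 ("strong", 350 ** 0.5), ("very_strong", 350.0)]

def pp3_strength(local_ppv: float, prior: float):
    """Convert a tool's local positive predictive value into a positive
    likelihood ratio and the strongest evidence level it justifies."""
    posterior_odds = local_ppv / (1.0 - local_ppv)
    prior_odds = prior / (1.0 - prior)
    lr = posterior_odds / prior_odds
    met = [name for name, t in LR_THRESHOLDS if lr >= t]
    return lr, (met[-1] if met else None)
```

The same mapping applies to benign evidence (BP4) with the roles of pathogenic and benign reversed.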
Subjects
Calibration, Humans, Consensus, Educational Status, Virulence
ABSTRACT
The use of social contact rates is widespread in infectious disease modeling, since they have been shown to be key driving forces of important epidemiological parameters. Quantification of contact patterns is crucial to parameterize dynamic transmission models and to provide insights on the (basic) reproduction number. Information on social interactions can be obtained from population-based contact surveys, such as the European Commission project POLYMOD. Estimation of age-specific contact rates from these studies is often done using a piecewise constant approach or bivariate smoothing techniques. For the latter, smoothness is typically introduced in the dimensions of the respondent's and contact's age (i.e., the rows and columns of the social contact matrix). We propose a constrained smoothing approach that takes into account the reciprocal nature of contacts and introduces smoothness over the diagonal (including all subdiagonals) of the social contact matrix. This modeling approach is justified by the assumption that contact behavior changes smoothly as people age; we call this smoothing from a cohort perspective. Two approaches that allow for smoothing over social contact matrix diagonals are proposed: (i) reordering of the diagonal components of the contact matrix and (ii) reordering of the penalty matrix to ensure smoothness over the contact matrix diagonals. Parameter estimation is done in the likelihood framework using constrained penalized iteratively reweighted least squares. A simulation study underlines the benefits of cohort-based smoothing. Finally, the proposed methods are illustrated on the Belgian POLYMOD data of 2006. Code to reproduce the results of the article can be downloaded from the GitHub repository https://github.com/oswaldogressani/Cohort_smoothing.
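A sketch of idea (ii), building a penalty that differences along contact-matrix diagonals rather than along rows or columns (illustrative only; the estimator described above additionally imposes reciprocity constraints and fits by constrained penalized iteratively reweighted least squares):

```python
import numpy as np

def diagonal_penalty(A: int, order: int = 2) -> np.ndarray:
    """Penalty P = D'D on order-th differences taken along every diagonal
    (the cohort direction) of a vectorized A x A social contact matrix."""
    idx = np.arange(A * A).reshape(A, A)
    blocks = []
    for k in range(-(A - 1), A):                # all sub/superdiagonals
        diag = np.diagonal(idx, offset=k)       # cell indices on one diagonal
        if diag.size <= order:
            continue
        D = np.diff(np.eye(diag.size), n=order, axis=0)  # difference operator
        block = np.zeros((D.shape[0], A * A))
        block[:, diag] = D
        blocks.append(block)
    Dfull = np.vstack(blocks)
    return Dfull.T @ Dfull
```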
Subjects
Communicable Diseases, Humans, Computer Simulation, Least-Squares Analysis, Probability, Age Factors
ABSTRACT
Immune response decays over time, and vaccine-induced protection often wanes. Understanding how vaccine efficacy changes over time is critical to guiding the development and application of vaccines in preventing infectious diseases. The objective of this article is to develop statistical methods that assess the effect of decaying immune responses on the risk of disease and on vaccine efficacy, within the context of Cox regression with sparse sampling of immune responses, in a baseline-naive population. We aim to further disentangle the various aspects of the time-varying vaccine effect, whether direct on disease or mediated through immune responses. Based on time-to-event data from a vaccine efficacy trial and sparse sampling of longitudinal immune responses, we propose a weighted estimated induced likelihood approach that models the longitudinal immune response trajectory and the time to event separately. This approach assesses the effects of the decaying immune response, the peak immune response, and/or the waning vaccine effect on the risk of disease. The proposed method is applicable not only to standard randomized trial designs but also to augmented vaccine trial designs that re-vaccinate uninfected placebo recipients at the end of the standard trial period. We conducted simulation studies to evaluate the performance of our method and applied the method to analyze immune correlates from a phase III SARS-CoV-2 vaccine trial.
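Schematically, one specification in this spirit (an illustration, not the paper's exact model) writes the hazard as

\[ \lambda\{t \mid Z, S(\cdot)\} = \lambda_0(t)\, \exp\{\beta_1 Z + \beta_2 Z t + \gamma\, S(t)\}, \]

where \(Z\) is vaccination status, \(\beta_2\) captures waning of the direct vaccine effect, and \(\gamma\) the effect of the decaying immune response \(S(t)\); because \(S(t)\) is only sparsely sampled, the proposed approach models the immune-response trajectory and works with an induced, weighted likelihood rather than plugging in observed values.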
ABSTRACT
The rich longitudinal individual-level data available from electronic health records (EHRs) can be used to examine treatment effect heterogeneity. However, estimating treatment effects from EHR data poses several challenges, including time-varying confounding; repeated and temporally non-aligned measurements of covariates, treatment assignments, and outcomes; and loss to follow-up due to dropout. Here, we develop the subgroup discovery for longitudinal data algorithm, a tree-based method for discovering subgroups with heterogeneous treatment effects in longitudinal data that combines the generalized interaction tree algorithm, a general data-driven method for subgroup discovery, with longitudinal targeted maximum likelihood estimation. We apply the algorithm to EHR data to discover subgroups of people living with human immunodeficiency virus who are at higher risk of weight gain when receiving dolutegravir (DTG)-containing antiretroviral therapies (ARTs) than when receiving non-DTG-containing ARTs.
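As a toy illustration of the interaction-tree ingredient (assuming per-person treatment-effect estimates, for example from longitudinal targeted maximum likelihood estimation, are already available; the actual algorithm's splitting criterion differs in detail):

```python
import numpy as np

def best_split(X: np.ndarray, tau: np.ndarray, min_leaf: int = 50):
    """Greedy search for the (feature, threshold) split whose children
    differ most in mean estimated individual treatment effect tau."""
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        for c in np.quantile(X[:, j], np.linspace(0.1, 0.9, 9)):
            left = X[:, j] <= c
            if left.sum() < min_leaf or (~left).sum() < min_leaf:
                continue
            gap = abs(tau[left].mean() - tau[~left].mean())
            if gap > best[2]:
                best = (j, c, gap)
    return best  # recursing on the two children grows the subgroup tree
```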
Subjects
Electronic Health Records, HIV Infections, 3-Ring Heterocyclic Compounds, Piperazines, Pyridones, Humans, Treatment Effect Heterogeneity, Oxazines, HIV Infections/drug therapy
ABSTRACT
In recent years, the study of hybridization and introgression has made significant progress, with ghost introgression (the transfer of genetic material from extinct or unsampled lineages to extant species) emerging as a key area for research. Accurately identifying ghost introgression, however, presents a challenge. To address this issue, we focused on simple cases involving 3 species with a known phylogenetic tree. Using mathematical analyses and simulations, we evaluated the performance of popular phylogenetic methods, including HyDe and PhyloNet/MPL, and the full-likelihood method Bayesian Phylogenetics and Phylogeography (BPP), in detecting ghost introgression. Our findings suggest that heuristic approaches relying on site-pattern counts or gene-tree topologies struggle to differentiate ghost introgression from introgression between sampled non-sister species, frequently leading to incorrect identification of donor and recipient species. By contrast, the full-likelihood method BPP uses multilocus sequence alignments directly, taking into account both gene-tree topologies and branch lengths, and is therefore capable of detecting ghost introgression in phylogenomic datasets. We analyzed a real-world phylogenomic dataset of 14 species of Jaltomata (Solanaceae) to showcase the potential of full-likelihood methods for accurate inference of introgression.
Subjects
Classification, Phylogeny, Classification/methods, Genetic Introgression, Genetic Hybridization, Phylogeography/methods, Computer Simulation
ABSTRACT
Online phylogenetic inference methods add sequentially arriving sequences to an inferred phylogeny without recomputing the entire tree from scratch. Some online method implementations already exist, but there remains concern that additional sequences may change the topological relationships among the original set of taxa. We call such a change in tree topology a lack of stability for the inferred tree. In this paper, we analyze the stability of single-taxon addition in a maximum likelihood framework across 1,000 empirical datasets. We find that instability occurs in almost 90% of our examples, although observed topological differences do not always reach significance under the AU test. Changes in tree topology after the addition of a taxon rarely occur close to its attachment location and are more frequently observed in more distant tree locations carrying low bootstrap support. To investigate whether instability is predictable, we hypothesize sources of instability and design summary statistics addressing these hypotheses. Using these summary statistics as input features for machine learning with random forests, we are able to predict instability and identify the most influential features. In summary, it does not appear that a strict insertion-only online inference method will deliver globally optimal trees, although relaxing insertion strictness by allowing a small number of final tree rearrangements or accepting slightly suboptimal solutions appears feasible.
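A minimal sketch of the prediction step, with synthetic stand-ins for the hypothesized summary statistics (the feature construction here is purely illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Stand-ins for hypothesized instability drivers, e.g. mean bootstrap
# support and distance from the attachment edge to low-support edges.
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) < 0).astype(int)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
print(cross_val_score(rf, X, y, cv=5, scoring="roc_auc").mean())
print(rf.feature_importances_)  # ranks the hypothesized drivers
```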
ABSTRACT
Phylogenies are central to many research areas in biology and are commonly estimated using likelihood-based methods. Unfortunately, any likelihood-based method, including Bayesian inference, can be restrictively slow for large datasets (with many taxa and/or many sites in the sequence alignment) or complex substitution models. The primary limiting factor when using large datasets and/or complex models in probabilistic phylogenetic analyses is the likelihood calculation, which dominates the total computation time. To address this bottleneck, we incorporated the high-performance phylogenetic library BEAGLE into RevBayes, which enables multi-threading on multi-core CPUs and GPUs, as well as hardware-specific vectorized instructions for faster likelihood calculations. Our new implementation of RevBayes+BEAGLE retains the flexibility and dynamic nature that users expect from vanilla RevBayes. In addition, we implemented native parallelization within RevBayes without an external library using the message passing interface (MPI): RevBayes+MPI. We evaluated our new implementation of RevBayes+BEAGLE using multi-threading on CPUs and 2 different powerful GPUs (NVIDIA Titan V and NVIDIA A100) against our native implementation of RevBayes+MPI. We found good improvements in speedup when multiple cores were used, with up to 20-fold speedup when using multiple CPU cores and over 90-fold speedup when using multiple GPU cores. The improvement depended on the data type used, DNA or amino acids, and the size of the alignment, but less on the size of the tree. We additionally investigated the cost of rescaling partial likelihoods to avoid numerical underflow and showed that unnecessarily frequent and inefficient rescaling can increase runtimes up to 4-fold. Finally, we presented and compared a new approach that stores partial likelihoods on branches instead of nodes, which can speed up computations up to 1.7 times but doubles the memory requirement.
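The rescaling cost examined above concerns steps like the following generic pruning update (a sketch in the style of any such library, not BEAGLE's or RevBayes' actual code):

```python
import numpy as np

def combine_children(pl_left, pl_right, P_left, P_right):
    """Merge two children's conditional likelihoods at an internal node,
    rescaling per site to avoid underflow. pl_*: (sites, states) partials;
    P_*: (states, states) branch transition matrices. Returns rescaled
    partials and per-site log scalers to add back into the likelihood."""
    partial = (pl_left @ P_left.T) * (pl_right @ P_right.T)
    scale = partial.max(axis=1, keepdims=True)  # per-site scaling factor
    return partial / scale, np.log(scale).ravel()
```

Rescaling at every node keeps values in range but adds a pass over the partials at each update, which is why doing it unnecessarily often can produce the up to 4-fold slowdown reported here.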
Subjects
Bayes Theorem, Phylogeny, Software, Classification/methods, Computational Biology/methods
ABSTRACT
Maximum likelihood (ML) phylogenetic inference is widely used in phylogenomics. As heuristic searches most likely find suboptimal trees, it is recommended to conduct multiple (e.g., ten) tree searches in phylogenetic analyses. However, beyond its positive role, how and to what extent multiple tree searches aid ML phylogenetic inference remains poorly explored. Here, we found that a random starting tree was not as effective as BioNJ and parsimony starting trees in inferring ML gene trees, and that RAxML-NG and PhyML were less sensitive to different starting trees than IQ-TREE. We then examined the effect of the number of tree searches on ML tree inference with IQ-TREE and RAxML-NG by running 100 tree searches on 19,414 gene alignments from 15 animal, plant, and fungal phylogenomic datasets. We found that the number of tree searches substantially impacted the recovery of the best-of-100 ML gene tree topology for a given ML program. In addition, all of the concatenation-based trees were topologically identical when the number of tree searches was ≥10. Quartet-based ASTRAL trees inferred from 1 to 80 tree searches differed topologically from those inferred from 100 tree searches for 6/15 phylogenomic datasets. Lastly, our simulations showed that gene alignments with lower difficulty scores had a higher chance of finding the best-of-100 gene tree topology and were more likely to yield the correct trees.
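As a back-of-the-envelope guide: if independent searches each found the best topology with probability \(p\), then \(n\) searches would recover it with probability

\[ 1 - (1 - p)^n, \]

so \(p = 0.2\) would require \(n = \lceil \log 0.05 / \log 0.8 \rceil = 14\) searches for 95% confidence. The empirical results above indicate where this idealized independence picture does and does not hold.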
ABSTRACT
Variation in gene tree estimates is widely observed in empirical phylogenomic data and is often assumed to be the result of biological processes. However, a recent study using tetrapod mitochondrial genomes to control for biological sources of variation due to their haploid, uniparentally inherited, and non-recombining nature found that levels of discordance among mitochondrial gene trees were comparable to those found in studies that assume only biological sources of variation. Additionally, they found that several of the models of sequence evolution chosen to infer gene trees were doing an inadequate job fitting the sequence data. These results indicated that significant amounts of gene tree discordance in empirical data may be due to poor fit of sequence evolution models, and that more complex and biologically realistic models may be needed. To test how the fit of sequence evolution models relates to gene tree discordance, we analyzed the same mitochondrial datasets as the previous study using two additional, more complex models of sequence evolution that each includes a different biologically realistic aspect of the evolutionary process: a covarion model to incorporate site-specific rate variation across lineages (heterotachy), and a partitioned model to incorporate variable evolutionary patterns by codon position. Our results show that both additional models fit the data better than the models used in the previous study, with the covarion being consistently and strongly preferred as tree size increases. However, even these more preferred models still inferred highly discordant mitochondrial gene trees, thus deepening the mystery around what we label the "Mito-Phylo Paradox" and leading us to ask whether the observed variation could, in fact, be biological in nature after all.
ABSTRACT
While phylogenies have been essential in understanding how species evolve, they do not adequately describe some evolutionary processes. For instance, hybridization, a common phenomenon where interbreeding between two species leads to the formation of a new species, must be depicted by a phylogenetic network, a structure that modifies a phylogenetic tree by allowing two branches to merge into one, resulting in reticulation. However, existing methods for estimating networks become computationally expensive as the dataset size and/or topological complexity increase. The lack of methods for scalable inference keeps phylogenetic networks from being widely used in practice, despite accumulating evidence that hybridization occurs frequently in nature. Here, we propose a novel method, PhyNEST (Phylogenetic Network Estimation using SiTe patterns), that estimates binary, level-1 phylogenetic networks with a fixed, user-specified number of reticulations directly from sequence data. By using the composite likelihood as the basis for inference, PhyNEST is able to use the full genomic data in a computationally tractable manner, eliminating the need to summarize the data as a set of gene trees prior to network estimation. To search network space, PhyNEST implements both hill climbing and simulated annealing algorithms. PhyNEST assumes that the data are composed of coalescent independent sites that evolve according to the Jukes-Cantor substitution model and that the network has a constant effective population size. Simulation studies demonstrate that PhyNEST is often more accurate than two existing composite likelihood summary methods (SNaQ and PhyloNet) and that it is robust to at least one form of model misspecification (assuming a less complex nucleotide substitution model than the true generating model). We applied PhyNEST to reconstruct the evolutionary relationships among Heliconius butterflies and Papionini primates, characterized by hybrid speciation and widespread introgression, respectively. PhyNEST is implemented in an open-source Julia package and is publicly available at https://github.com/sungsik-kong/PhyNEST.jl.
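A generic sketch of the simulated-annealing component (in spirit only; PhyNEST itself is a Julia package whose moves and cooling schedule may differ):

```python
import math, random

def anneal(init, neighbor, loglik, t0=1.0, cooling=0.99, steps=5000):
    """Maximize a (composite) log-likelihood over network space: always
    accept uphill moves, accept downhill moves with probability
    exp(delta / temperature), and cool geometrically."""
    cur, cur_ll = init, loglik(init)
    best, best_ll, t = cur, cur_ll, t0
    for _ in range(steps):
        cand = neighbor(cur)  # e.g. a local network rearrangement
        ll = loglik(cand)
        if ll >= cur_ll or random.random() < math.exp((ll - cur_ll) / t):
            cur, cur_ll = cand, ll
            if ll > best_ll:
                best, best_ll = cand, ll
        t *= cooling
    return best, best_ll
```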
ABSTRACT
Speech disorders are associated with different degrees of functional and structural abnormalities. However, the abnormalities associated with specific disorders, and the common abnormalities shown by all disorders, remain unclear. Herein, a meta-analysis was conducted to integrate the results of 70 studies that compared 1,843 speech disorder patients (dysarthria, dysphonia, stuttering, and aphasia) to 1,950 healthy controls in terms of brain activity, functional connectivity, gray matter, and white matter fractional anisotropy. The analysis revealed that, compared to controls, the dysarthria group showed higher activity in the left superior temporal gyrus and lower activity in the left postcentral gyrus. The dysphonia group had higher activity in the right precentral and postcentral gyrus. The stuttering group had higher activity in the right inferior frontal gyrus and lower activity in the left inferior frontal gyrus. The aphasia group showed lower activity in the bilateral anterior cingulate gyrus and left superior frontal gyrus. Across the four disorders, there were concurrent reductions in activity, gray matter, and fractional anisotropy in motor and auditory cortices, and stronger connectivity between the default mode network and the frontoparietal network. These findings enhance our understanding of the neural basis of speech disorders, potentially aiding clinical diagnosis and intervention.
Subjects
Aphasia, Auditory Cortex, Dysphonia, Stuttering, Humans, Dysarthria, Likelihood Functions, Speech Disorders
ABSTRACT
Significance
The analysis of complex systems with many degrees of freedom generally involves the definition of low-dimensional collective variables more amenable to physical understanding. Their dynamics can be modeled by generalized Langevin equations, whose coefficients have to be estimated from simulations of the initial high-dimensional system. These equations feature a memory kernel describing the mutual influence of the low-dimensional variables and their environment. We introduce and implement an approach where the generalized Langevin equation is designed to maximize the statistical likelihood of the observed data. This provides an efficient way to generate reduced models to study dynamical properties of complex processes such as chemical reactions in solution, conformational changes in biomolecules, or phase transitions in condensed matter systems.
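The generalized Langevin equation referred to here has the standard form

\[ m\,\ddot{x}(t) = -\nabla U\big(x(t)\big) - \int_0^t K(t - s)\, \dot{x}(s)\, \mathrm{d}s + \eta(t), \]

where \(K\) is the memory kernel and the random force \(\eta\) obeys the fluctuation-dissipation relation \(\langle \eta(t)\,\eta(t') \rangle = k_B T\, K(t - t')\); the approach described above selects \(K\) and the other coefficients by maximizing the likelihood of the observed trajectories.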
Subjects
Molecular Dynamics Simulation, Likelihood Functions
ABSTRACT
BACKGROUND AND AIMS: Whether index testing with coronary computed tomography angiography (CTA) improves outcomes in stable chest pain is debated. The risk factor weighted clinical likelihood (RF-CL) model provides likelihood estimates of obstructive coronary artery disease. This study investigated the prognostic effect of coronary CTA vs. usual care across RF-CL estimates. METHODS: Large-scale studies randomized patients (N = 13,748) with stable chest pain to coronary CTA as part of the initial work-up, in addition to or instead of usual care including functional testing. Patients were stratified according to RF-CL estimates [very low (≤5%), low (>5%-15%), and moderate/high (>15%)]. The primary endpoint was myocardial infarction or death at 3 years. RESULTS: The primary endpoint occurred in 313 (2.3%) patients. Event rates were similar in patients allocated to coronary CTA vs. usual care [risk difference (RD) 0.3%, hazard ratio (HR) 0.84 (95% CI 0.67-1.05)]. Overall, 33%, 44%, and 23% of patients had very low, low, and moderate/high RF-CL, respectively. Risk was similar in patients with very low and moderate/high RF-CL allocated to coronary CTA vs. usual care [very low: RD 0.3%, HR 1.27 (0.74-2.16); moderate/high: RD 0.5%, HR 0.88 (0.63-1.23)]. Conversely, patients with low RF-CL undergoing coronary CTA had lower event rates [RD 0.7%, HR 0.67 (95% CI 0.47-0.97)]. The number needed to test with coronary CTA to prevent one event within 3 years was 143. CONCLUSIONS: Despite an overall good prognosis, patients with low RF-CL have a reduced risk of myocardial infarction or death when allocated to coronary CTA vs. usual care. Risk is similar in patients with very low and moderate/high likelihood.
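For the low RF-CL group, the number needed to test follows directly from the absolute risk difference: \(\mathrm{NNT} = 1/\mathrm{RD} = 1/0.007 \approx 143\).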
ABSTRACT
There is increasing interest in using multiple types of omics features (e.g., DNA sequences, RNA expression, methylation, protein expression, and metabolic profiles) to study how the relationships between phenotypes and genotypes may be mediated by other omics markers. Genotypes and phenotypes are typically available for all subjects in genetic studies, but some omics data will often be missing for some subjects owing to limitations such as cost and sample quality. In this article, we propose a powerful approach for mediation analysis that accommodates missing data among multiple mediators and allows for various interaction effects. We formulate the relationships among genetic variants, other omics measurements, and phenotypes through linear regression models. We derive the joint likelihood for models with two mediators, accounting for arbitrary patterns of missing values. Utilizing computationally efficient and stable algorithms, we conduct maximum likelihood estimation. Our methods produce unbiased and statistically efficient estimators. We demonstrate the usefulness of our methods through simulation studies and an application to the Metabolic Syndrome in Men study.
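One schematic two-mediator specification of the kind described (illustrative; the proposed models allow various interaction effects and arbitrary missingness in \(M_1\) and \(M_2\)):

\[ \begin{aligned} M_1 &= \alpha_{10} + \alpha_{11} G + \varepsilon_1,\\ M_2 &= \alpha_{20} + \alpha_{21} G + \alpha_{22} M_1 + \varepsilon_2,\\ Y &= \beta_0 + \beta_1 G + \beta_2 M_1 + \beta_3 M_2 + \beta_4 G M_1 + \varepsilon_Y, \end{aligned} \]

with the joint likelihood for subjects missing \(M_1\) and/or \(M_2\) obtained by integrating the corresponding mediator models out of the complete-data likelihood.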
Subjects
Mediation Analysis, Genetic Models, Humans, Genotype, Computer Simulation, Likelihood Functions, Algorithms
ABSTRACT
Variation in RNA-Seq data creates modeling challenges for differential gene expression (DE) analysis. Statistical approaches developed for conventional small sample sizes implement empirical Bayes or non-parametric tests, but frequently produce different conclusions. Increasing sample sizes enable alternative DE paradigms. Here we develop RoPE, which uses a data-driven adjustment for variation and a robust profile likelihood ratio DE test. Simulation studies show that RoPE can outperform existing tools as sample size increases and has the most reliable control of error rates. Applying RoPE demonstrates that an active Pseudomonas aeruginosa infection downregulates the SLC9A3 cystic fibrosis modifier gene.