ABSTRACT
Mendelian randomization (MR) utilizes genome-wide association study (GWAS) summary data to infer causal relationships between exposures and outcomes, offering a valuable tool for identifying disease risk factors. Multivariable MR (MVMR) estimates the direct effects of multiple exposures on an outcome. This study tackles the issue of highly correlated exposures commonly observed in metabolomic data, a situation where existing MVMR methods often face reduced statistical power due to multicollinearity. We propose a robust extension of the MVMR framework that leverages constrained maximum likelihood (cML) and employs a Bayesian approach for identifying independent clusters of exposure signals. Applying our method to the UK Biobank metabolomic data for the largest Alzheimer disease (AD) cohort through a two-sample MR approach, we identified two independent signal clusters for AD: glutamine and lipids, with posterior inclusion probabilities (PIPs) of 95.0% and 81.5%, respectively. Our findings corroborate the hypothesized roles of glutamate and lipids in AD, providing quantitative support for their potential involvement.
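As a rough illustration of the multivariable MR setting described above, the sketch below fits the standard inverse-variance weighted (IVW) multivariable regression for two exposures from summary statistics. It is not the cML-based extension proposed in the abstract; the function name, the toy effect sizes, and the two-exposure restriction are all assumptions made for illustration.

```python
from typing import List, Tuple


def mvmr_ivw_two_exposures(
    bx1: List[float], bx2: List[float], by: List[float], se_y: List[float]
) -> Tuple[float, float]:
    """Weighted least squares of SNP-outcome effects on two sets of
    SNP-exposure effects (no intercept), with weights 1 / se_y^2."""
    w = [1.0 / s ** 2 for s in se_y]
    s11 = sum(wi * a * a for wi, a in zip(w, bx1))
    s22 = sum(wi * b * b for wi, b in zip(w, bx2))
    s12 = sum(wi * a * b for wi, a, b in zip(w, bx1, bx2))
    s1y = sum(wi * a * y for wi, a, y in zip(w, bx1, by))
    s2y = sum(wi * b * y for wi, b, y in zip(w, bx2, by))
    det = s11 * s22 - s12 * s12  # approaches 0 when exposures are collinear
    if det == 0.0:
        raise ValueError("exposure effects are perfectly collinear")
    theta1 = (s22 * s1y - s12 * s2y) / det
    theta2 = (s11 * s2y - s12 * s1y) / det
    return theta1, theta2


# Hypothetical, noise-free toy data with direct effects 0.5 and -0.3.
bx1 = [0.10, 0.20, 0.05, 0.15]
bx2 = [0.02, 0.10, 0.12, 0.04]
by = [0.5 * a - 0.3 * b for a, b in zip(bx1, bx2)]
se_y = [0.01, 0.01, 0.01, 0.01]
theta = mvmr_ivw_two_exposures(bx1, bx2, by, se_y)
```

When the two columns of exposure effects are nearly proportional, the determinant `det` is close to zero and the estimates become unstable, which is the multicollinearity problem the abstract refers to.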
Subjects
Alzheimer Disease, Bayes Theorem, Genome-Wide Association Study, Mendelian Randomization Analysis, Metabolomics, Humans, Alzheimer Disease/genetics, Metabolomics/methods, Single Nucleotide Polymorphism, Glutamine/metabolism, Glutamine/genetics, Lipids/blood, Lipids/genetics
ABSTRACT
We have recently introduced MAPLE (MAximum Parsimonious Likelihood Estimation), a new pandemic-scale phylogenetic inference method exclusively designed for genomic epidemiology. In response to the need for enhancing MAPLE's performance and scalability, here we present two key components: (i) CMAPLE software, a highly optimized C++ reimplementation of MAPLE with many new features and advancements, and (ii) CMAPLE library, a suite of application programming interfaces to facilitate the integration of the CMAPLE algorithm into existing phylogenetic inference packages. Notably, we have successfully integrated CMAPLE into the widely used IQ-TREE 2 software, enabling its rapid adoption in the scientific community. These advancements serve as a vital step toward better preparedness for future pandemics, offering researchers powerful tools for large-scale pathogen genomic analysis.
Subjects
Phylogeny, Software, Algorithms, Pandemics, Likelihood Functions, Humans
ABSTRACT
Profile mixture models capture distinct biochemical constraints on the amino acid substitution process at different sites in proteins. These models feature a mixture of time-reversible models with a common matrix of exchangeabilities and distinct sets of equilibrium amino acid frequencies known as profiles. Combining the exchangeability matrix with each profile generates the matrix of instantaneous rates of amino acid exchange for that profile. Currently, empirically estimated exchangeability matrices (e.g. the LG matrix) are widely used for phylogenetic inference under profile mixture models. However, these were estimated using a single profile and are unlikely to be optimal for profile mixture models. Here, we describe the GTRpmix model that allows maximum likelihood estimation of a common exchangeability matrix under any profile mixture model. We show that exchangeability matrices estimated under profile mixture models differ from the LG matrix, dramatically improving model fit and topological estimation accuracy for empirical test cases. Because the GTRpmix model is computationally expensive, we provide two exchangeability matrices estimated from large concatenated phylogenomic supermatrices to be used for phylogenetic analyses. One, called Eukaryotic Linked Mixture (ELM), is designed for phylogenetic analysis of proteins encoded by nuclear genomes of eukaryotes, and the other, Eukaryotic and Archaeal Linked mixture (EAL), for reconstructing relationships between eukaryotes and Archaea. These matrices, combined with profile mixture models, fit data better and have improved topology estimation relative to the LG matrix combined with the same mixture models. Starting with version 2.3.1, IQ-TREE2 allows users to estimate linked exchangeabilities (i.e. amino acid exchange rates) under profile mixture models.
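The step of combining a shared exchangeability matrix with each profile can be sketched as follows, using the usual time-reversible construction in which the rate from state i to state j is the exchangeability times the equilibrium frequency of j. The three-state alphabet, the numbers, and the function name are hypothetical; real amino acid models use 20 states.

```python
from typing import List


def rate_matrix(exch: List[List[float]], profile: List[float]) -> List[List[float]]:
    """Build a normalised reversible rate matrix Q from symmetric
    exchangeabilities and one profile of equilibrium frequencies."""
    n = len(profile)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exch[i][j] * profile[j]
        # Diagonal entry makes each row sum to zero.
        q[i][i] = -sum(q[i])
    # Rescale so that the expected substitution rate at equilibrium is 1.
    scale = -sum(profile[i] * q[i][i] for i in range(n))
    return [[x / scale for x in row] for row in q]


# Hypothetical toy inputs: one shared exchangeability matrix, two profiles.
exch = [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]]
profiles = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
q_per_profile = [rate_matrix(exch, p) for p in profiles]
```

Each mixture component thus gets its own rate matrix while sharing the same exchangeabilities, which is what "linked" exchangeabilities refers to.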
Subjects
Genetic Models, Phylogeny, Archaea/genetics, Likelihood Functions, Amino Acid Substitution, Molecular Evolution, Eukaryota/genetics
ABSTRACT
Tree tests like the Kishino-Hasegawa (KH) test and chi-square test suffer a selection bias that tests like the Shimodaira-Hasegawa (SH) test and approximately unbiased test were intended to correct. We investigate tree-testing performance in the presence of severe selection bias. The SH test is found to be very conservative and, surprisingly, its uncorrected analog, the KH test, has low Type I error even in the presence of extreme selection bias, leading to a recommendation that the SH test be abandoned. A chi-square test is found to usually behave well but to require correction in extreme cases. We show how topology testing procedures can be used to get support values for splits and compare the likelihood-based support values to the approximate likelihood ratio test (aLRT) support values. We find that the aLRT support values are reasonable even in settings with severe selection bias that they were not designed for. We also show how they can be used to construct tests of topologies and, in doing so, point out a multiple comparisons issue that should be considered when looking at support values for splits.
Subjects
Likelihood Functions, Phylogeny, Selection Bias
ABSTRACT
The rich longitudinal individual level data available from electronic health records (EHRs) can be used to examine treatment effect heterogeneity. However, estimating treatment effects using EHR data poses several challenges, including time-varying confounding, repeated and temporally non-aligned measurements of covariates, treatment assignments and outcomes, and loss-to-follow-up due to dropout. Here, we develop the subgroup discovery for longitudinal data algorithm, a tree-based algorithm for discovering subgroups with heterogeneous treatment effects using longitudinal data by combining the generalized interaction tree algorithm, a general data-driven method for subgroup discovery, with longitudinal targeted maximum likelihood estimation. We apply the algorithm to EHR data to discover subgroups of people living with human immunodeficiency virus who are at higher risk of weight gain when receiving dolutegravir (DTG)-containing antiretroviral therapies (ARTs) versus when receiving non-DTG-containing ARTs.
Subjects
Electronic Health Records, HIV Infections, 3-Ring Heterocyclic Compounds, Piperazines, Pyridones, Humans, Treatment Effect Heterogeneity, Oxazines, HIV Infections/drug therapy
ABSTRACT
Online phylogenetic inference methods add sequentially arriving sequences to an inferred phylogeny without the need to recompute the entire tree from scratch. Some online method implementations exist already, but there remains concern that additional sequences may change the topological relationship among the original set of taxa. We call such a change in tree topology a lack of stability for the inferred tree. In this paper, we analyze the stability of single taxon addition in a Maximum Likelihood framework across 1,000 empirical datasets. We find that instability occurs in almost 90% of our examples, although observed topological differences do not always reach significance under the AU-test. Changes in tree topology after addition of a taxon rarely occur close to its attachment location, and are more frequently observed in more distant tree locations carrying low bootstrap support. To investigate whether instability is predictable, we hypothesize sources of instability and design summary statistics addressing these hypotheses. Using these summary statistics as input features for machine learning under random forests, we are able to predict instability and can identify the most influential features. In summary, it does not appear that a strict insertion-only online inference method will deliver globally optimal trees, although relaxing insertion strictness by allowing for a small number of final tree rearrangements or accepting slightly suboptimal solutions appears feasible.
ABSTRACT
Maximum likelihood (ML) phylogenetic inference is widely used in phylogenomics. As heuristic searches most likely find suboptimal trees, it is recommended to conduct multiple (e.g., 10) tree searches in phylogenetic analyses. However, beyond its positive role, how and to what extent multiple tree searches aid ML phylogenetic inference remains poorly explored. Here, we found that a random starting tree was not as effective as the BioNJ and parsimony starting trees in inferring the ML gene tree and that RAxML-NG and PhyML were less sensitive to different starting trees than IQ-TREE. We then examined the effect of the number of tree searches on ML tree inference with IQ-TREE and RAxML-NG, by running 100 tree searches on 19,414 gene alignments from 15 animal, plant, and fungal phylogenomic datasets. We found that the number of tree searches substantially impacted the recovery of the best-of-100 ML gene tree topology among 100 searches for a given ML program. In addition, all of the concatenation-based trees were topologically identical if the number of tree searches was ≥10. Quartet-based ASTRAL trees inferred from 1 to 80 tree searches differed topologically from those inferred from 100 tree searches for 6/15 phylogenomic datasets. Finally, our simulations showed that gene alignments with lower difficulty scores had a higher chance of finding the best-of-100 gene tree topology and were more likely to yield the correct trees.
Subjects
Classification, Phylogeny, Classification/methods, Likelihood Functions, Animals, Genomics/methods, Plants/classification, Plants/genetics
ABSTRACT
Significance: The analysis of complex systems with many degrees of freedom generally involves the definition of low-dimensional collective variables more amenable to physical understanding. Their dynamics can be modeled by generalized Langevin equations, whose coefficients have to be estimated from simulations of the initial high-dimensional system. These equations feature a memory kernel describing the mutual influence of the low-dimensional variables and their environment. We introduce and implement an approach where the generalized Langevin equation is designed to maximize the statistical likelihood of the observed data. This provides an efficient way to generate reduced models to study dynamical properties of complex processes such as chemical reactions in solution, conformational changes in biomolecules, or phase transitions in condensed matter systems.
Subjects
Molecular Dynamics Simulation, Likelihood Functions
ABSTRACT
There is an increasing interest in using multiple types of omics features (e.g., DNA sequences, RNA expressions, methylation, protein expressions, and metabolic profiles) to study how the relationships between phenotypes and genotypes may be mediated by other omics markers. Genotypes and phenotypes are typically available for all subjects in genetic studies, but typically, some omics data will be missing for some subjects, due to limitations such as cost and sample quality. In this article, we propose a powerful approach for mediation analysis that accommodates missing data among multiple mediators and allows for various interaction effects. We formulate the relationships among genetic variants, other omics measurements, and phenotypes through linear regression models. We derive the joint likelihood for models with two mediators, accounting for arbitrary patterns of missing values. Utilizing computationally efficient and stable algorithms, we conduct maximum likelihood estimation. Our methods produce unbiased and statistically efficient estimators. We demonstrate the usefulness of our methods through simulation studies and an application to the Metabolic Syndrome in Men study.
Subjects
Mediation Analysis, Genetic Models, Humans, Genotype, Computer Simulation, Likelihood Functions, Algorithms
ABSTRACT
BACKGROUND: Chemoreception is crucial for insect fitness, underlying for instance food-, host-, and mate finding. Chemicals in the environment are detected by receptors from three divergent gene families: odorant receptors (ORs), gustatory receptors (GRs), and ionotropic receptors (IRs). However, how the chemoreceptor gene families evolve in parallel with ecological specializations remains poorly understood, especially in the order Coleoptera. Hence, we sequenced the genome and annotated the chemoreceptor genes of the specialised ambrosia beetle Trypodendron lineatum (Coleoptera, Curculionidae, Scolytinae) and compared its chemoreceptor gene repertoires with those of other scolytines with different ecological adaptations, as well as a polyphagous cerambycid species. RESULTS: We identified 67 ORs, 38 GRs, and 44 IRs in T. lineatum ('Tlin'). Across gene families, T. lineatum has fewer chemoreceptors compared to related scolytines, the coffee berry borer Hypothenemus hampei and the mountain pine beetle Dendroctonus ponderosae, and clearly fewer receptors than the polyphagous cerambycid Anoplophora glabripennis. The comparatively low number of chemoreceptors is largely explained by the scarcity of large receptor lineage radiations, especially among the bitter taste GRs and the 'divergent' IRs, and the absence of alternatively spliced GR genes. Only one non-fructose sugar receptor was found, suggesting several sugar receptors have been lost. Also, we found no orthologue in the 'GR215 clade', which is widely conserved across Coleoptera. Two TlinORs are orthologous to ORs that are functionally conserved across curculionids, responding to 2-phenylethanol (2-PE) and green leaf volatiles (GLVs), respectively. CONCLUSIONS: Trypodendron lineatum reproduces inside the xylem of decaying conifers where it feeds on its obligate fungal mutualist Phialophoropsis ferruginea. 
Like previous studies, our results suggest that stenophagy correlates with small chemoreceptor numbers in wood-boring beetles; indeed, the few GRs may be due to its restricted fungal diet. The presence of TlinORs orthologous to those detecting 2-PE and GLVs in other species suggests these compounds are important for T. lineatum. Future functional studies should test this prediction, and chemoreceptor annotations should be conducted on additional ambrosia beetle species to investigate whether few chemoreceptors is a general trait in this specialized group of beetles.
Subjects
Odorant Receptors, Animals, Odorant Receptors/genetics, Odorant Receptors/metabolism, Coleoptera/genetics, Phylogeny, Insect Proteins/genetics, Insect Proteins/metabolism
ABSTRACT
Repeated runs of the same program can generate different molecular phylogenies from identical data sets under the same analytical conditions. This lack of reproducibility of inferred phylogenies casts a long shadow on downstream research employing these phylogenies in areas such as comparative genomics, systematics, and functional biology. We have assessed the relative accuracies and log-likelihoods of alternative phylogenies generated for computer-simulated and empirical data sets. Our findings indicate that these alternative phylogenies reconstruct evolutionary relationships with comparable accuracy. They also have similar log-likelihoods that are not inferior to the log-likelihoods of the true tree. We determined that the direct relationship between irreproducibility and inaccuracy is due to their common dependence on the amount of phylogenetic information in the data. While computational reproducibility can be enhanced through more extensive heuristic searches for the maximum likelihood tree, this does not lead to higher accuracy. We conclude that computational irreproducibility plays a minor role in molecular phylogenetics.
Subjects
Biological Evolution, Genomics, Phylogeny, Reproducibility of Results, Computer Simulation
ABSTRACT
Phylogenetic inferences under the maximum likelihood criterion deploy heuristic tree search strategies to explore the vast search space. Depending on the input dataset, searches from different starting trees might all converge to a single tree topology. Often, though, distinct searches infer multiple topologies with large log-likelihood score differences or yield topologically highly distinct, yet almost equally likely, trees. Recently, Haag et al. introduced an approach to quantify, and implemented machine learning methods to predict, the dataset difficulty with respect to phylogenetic inference. Easy multiple sequence alignments (MSAs) exhibit a single likelihood peak on their likelihood surface, associated with a single tree topology to which most, if not all, independent searches rapidly converge. As difficulty increases, multiple locally optimal likelihood peaks emerge, yet from highly distinct topologies. To make use of this information, we introduce and implement an adaptive tree search heuristic in RAxML-NG, which modifies the thoroughness of the tree search strategy as a function of the predicted difficulty. Our adaptive strategy is based upon three observations. First, on easy datasets, searches converge rapidly and can hence be terminated at an earlier stage. Second, overanalyzing difficult datasets is hopeless, and thus it suffices to quickly infer only one of the numerous almost equally likely topologies to reduce overall execution time. Third, more extensive searches are justified and required on datasets with intermediate difficulty. While the likelihood surface exhibits multiple locally optimal peaks in this case, a small proportion of them is significantly better. Our experimental results for the adaptive heuristic on 9,515 empirical and 5,000 simulated datasets with varying difficulty exhibit substantial speedups, especially on easy and difficult datasets (53% of total MSAs), where we observe average speedups of more than 10×. 
Further, approximately 94% of the inferred trees using the adaptive strategy are statistically indistinguishable from the trees inferred under the standard strategy (RAxML-NG).
Subjects
Algorithms, Phylogeny, Likelihood Functions, Sequence Alignment
ABSTRACT
Likelihood-based tests of phylogenetic trees are a foundation of modern systematics. Over the past decade, an enormous wealth and diversity of model-based approaches have been developed for phylogenetic inference of both gene trees and species trees. However, while many techniques exist for conducting formal likelihood-based tests of gene trees, such frameworks are comparatively underdeveloped and underutilized for testing species tree hypotheses. To date, widely used tests of tree topology are designed to assess the fit of classical models of molecular sequence data and individual gene trees and thus are not readily applicable to the problem of species tree inference. To address this issue, we derive several analogous likelihood-based approaches for testing topologies using modern species tree models and heuristic algorithms that use gene tree topologies as input for maximum likelihood estimation under the multispecies coalescent. For the purpose of comparing support for species trees, these tests leverage the statistical procedures of their original gene tree-based counterparts that have an extended history for testing phylogenetic hypotheses at a single locus. We discuss and demonstrate a number of applications, limitations, and important considerations of these tests using simulated and empirical phylogenomic data sets that include both bifurcating topologies and reticulate network models of species relationships. Finally, we introduce the open-source R package SpeciesTopoTestR (SpeciesTopology Tests in R) that includes a suite of functions for conducting formal likelihood-based tests of species topologies given a set of input gene tree topologies.
Subjects
Algorithms, Genetic Models, Phylogeny, Likelihood Functions
ABSTRACT
Targeted maximum likelihood estimation (TMLE) is increasingly used for doubly robust causal inference, but how missing data should be handled when using TMLE with data-adaptive approaches is unclear. Based on data (1992-1998) from the Victorian Adolescent Health Cohort Study, we conducted a simulation study to evaluate 8 missing-data methods in this context: complete-case analysis, extended TMLE incorporating an outcome-missingness model, the missing covariate missing indicator method, and 5 multiple imputation (MI) approaches using parametric or machine-learning models. We considered 6 scenarios that varied in terms of exposure/outcome generation models (presence of confounder-confounder interactions) and missingness mechanisms (whether outcome influenced missingness in other variables and presence of interaction/nonlinear terms in missingness models). Complete-case analysis and extended TMLE had small biases when outcome did not influence missingness in other variables. Parametric MI without interactions had large bias when exposure/outcome generation models included interactions. Parametric MI including interactions performed best in bias and variance reduction across all settings, except when missingness models included a nonlinear term. When choosing a method for handling missing data in the context of TMLE, researchers must consider the missingness mechanism and, for MI, compatibility with the analysis method. In many settings, a parametric MI approach that incorporates interactions and nonlinearities is expected to perform well.
Subjects
Causality, Humans, Likelihood Functions, Adolescent, Statistical Data Interpretation, Bias, Statistical Models, Computer Simulation
ABSTRACT
To optimize colorectal cancer (CRC) surveillance, accurate information on the risk of developing CRC from premalignant lesions is essential. However, directly observing this risk is challenging since precursor lesions, i.e., advanced adenomas (AAs), are removed upon detection. Statistical methods for multistate models can estimate risks, but estimation is challenging due to low CRC incidence. We propose an outcome-dependent sampling (ODS) design for this problem in which we oversample CRCs. More specifically, we propose a three-state model for jointly estimating the time distributions from baseline colonoscopy to AA and from AA onset to CRC accounting for the ODS design using a weighted likelihood approach. We applied the methodology to a sample from a Norwegian adenoma cohort (1993-2007), comprising 1,495 individuals (median follow-up 6.8 years [IQR: 1.1-12.8 years]) of whom 648 did and 847 did not develop CRC. We observed a 5-year AA risk of 13% and 34% for individuals having non-advanced adenoma (NAA) and AA removed at baseline colonoscopy, respectively. Upon AA development, the subsequent risk to develop CRC in 5 years was 17% and age-dependent. These estimates provide a basis for optimizing surveillance intensity and determining the optimal trade-off between CRC prevention, costs, and use of colonoscopy resources.
ABSTRACT
The two-phase study design is a cost-efficient sampling strategy when certain data elements are expensive and, thus, can only be collected on a sub-sample of subjects. To date, guidance on how best to allocate resources within the design has assumed that primary interest lies in estimating association parameters. When primary interest lies in the development and evaluation of a risk prediction tool, however, such guidance may, in fact, be detrimental. To resolve this, we propose a novel strategy for resource allocation based on oversampling cases and subjects who have more extreme risk estimates according to a preliminary model developed using fully observed predictors. Key to the proposed strategy is that it focuses on enhancing efficiency regarding estimation of measures of predictive accuracy, rather than on efficiency regarding association parameters, which is the standard paradigm. Towards valid estimation and inference for accuracy measures using the resultant data, we extend an existing semiparametric maximum likelihood method for estimating odds ratio association parameters to accommodate the biased sampling scheme and data incompleteness. Motivated by our sampling design, we additionally propose a general post-stratification scheme for analyzing general two-phase data for estimating predictive accuracy measures. Through theoretical calculations and simulation studies, we show that the proposed sampling strategy and post-stratification scheme achieve the promised efficiency improvement. Finally, we apply the proposed methods to develop and evaluate a preliminary model for predicting the risk of hospital readmission after cardiac surgery using data from the Pennsylvania Health Care Cost Containment Council.
Subjects
Research Design, Humans, Computer Simulation, Probability
ABSTRACT
Metagenomic next-generation sequencing (mNGS) enables comprehensive pathogen detection and has become increasingly popular in clinical diagnosis. The distinct pathogenic traits between strains require mNGS to achieve a strain-level resolution, but an equivocal concept of 'strain' as well as the low pathogen loads in most clinical specimens hinder such strain awareness. Here we introduce a metagenomic intra-species typing (MIST) tool (https://github.com/pandafengye/MIST), which hierarchically organizes reference genomes based on average nucleotide identity (ANI) and performs maximum likelihood estimation to infer the strain-level compositional abundance. In silico analysis using synthetic datasets showed that MIST accurately predicted the strain composition at a 99.9% ANI resolution with a sequencing depth of merely 0.001×. When applying MIST to 359 culture-positive and 359 culture-negative real-world specimens of infected body fluids, we found that the presence of multiple strains reached considerable frequencies (30.39%-93.22%), which were otherwise underestimated by current diagnostic techniques due to their limited resolution. Several high-risk clones were identified to be prevalent across samples, including Acinetobacter baumannii sequence type (ST)208/ST195, Staphylococcus aureus ST22/ST398 and Klebsiella pneumoniae ST11/ST15, indicating potential outbreak events occurring in the clinical settings. Interestingly, contaminations caused by the engineered Escherichia coli strains K-12 and BL21 throughout the mNGS datasets were also identified by MIST instead of the statistical decontamination approach. Our study systematically characterized the infected body fluids at the strain level for the first time. Extension of mNGS testing to the strain level can greatly benefit clinical diagnosis of bacterial infections, including the identification of multi-strain infection, decontamination and infection control surveillance.
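One common way to obtain maximum likelihood estimates of compositional abundance is an expectation-maximization (EM) iteration over a mixture model, sketched below. This is a generic illustration only: the abstract does not state MIST's likelihood or optimizer, and the function name, the per-read likelihood table, and the iteration count are assumptions.

```python
from typing import List


def em_abundances(read_lik: List[List[float]], n_iter: int = 200) -> List[float]:
    """Estimate mixture proportions of k strains by EM.

    read_lik[r][j] is the (assumed known) likelihood of read r given strain j.
    """
    k = len(read_lik[0])
    w = [1.0 / k] * k  # start from uniform abundances
    for _ in range(n_iter):
        counts = [0.0] * k
        for row in read_lik:
            total = sum(wj * lj for wj, lj in zip(w, row))
            if total == 0.0:
                continue  # read is not explained by any strain
            # E-step: fractional assignment of this read to each strain.
            for j in range(k):
                counts[j] += w[j] * row[j] / total
        s = sum(counts)
        if s == 0.0:
            raise ValueError("no read is explained by any strain")
        # M-step: abundances are the normalised expected read counts.
        w = [c / s for c in counts]
    return w


# Hypothetical toy input: three reads match strain 0 only, one matches strain 1 only.
abundances = em_abundances([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

With unambiguous reads as in the toy input, the estimate is simply the fraction of reads per strain; the iteration matters when reads are compatible with several closely related strains.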
Subjects
Bacterial Infections, Body Fluids, Bacterial Infections/diagnosis, High-Throughput Nucleotide Sequencing/methods, Humans, Metagenomics/methods, Nucleotides
ABSTRACT
Consider the problem of estimating the branch lengths in a symmetric 2-state substitution model with a known topology and a general, clock-like or star-shaped tree with three leaves. We show that the maximum likelihood estimates are analytically tractable and can be obtained from pairwise sequence comparisons. Furthermore, we demonstrate that this property does not generalize to larger state spaces, more complex models or larger trees. Our arguments are based on an enumeration of the free parameters of the model and the dimension of the minimal sufficient data vector. Our interest in this problem arose from discussions with our former colleague Freddy Bugge Christiansen.
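For the general three-leaf tree, the closed form can be sketched as follows, assuming the symmetric 2-state model is parameterised so that two sequences separated by path length t differ at a site with probability (1 - exp(-2t))/2, and assuming the estimates fall in the interior of the parameter space (all observed proportions below 0.5 and all resulting lengths nonnegative). The function names are hypothetical and this is one reading of the result, not the paper's own derivation.

```python
import math
from typing import Tuple


def two_state_distance(p: float) -> float:
    """Path length implied by a proportion p of differing sites
    under the symmetric 2-state model."""
    if not 0.0 <= p < 0.5:
        raise ValueError("proportion of differences must lie in [0, 0.5)")
    return -0.5 * math.log(1.0 - 2.0 * p)


def three_leaf_branch_lengths(
    p_ab: float, p_ac: float, p_bc: float
) -> Tuple[float, float, float]:
    """Branch lengths leading to leaves A, B and C of a three-leaf tree,
    computed from the three pairwise proportions of differing sites."""
    d_ab = two_state_distance(p_ab)
    d_ac = two_state_distance(p_ac)
    d_bc = two_state_distance(p_bc)
    t_a = (d_ab + d_ac - d_bc) / 2.0
    t_b = (d_ab + d_bc - d_ac) / 2.0
    t_c = (d_ac + d_bc - d_ab) / 2.0
    return t_a, t_b, t_c
```

The model has three free parameters and, because of its symmetry, the site patterns on three leaves carry three independent frequencies, so the pairwise comparisons are enough to determine the fit in this case.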
Subjects
Molecular Evolution, Genetic Models, Likelihood Functions, Phylogeny
ABSTRACT
Many hypotheses in the field of phylogenetic comparative biology involve specific changes in the rate or process of trait evolution. This is particularly true of approaches designed to connect macroevolutionary pattern to microevolutionary process. We present a method designed to test whether the rate of evolution of a discrete character has changed in one or more clades, lineages, or time periods. This method differs from other related approaches (such as the 'covarion' model) in that the 'regimes' in which the rate or process is postulated to have changed are specified a priori by the user, rather than inferred from the data. Similarly, it differs from methods designed to model a correlation between two binary traits in that the regimes mapped onto the tree are fixed. We apply our method to investigate the rate of dewlap color and/or caudal vertebra number evolution in Caribbean and mainland clades of the diverse lizard genus Anolis. We find little evidence to support any difference in the evolutionary process between mainland and island evolution for either character. We also examine the statistical properties of the method more generally and show that it has acceptable type I error, parameter estimation, and power. Finally, we discuss some general issues of frequentist hypothesis testing and model adequacy, as well as the relationship of our method to existing models of heterogeneity in the rate of discrete character evolution on phylogenies.
ABSTRACT
Estimating parameters of amino acid substitution models is a crucial task in bioinformatics. The maximum likelihood (ML) approach has been proposed to estimate amino acid substitution models from large datasets. The quality of newly estimated models is normally assessed by comparing with the existing models in building ML trees. Two important questions that remain are the correlation of the estimated models with the true models and the required size of the training datasets to estimate reliable models. In this article, we performed a simulation study to answer these two questions based on simulated data. We simulated genome datasets with different numbers of genes/alignments based on predefined models (called true models) and predefined trees (called true trees). The simulated datasets were used to estimate amino acid substitution models using ML estimation methods. Our experiments showed that models estimated by the ML methods from simulated datasets with more than 100 genes have high correlations with the true models. The estimated models performed well in building ML trees in comparison with the true models. The results suggest that amino acid substitution models estimated by the ML methods from large genome datasets are a reliable tool for analyzing amino acid sequences.
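The comparison between an estimated and a true model can be illustrated with a Pearson correlation over the off-diagonal exchangeabilities, as sketched below. The abstract does not say which correlation measure or which model parameters were compared, so the choice of Pearson correlation, the use of the upper triangle, and the function names are assumptions.

```python
import math
from typing import List


def upper_triangle(m: List[List[float]]) -> List[float]:
    """Flatten the strictly upper triangle of a symmetric matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def model_correlation(est: List[List[float]], true: List[List[float]]) -> float:
    """Correlation between estimated and true exchangeability matrices."""
    return pearson(upper_triangle(est), upper_triangle(true))
```

For a 20-state amino acid model the upper triangle holds 190 exchangeabilities, so the correlation summarises agreement across all of them in a single number.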