1.
Proc Natl Acad Sci U S A ; 121(3): e2318989121, 2024 Jan 16.
Article in English | MEDLINE | ID: mdl-38215186

ABSTRACT

The continuous-time Markov chain (CTMC) is the mathematical workhorse of evolutionary biology. Learning CTMC model parameters using modern, gradient-based methods requires the derivative of the matrix exponential evaluated at the CTMC's infinitesimal generator (rate) matrix. Motivated by the derivative's extreme computational complexity as a function of state space cardinality, recent work demonstrates the surprising effectiveness of a naive, first-order approximation for a host of problems in computational biology. In response to this empirical success, we obtain rigorous deterministic and probabilistic bounds for the error accrued by the naive approximation and establish a "blessing of dimensionality" result that is universal for a large class of rate matrices with random entries. Finally, we apply the first-order approximation within surrogate-trajectory Hamiltonian Monte Carlo for the analysis of the early spread of Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) across 44 geographic regions that comprise a state space of unprecedented dimensionality for unstructured (flexible) CTMC models within evolutionary biology.
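
For readers who want to experiment with the idea, the sketch below compares the exact directional (Fréchet) derivative of the matrix exponential, computed with scipy.linalg.expm_frechet, against one naive first-order surrogate of the form t·expm(Qt)·E. The toy rate matrix, perturbation direction, and the surrogate's exact form are illustrative assumptions and are not claimed to reproduce the paper's construction.

```python
# Minimal, illustrative sketch (not the paper's method): compare the exact
# Frechet (directional) derivative of the matrix exponential with a naive
# first-order surrogate for a small CTMC rate matrix.
import numpy as np
from scipy.linalg import expm_frechet

rng = np.random.default_rng(0)
n, t = 5, 0.7                                  # toy state-space size and branch length

# Random rate matrix: nonnegative off-diagonals, rows summing to zero.
Q = rng.random((n, n))
np.fill_diagonal(Q, 0.0)
np.fill_diagonal(Q, -Q.sum(axis=1))

# Perturbation direction: bump one off-diagonal rate (and its diagonal).
E = np.zeros((n, n))
E[0, 1], E[0, 0] = 1.0, -1.0

# Exact directional derivative d/d(eps) expm((Q + eps*E) * t) at eps = 0.
P, dP_exact = expm_frechet(Q * t, E * t)

# Naive surrogate (assumed form): expm((Q+eps*E)t) ~ expm(Qt) expm(eps*E*t)
# ~ expm(Qt)(I + eps*E*t), whose eps-derivative is t * expm(Qt) @ E.
dP_naive = t * P @ E

rel_err = np.linalg.norm(dP_exact - dP_naive) / np.linalg.norm(dP_exact)
print(f"relative error of naive surrogate: {rel_err:.3f}")
```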


Subjects
COVID-19, SARS-CoV-2, Humans, Algorithms, COVID-19/epidemiology, Markov Chains
2.
Syst Biol ; 2024 May 07.
Article in English | MEDLINE | ID: mdl-38712512

ABSTRACT

Phylogenetic and discrete-trait evolutionary inference depend heavily on an appropriate characterization of the underlying character substitution process. In this paper, we present random-effects substitution models that extend common continuous-time Markov chain models into a richer class of processes capable of capturing a wider variety of substitution dynamics. As these random-effects substitution models often require many more parameters than their usual counterparts, inference can be both statistically and computationally challenging. Thus, we also propose an efficient approach to compute an approximation to the gradient of the data likelihood with respect to all unknown substitution model parameters. We demonstrate that this approximate gradient enables scaling of sampling-based inference, namely Bayesian inference via Hamiltonian Monte Carlo, under random-effects substitution models across large trees and state-spaces. Applied to a dataset of 583 SARS-CoV-2 sequences, an HKY model with random effects shows strong signals of nonreversibility in the substitution process, and posterior predictive model checks clearly show that it is a more adequate model than a reversible model. When analyzing the pattern of phylogeographic spread of 1441 influenza A virus (H3N2) sequences between 14 regions, a random-effects phylogeographic substitution model infers that air travel volume adequately predicts almost all dispersal rates. A random-effects state-dependent substitution model reveals no evidence for an effect of arboreality on the swimming mode in the tree frog subfamily Hylinae. Simulations reveal that random-effects substitution models can accommodate both negligible and radical departures from the underlying base substitution model. We show that our gradient-based inference approach is over an order of magnitude more time efficient than conventional approaches.
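
As a loose illustration of the kind of model described (not the authors' implementation), the snippet below builds an HKY-style rate matrix, perturbs each off-diagonal rate by a log-scale random effect, and computes finite-time transition probabilities via the matrix exponential. The base frequencies, transition/transversion ratio, random-effect scale, and branch length are assumed values chosen only for the example.

```python
# Illustrative sketch: HKY-style rate matrix with log-scale random effects on
# each exchange rate, and its finite-time transition probabilities.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
bases = ["A", "C", "G", "T"]
pi = np.array([0.3, 0.2, 0.25, 0.25])          # stationary base frequencies (assumed)
kappa = 4.0                                    # transition/transversion ratio (assumed)
transitions = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

eps = rng.normal(scale=0.3, size=(4, 4))       # random effects on the log-rate scale

Q = np.zeros((4, 4))
for i, x in enumerate(bases):
    for j, y in enumerate(bases):
        if i == j:
            continue
        rate = pi[j] * (kappa if (x, y) in transitions else 1.0)
        Q[i, j] = rate * np.exp(eps[i, j])     # multiplicative random effect
np.fill_diagonal(Q, -Q.sum(axis=1))

P = expm(Q * 0.1)                              # transition probabilities over branch length 0.1
print(np.round(P, 3))
print(P.sum(axis=1))                           # rows sum to one
```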

3.
PLoS Comput Biol ; 19(8): e1011419, 2023 08.
Article in English | MEDLINE | ID: mdl-37639445

ABSTRACT

Inferring dependencies between mixed-type biological traits while accounting for evolutionary relationships between specimens is of great scientific interest yet remains infeasible when trait and specimen counts grow large. The state-of-the-art approach uses a phylogenetic multivariate probit model to accommodate binary and continuous traits via a latent variable framework, and utilizes an efficient bouncy particle sampler (BPS) to tackle the computational bottleneck: integrating many latent variables from a high-dimensional truncated normal distribution. This approach breaks down as the number of specimens grows and fails to reliably characterize conditional dependencies between traits. Here, we propose an inference pipeline for phylogenetic probit models that greatly outperforms BPS. The novelty lies in 1) a combination of the recent Zigzag Hamiltonian Monte Carlo (Zigzag-HMC) with linear-time gradient evaluations and 2) a joint sampling scheme for highly correlated latent variables and correlation matrix elements. In an application exploring HIV-1 evolution from 535 viruses, the inference requires joint sampling from an 11,235-dimensional truncated normal and a 24-dimensional covariance matrix. Our method yields a 5-fold speedup compared to BPS and makes it possible to learn partial correlations between candidate viral mutations and virulence. Computational speedup now enables us to tackle even larger problems: we study the evolution of influenza H1N1 glycosylations on around 900 viruses. For broader applicability, we extend the phylogenetic probit model to incorporate categorical traits, and demonstrate its use to study Aquilegia flower and pollinator co-evolution.


Subjects
Influenza A Virus Subtype H1N1, Bayes Theorem, Influenza A Virus Subtype H1N1/genetics, Phylogeny, Flowers, Glycosylation
4.
Bioinformatics ; 38(7): 1846-1856, 2022 03 28.
Article in English | MEDLINE | ID: mdl-35040956

ABSTRACT

SUMMARY: Mutations sometimes increase contagiousness for evolving pathogens. During an epidemic, scientists use viral genome data to infer a shared evolutionary history and connect this history to geographic spread. We propose a model that directly relates a pathogen's evolution to its spatial contagion dynamics, effectively combining the two epidemiological paradigms of phylogenetic inference and self-exciting process modeling, and apply this phylogenetic Hawkes process to a Bayesian analysis of 23,421 viral cases from the 2014-2016 Ebola outbreak in West Africa. The proposed model is able to detect individual viruses with significantly elevated rates of spatiotemporal propagation for a subset of 1610 samples that provide genome data. Finally, to facilitate model application in big data settings, we develop massively parallel implementations for the gradient and Hessian of the log-likelihood and apply our high-performance computing framework within an adaptively pre-conditioned Hamiltonian Monte Carlo routine. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Ebola Virus Disease, Humans, Bayes Theorem, Phylogeny, Disease Outbreaks, Viral Genome
5.
Sci Rep ; 14(1): 8848, 2024 04 17.
Article in English | MEDLINE | ID: mdl-38632390

ABSTRACT

UK Biobank is a large-scale epidemiological resource for investigating prospective associations of various lifestyle, environmental, and genetic factors with health and disease progression. In addition to individual subject information obtained through surveys and physical examinations, a comprehensive neuroimaging battery consisting of multiple modalities provides imaging-derived phenotypes (IDPs) that can serve as biomarkers in neuroscience research. In this study, we augment the existing set of UK Biobank neuroimaging structural IDPs, obtained from well-established software libraries such as FSL and FreeSurfer, with related measurements acquired through the Advanced Normalization Tools Ecosystem. This includes previously established cortical and subcortical measurements defined, in part, based on the Desikan-Killiany-Tourville atlas. Also included are morphological measurements from two recent developments: medial temporal lobe parcellation of hippocampal and extra-hippocampal regions in addition to cerebellum parcellation and thickness based on the Schmahmann anatomical labeling. Through predictive modeling, we assess the clinical utility of these IDP measurements, individually and in combination, using commonly studied phenotypic correlates including age, fluid intelligence, numeric memory, and several other sociodemographic variables. The predictive accuracy of these IDP-based models, in terms of root-mean-squared error or area under the curve for continuous and categorical variables, respectively, provides comparative insights between software libraries as well as potential clinical interpretability. Results demonstrate varied performance between package-based IDP sets and their combination, emphasizing the need for careful consideration in their selection and utilization.
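
The predictive-modeling comparison described here could be prototyped along the following lines. This is a generic sketch only: the synthetic data, column roles, and model choices are assumptions and do not reflect the study's actual pipeline.

```python
# Generic sketch of IDP-based predictive modeling with RMSE (continuous target)
# and AUC (binary target); data are synthetic stand-ins for real IDPs.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import mean_squared_error, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 1000, 50                                # subjects x imaging-derived phenotypes
X = rng.normal(size=(n, p))                    # stand-in IDP matrix
age = X @ rng.normal(size=p) + rng.normal(scale=2.0, size=n)   # continuous target
sex = (X[:, 0] + rng.normal(size=n) > 0).astype(int)           # binary target

X_tr, X_te, age_tr, age_te, sex_tr, sex_te = train_test_split(
    X, age, sex, test_size=0.2, random_state=0)

rmse = mean_squared_error(age_te, Ridge().fit(X_tr, age_tr).predict(X_te)) ** 0.5
auc = roc_auc_score(
    sex_te,
    LogisticRegression(max_iter=1000).fit(X_tr, sex_tr).predict_proba(X_te)[:, 1])
print(f"RMSE (continuous target): {rmse:.2f}   AUC (binary target): {auc:.3f}")
```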


Subjects
Biological Specimen Banks, UK Biobank, Ecosystem, Prospective Studies, Neuroimaging/methods, Phenotype, Magnetic Resonance Imaging/methods, Brain
6.
J Multivar Anal ; 194, 2023 Mar.
Article in English | MEDLINE | ID: mdl-37799825

ABSTRACT

We present the simplicial sampler, a class of parallel MCMC methods that generate and choose from multiple proposals at each iteration. The algorithm's multiproposal randomly rotates a simplex connected to the current Markov chain state in a way that inherently preserves symmetry between proposals. As a result, the simplicial sampler leads to a simplified acceptance step: it simply chooses from among the simplex nodes with probability proportional to their target density values. We also investigate a multivariate Gaussian-based symmetric multiproposal mechanism and prove that it also enjoys the same simplified acceptance step. This insight leads to significant theoretical and practical speedups. While both algorithms enjoy natural parallelizability, we show that conventional implementations are sufficient to confer efficiency gains across an array of dimensions and a number of target distributions.
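
To give a flavor of the simplified acceptance step described above, the schematic sketch below generates candidate points as vertices of a randomly rotated regular simplex containing the current state and then selects among them with probability proportional to their target density. The anchoring, scaling, and bookkeeping details are assumptions for illustration; this is not a verified reproduction of the paper's transition kernel.

```python
# Schematic multiproposal step: candidates are vertices of a randomly rotated
# regular simplex that includes the current state, and the next state is chosen
# with probability proportional to the target density at each vertex.
import numpy as np

def regular_simplex(d):
    """Vertices of a unit regular simplex centered at the origin, shape (d+1, d)."""
    E = np.eye(d + 1) - 1.0 / (d + 1)
    B = np.linalg.svd(E)[2][:d].T               # basis of the d-dim subspace the vertices span
    V = E @ B
    return V / np.linalg.norm(V[0])

def multiproposal_step(x, log_target, scale, rng):
    d = x.size
    V = regular_simplex(d)
    R = np.linalg.qr(rng.normal(size=(d, d)))[0]           # random orthogonal matrix
    candidates = x + scale * (V - V[0]) @ R.T               # vertex 0 sits exactly at x
    logp = log_target(candidates)
    w = np.exp(logp - logp.max())
    return candidates[rng.choice(d + 1, p=w / w.sum())]

rng = np.random.default_rng(2)
log_target = lambda z: -0.5 * np.sum(z**2, axis=-1)         # standard normal target
x = np.zeros(3)
samples = []
for _ in range(5000):
    x = multiproposal_step(x, log_target, scale=1.0, rng=rng)
    samples.append(x)
samples = np.array(samples)
print("sample mean:", np.round(samples.mean(axis=0), 2))    # rough sanity check only
```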

7.
J Comput Graph Stat ; 32(4): 1402-1415, 2023.
Article in English | MEDLINE | ID: mdl-38127472

ABSTRACT

We propose a novel hybrid quantum computing strategy for parallel MCMC algorithms that generate multiple proposals at each step. This strategy makes the rate-limiting step within parallel MCMC amenable to quantum parallelization by using the Gumbel-max trick to turn the generalized accept-reject step into a discrete optimization problem. When combined with new insights from the parallel MCMC literature, such an approach allows us to embed target density evaluations within a well-known extension of Grover's quantum search algorithm. Letting P denote the number of proposals in a single MCMC iteration, the combined strategy reduces the number of target evaluations required from 𝒪(P) to 𝒪(P^(1/2)). In the following, we review the rudiments of quantum computing, quantum search and the Gumbel-max trick in order to elucidate their combination for as wide a readership as possible.
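
The classical Gumbel-max trick referenced above can be demonstrated in a few lines: adding independent Gumbel noise to log-weights and taking the argmax is equivalent to sampling from the corresponding categorical distribution, which recasts the selection among proposals as a discrete optimization problem. The sketch below (plain NumPy, no quantum component) checks this equivalence empirically on assumed toy weights.

```python
# Gumbel-max trick: argmax_i (log w_i + G_i), with G_i ~ Gumbel(0, 1) i.i.d.,
# is a draw from Categorical(w / sum(w)). Categorical sampling (here standing in
# for a generalized accept-reject choice among proposals) becomes an argmax.
import numpy as np

rng = np.random.default_rng(0)
log_w = rng.normal(size=6)                      # unnormalized log-weights of 6 "proposals"
probs = np.exp(log_w - log_w.max())
probs /= probs.sum()

n_draws = 200_000
gumbel = rng.gumbel(size=(n_draws, log_w.size))
picks = np.argmax(log_w + gumbel, axis=1)       # one categorical draw per row

freq = np.bincount(picks, minlength=log_w.size) / n_draws
print(np.round(probs, 4))                       # target probabilities
print(np.round(freq, 4))                        # empirical frequencies (should match)
```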

8.
ArXiv ; 2023 Sep 25.
Article in English | MEDLINE | ID: mdl-36994154

ABSTRACT

Phylogenetic and discrete-trait evolutionary inference depend heavily on an appropriate characterization of the underlying character substitution process. In this paper, we present random-effects substitution models that extend common continuous-time Markov chain models into a richer class of processes capable of capturing a wider variety of substitution dynamics. As these random-effects substitution models often require many more parameters than their usual counterparts, inference can be both statistically and computationally challenging. Thus, we also propose an efficient approach to compute an approximation to the gradient of the data likelihood with respect to all unknown substitution model parameters. We demonstrate that this approximate gradient enables scaling of sampling-based inference, namely Bayesian inference via Hamiltonian Monte Carlo, under random-effects substitution models across large trees and state-spaces. Applied to a dataset of 583 SARS-CoV-2 sequences, an HKY model with random effects shows strong signals of nonreversibility in the substitution process, and posterior predictive model checks clearly show that it is a more adequate model than a reversible model. When analyzing the pattern of phylogeographic spread of 1441 influenza A virus (H3N2) sequences between 14 regions, a random-effects phylogeographic substitution model infers that air travel volume adequately predicts almost all dispersal rates. A random-effects state-dependent substitution model reveals no evidence for an effect of arboreality on the swimming mode in the tree frog subfamily Hylinae. Simulations reveal that random-effects substitution models can accommodate both negligible and radical departures from the underlying base substitution model. We show that our gradient-based inference approach is over an order of magnitude more time efficient than conventional approaches.

9.
Res Sq ; 2023 Oct 30.
Article in English | MEDLINE | ID: mdl-37961236

ABSTRACT

UK Biobank is a large-scale epidemiological resource for investigating prospective associations of various lifestyle, environmental, and genetic factors with health and disease progression. In addition to individual subject information obtained through surveys and physical examinations, a comprehensive neuroimaging battery consisting of multiple modalities provides imaging-derived phenotypes (IDPs) that can serve as biomarkers in neuroscience research. In this study, we augment the existing set of UK Biobank neuroimaging structural IDPs, obtained from well-established software libraries such as FSL and FreeSurfer, with related measurements acquired through the Advanced Normalization Tools Ecosystem. This includes previously established cortical and subcortical measurements defined, in part, based on the Desikan-Killiany-Tourville atlas. Also included are morphological measurements from two recent developments: medial temporal lobe parcellation of hippocampal and extra-hippocampal regions in addition to cerebellum parcellation and thickness based on the Schmahmann anatomical labeling. Through predictive modeling, we assess the clinical utility of these IDP measurements, individually and in combination, using commonly studied phenotypic correlates including age, fluid intelligence, numeric memory, and several other sociodemographic variables. The predictive accuracy of these IDP-based models, in terms of root-mean-squared error or area under the curve for continuous and categorical variables, respectively, provides comparative insights between software libraries as well as potential clinical interpretability. Results demonstrate varied performance between package-based IDP sets and their combination, emphasizing the need for careful consideration in their selection and utilization.

10.
Ann Appl Stat ; 16(1): 573-595, 2022 Mar.
Article in English | MEDLINE | ID: mdl-36211254

ABSTRACT

Self-exciting spatiotemporal Hawkes processes have found increasing use in the study of large-scale public health threats, ranging from gun violence and earthquakes to wildfires and viral contagion. Whereas many such applications feature locational uncertainty, that is, the exact spatial positions of individual events are unknown, most Hawkes model analyses to date have ignored spatial coarsening present in the data. Three particular 21st century public health crises (urban gun violence, rural wildfires and global viral spread) present qualitatively and quantitatively varying uncertainty regimes that exhibit: (a) different collective magnitudes of spatial coarsening, (b) uniform and mixed magnitude coarsening, (c) differently shaped uncertainty regions and, less conventionally, (d) locational data distributed within the "wrong" effective space. We explicitly model such uncertainties in a Bayesian manner and jointly infer unknown locations together with all parameters of a reasonably flexible Hawkes model, obtaining results that are practically and statistically distinct from those obtained while ignoring spatial coarsening. This work also features two secondary contributions: first, to facilitate Bayesian inference of locations and background rate parameters, we make a subtle yet crucial change to an established kernel-based rate model, and second, to facilitate the same Bayesian inference at scale, we develop a massively parallel implementation of the model's log-likelihood gradient with respect to locations and thus avoid its quadratic computational cost in the context of Hamiltonian Monte Carlo. Our examples involve thousands of observations and allow us to demonstrate practicality at moderate scales.

11.
Methods Ecol Evol ; 13(10): 2181-2197, 2022 Oct.
Article in English | MEDLINE | ID: mdl-36908682

ABSTRACT

Biological phenotypes are products of complex evolutionary processes in which selective forces influence multiple biological trait measurements in unknown ways. Phylogenetic comparative methods seek to disentangle these relationships across the evolutionary history of a group of organisms. Unfortunately, most existing methods fail to accommodate high-dimensional data with dozens or even thousands of observations per taxon. Phylogenetic factor analysis offers a solution to the challenge of dimensionality. However, scientists seeking to employ this modeling framework confront numerous modeling and implementation decisions, the details of which pose computational and replicability challenges. We develop new inference techniques that increase both the computational efficiency and modeling flexibility of phylogenetic factor analysis. To facilitate adoption of these new methods, we present a practical analysis plan that guides researchers through the web of complex modeling decisions. We codify this analysis plan in an automated pipeline that distills the potentially overwhelming array of decisions into a small handful of (typically binary) choices. We demonstrate the utility of these methods and analysis plan in four real-world problems of varying scales. Specifically, we study floral phenotype and pollination in columbines, domestication in industrial yeast, life history in mammals, and brain morphology in New World monkeys. General and impactful community employment of these methods requires a data scientific analysis plan that balances flexibility, speed and ease of use, while minimizing model and algorithm tuning. Even in the presence of non-trivial phylogenetic model constraints, we show that one may analytically address latent factor uncertainty in a way that (a) aids model flexibility, (b) accelerates computation (by as much as 500-fold) and (c) decreases required tuning. These efforts coalesce to create an accessible Bayesian approach to high-dimensional phylogenetic comparative methods on large trees.

12.
Stat Comput ; 31(1), 2021 Jan.
Article in English | MEDLINE | ID: mdl-34354329

ABSTRACT

The Hawkes process and its extensions effectively model self-excitatory phenomena including earthquakes, viral pandemics, financial transactions, neural spike trains and the spread of memes through social networks. The usefulness of these stochastic process models within a host of economic sectors and scientific disciplines is undercut by the processes' computational burden: complexity of likelihood evaluations grows quadratically in the number of observations for both the temporal and spatiotemporal Hawkes processes. We show that, with care, one may parallelize these calculations using both central and graphics processing unit implementations to achieve over 100-fold speedups over single-core processing. Using a simple adaptive Metropolis-Hastings scheme, we apply our high-performance computing framework to a Bayesian analysis of big gunshot data generated in Washington D.C. between the years of 2006 and 2019, thereby extending a past analysis of the same data from under 10,000 to over 85,000 observations. To encourage widespread use, we provide hpHawkes, an open-source R package, and discuss high-level implementation and program design for leveraging aspects of computational hardware that become necessary in a big data setting.
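
For orientation, a direct and deliberately un-optimized evaluation of the temporal Hawkes log-likelihood with an exponential triggering kernel is sketched below; the double sum over event pairs makes the quadratic cost discussed above explicit. Parameter values and event times are synthetic assumptions, and the parallel CPU/GPU implementations live in the authors' hpHawkes package, not here.

```python
# Naive O(n^2) log-likelihood of a temporal Hawkes process with intensity
#   lambda(t) = mu + sum_{t_j < t} alpha * beta * exp(-beta * (t - t_j)).
import numpy as np

def hawkes_loglik(times, T, mu, alpha, beta):
    times = np.sort(np.asarray(times))
    diffs = times[:, None] - times[None, :]                 # t_i - t_j for all pairs
    masked = np.where(diffs > 0, diffs, np.inf)             # keep only past events
    excite = alpha * beta * np.exp(-beta * masked)          # exp(-inf) = 0 elsewhere
    intensities = mu + excite.sum(axis=1)                   # lambda(t_i): O(n^2) work
    compensator = mu * T + alpha * np.sum(1.0 - np.exp(-beta * (T - times)))
    return np.sum(np.log(intensities)) - compensator

rng = np.random.default_rng(0)
T = 100.0
times = np.sort(rng.uniform(0.0, T, size=500))              # stand-in event times
print(hawkes_loglik(times, T, mu=0.5, alpha=0.4, beta=1.0))
```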

13.
J Comput Graph Stat ; 30(1): 11-24, 2021.
Article in English | MEDLINE | ID: mdl-34168419

ABSTRACT

Big Bayes is the computationally intensive co-application of big data and large, expressive Bayesian models for the analysis of complex phenomena in scientific inference and statistical learning. Standing as an example, Bayesian multidimensional scaling (MDS) can help scientists learn viral trajectories through space-time, but its computational burden prevents its wider use. Crucial MDS model calculations scale quadratically in the number of observations. We partially mitigate this limitation through massive parallelization using multi-core central processing units, instruction-level vectorization and graphics processing units (GPUs). Fitting the MDS model using Hamiltonian Monte Carlo, GPUs can deliver more than 100-fold speedups over serial calculations and thus extend Bayesian MDS to a big data setting. To illustrate, we employ Bayesian MDS to infer the rate at which different seasonal influenza virus subtypes use worldwide air traffic to spread around the globe. We examine 5392 viral sequences and their associated 14 million pairwise distances arising from the number of commercial airline seats per year between viral sampling locations. To adjust for shared evolutionary history of the viruses, we implement a phylogenetic extension to the MDS model and learn that subtype H3N2 spreads most effectively, consistent with its epidemic success relative to other seasonal influenza subtypes. Finally, we provide MassiveMDS, an open-source, stand-alone C++ library and rudimentary R package, and discuss program design and high-level implementation with an emphasis on important aspects of computing architecture that become relevant at scale.
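
As a rough sketch of the model's dominant computation (not the MassiveMDS implementation), the snippet below evaluates a Gaussian log-likelihood of observed dissimilarities against the Euclidean distances implied by latent low-dimensional locations. The error scale, latent dimension, and synthetic data are assumptions, and the full Bayesian MDS model's truncation to positive distances and priors on locations are omitted.

```python
# Sketch of the O(n^2) kernel at the heart of Bayesian MDS: compare observed
# dissimilarities D_obs with distances among latent locations X under a
# Gaussian error model (truncation and priors omitted).
import numpy as np

def mds_loglik(X, D_obs, sigma):
    diff = X[:, None, :] - X[None, :, :]
    D_lat = np.sqrt((diff ** 2).sum(axis=-1))                # latent pairwise distances
    iu = np.triu_indices(X.shape[0], k=1)                    # each pair counted once
    resid = D_obs[iu] - D_lat[iu]
    return -0.5 * np.sum(resid ** 2) / sigma**2 - resid.size * np.log(sigma)

rng = np.random.default_rng(0)
n, d = 200, 2
X_true = rng.normal(size=(n, d))
D_obs = np.sqrt(((X_true[:, None] - X_true[None, :]) ** 2).sum(-1))
D_obs += rng.normal(scale=0.05, size=(n, n))
D_obs = (D_obs + D_obs.T) / 2                                # keep dissimilarities symmetric

print(mds_loglik(X_true, D_obs, sigma=0.05))
```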

14.
Sci Rep ; 11(1): 9068, 2021 04 27.
Article in English | MEDLINE | ID: mdl-33907199

ABSTRACT

The Advanced Normalization Tools ecosystem, known as ANTsX, consists of multiple open-source software libraries which house top-performing algorithms used worldwide by scientific and research communities for processing and analyzing biological and medical imaging data. The base software library, ANTs, is built upon, and contributes to, the NIH-sponsored Insight Toolkit. Founded in 2008 with the highly regarded Symmetric Normalization image registration framework, the ANTs library has since grown to include additional functionality. Recent enhancements include statistical, visualization, and deep learning capabilities through interfacing with both the R statistical project (ANTsR) and Python (ANTsPy). Additionally, the corresponding deep learning extensions ANTsRNet and ANTsPyNet (built on the popular TensorFlow/Keras libraries) contain several popular network architectures and trained models for specific applications. One such comprehensive application is a deep learning analog for generating cortical thickness data from structural T1-weighted brain MRI, both cross-sectionally and longitudinally. These pipelines significantly improve computational efficiency and provide comparable-to-superior accuracy over multiple criteria relative to the existing ANTs workflows and simultaneously illustrate the importance of the comprehensive ANTsX approach as a framework for medical image analysis.
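
As a small usage-level illustration of the ecosystem, a pairwise Symmetric Normalization registration in ANTsPy might look roughly like the sketch below. The file paths are placeholders and the exact argument and output-key names should be checked against the current ANTsPy documentation rather than taken as authoritative.

```python
# Rough ANTsPy usage sketch: pairwise SyN (Symmetric Normalization) registration.
# Paths are placeholders; verify argument names against the ANTsPy docs.
import ants

fixed = ants.image_read("template_T1w.nii.gz")      # placeholder template image
moving = ants.image_read("subject_T1w.nii.gz")      # placeholder subject image

reg = ants.registration(fixed=fixed, moving=moving, type_of_transform="SyN")
warped = reg["warpedmovout"]                        # subject resampled into template space

# Reuse the estimated transforms on another image from the same subject.
warped_again = ants.apply_transforms(
    fixed=fixed, moving=moving, transformlist=reg["fwdtransforms"])

ants.image_write(warped, "subject_in_template_space.nii.gz")
```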


Subjects
Algorithms, Brain/anatomy & histology, Ecosystem, Computer-Assisted Image Processing/methods, Magnetic Resonance Imaging/methods, Neuroimaging/methods, Adult, Aged, Humans, Male, Middle Aged, Software
15.
Alzheimers Dement (Amst) ; 12(1): e12068, 2020.
Article in English | MEDLINE | ID: mdl-32875052

ABSTRACT

INTRODUCTION: Loss of entorhinal cortex (EC) layer II neurons represents the earliest Alzheimer's disease (AD) lesion in the brain. Research suggests differing functional roles between two EC subregions, the anterolateral EC (aLEC) and the posteromedial EC (pMEC). METHODS: We use joint label fusion to obtain aLEC and pMEC cortical thickness measurements from serial magnetic resonance imaging scans of 775 ADNI-1 participants (219 healthy; 380 mild cognitive impairment; 176 AD) and use linear mixed-effects models to analyze longitudinal associations among cortical thickness, disease status, and cognitive measures. RESULTS: Group status is reliably predicted by aLEC thickness, which also exhibits greater associations with cognitive outcomes than does pMEC thickness. Change in aLEC thickness is also associated with cerebrospinal fluid amyloid and tau levels. DISCUSSION: Thinning of aLEC is a sensitive structural biomarker that changes over short durations in the course of AD and tracks disease severity; it is a strong candidate biomarker for detection of early AD.

16.
J Alzheimers Dis ; 71(1): 165-183, 2019.
Article in English | MEDLINE | ID: mdl-31356207

ABSTRACT

Longitudinal studies of development and disease in the human brain have motivated the acquisition of large neuroimaging data sets and the concomitant development of robust methodological and statistical tools for quantifying neurostructural changes. Longitudinal-specific strategies for acquisition and processing have potentially significant benefits including more consistent estimates of intra-subject measurements while retaining predictive power. Using the first phase of the Alzheimer's Disease Neuroimaging Initiative (ADNI-1) data, comprising over 600 subjects with multiple time points from baseline to 36 months, we evaluate the utility of longitudinal FreeSurfer and Advanced Normalization Tools (ANTs) surrogate thickness values in the context of a linear mixed-effects (LME) modeling strategy. Specifically, we estimate the residual variability and between-subject variability associated with each processing stream, as it is known from the statistical literature that minimizing the former while simultaneously maximizing the latter leads to greater scientific interpretability in terms of tighter confidence intervals in calculated mean trends, smaller prediction intervals, and narrower confidence intervals for determining cross-sectional effects. This strategy is evaluated over the entire cortex, as defined by the Desikan-Killiany-Tourville labeling protocol, where comparisons are made with the cross-sectional and longitudinal FreeSurfer processing streams. Subsequent linear mixed-effects modeling for identifying diagnostic groupings within the ADNI cohort is provided as supporting evidence for the utility of the proposed ANTs longitudinal framework, which provides unbiased structural neuroimage processing and competitive-to-superior power for detecting longitudinal structural change.
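
The variance decomposition discussed above can be prototyped with an off-the-shelf mixed-effects fit. The sketch below uses synthetic data, hypothetical column names, and statsmodels rather than the study's own pipeline; it simply shows how the between-subject and residual variance components can be read off a random-intercept LME fit.

```python
# Sketch of a longitudinal LME fit: random intercept per subject, fixed effect
# of time, then read off between-subject vs. residual variance components.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subj = 100
subj = np.repeat(np.arange(n_subj), 4)                         # 4 visits per subject
months = np.tile([0, 12, 24, 36], n_subj)
subj_effect = rng.normal(scale=0.15, size=n_subj)[subj]        # between-subject variability
thickness = 2.5 - 0.004 * months + subj_effect + rng.normal(scale=0.05, size=subj.size)

df = pd.DataFrame({"subject": subj, "months": months, "thickness": thickness})
fit = smf.mixedlm("thickness ~ months", df, groups=df["subject"]).fit()

print(fit.summary())
print("between-subject variance:", float(fit.cov_re.iloc[0, 0]))
print("residual variance:", fit.scale)
```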


Subjects
Alzheimer Disease/diagnostic imaging, Biomarkers, Brain/diagnostic imaging, Brain/pathology, Cross-Sectional Studies, Disease Progression, Female, Humans, Linear Models, Longitudinal Studies, Male, Neuroimaging