ABSTRACT
The compositional and thermal state of Earth's mantle provides critical constraints on the origin, evolution, and dynamics of Earth. However, the chemical composition and thermal structure of the lower mantle are still poorly understood. In particular, the nature and origin of the two large low-shear-velocity provinces (LLSVPs) observed seismologically in the lowermost mantle are still debated. In this study, we inverted for the 3D chemical composition and thermal state of the lower mantle from seismic tomography and mineral elasticity data using a Markov chain Monte Carlo framework. The results show a silica-enriched lower mantle with an Mg/Si ratio below ~1.16, lower than that of the pyrolitic upper mantle (Mg/Si = 1.3). The lateral temperature distributions can be described by a Gaussian with a standard deviation (SD) of 120 to 140 K at 800 to 1,600 km depth, increasing to 250 K at 2,200 km depth; in the lowermost mantle, however, the lateral distribution is no longer Gaussian. We found that velocity heterogeneities in the upper lower mantle mainly result from thermal anomalies, whereas those in the lowermost mantle mainly result from compositional or phase variations. Relative to the ambient mantle, the LLSVPs are denser at their base but less dense above a depth of ~2,700 km. The LLSVPs are found to be ~500 K hotter and richer in bridgmanite and iron than the ambient mantle, supporting the hypothesis that they originate from an ancient basal magma ocean formed early in Earth's history.
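The MCMC inversion described above can be illustrated, in a highly simplified form, with a one-parameter Metropolis sampler that recovers a temperature anomaly from a fractional shear-velocity perturbation. The sensitivity and noise values below are invented placeholders for illustration, not the study's mineral-physics parameters.

```python
import math
import random

# Toy forward model: a fractional shear-velocity perturbation dlnVs produced
# by a temperature anomaly dT (K). Both constants are hypothetical.
A_SENS = 2.0e-5      # assumed sensitivity d(lnVs)/dT in 1/K
SIGMA_OBS = 2.0e-4   # assumed observational noise on dlnVs

def log_likelihood(dT, dlnVs_obs):
    resid = dlnVs_obs - (-A_SENS * dT)
    return -0.5 * (resid / SIGMA_OBS) ** 2

def metropolis(dlnVs_obs, n_steps=20000, step=50.0, seed=1):
    random.seed(seed)
    dT, samples = 0.0, []
    for _ in range(n_steps):
        prop = dT + random.gauss(0.0, step)
        # Accept with probability min(1, likelihood ratio), in log space
        if math.log(random.random()) < log_likelihood(prop, dlnVs_obs) - log_likelihood(dT, dlnVs_obs):
            dT = prop
        samples.append(dT)
    return samples

# A -1% velocity anomaly maps to roughly +500 K under these assumed numbers.
post = metropolis(-0.01)
mean_dT = sum(post[5000:]) / len(post[5000:])
```

Under the assumed sensitivity, the posterior mean lands near +500 K, the order of magnitude the abstract reports for the LLSVP thermal anomaly.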
ABSTRACT
The role of balancing selection is a long-standing evolutionary puzzle. Balancing selection is a crucial evolutionary process that maintains genetic variation (polymorphism) over extended periods of time; however, detecting it poses a significant challenge. Building upon the polymorphism-aware phylogenetic models (PoMos) framework rooted in the Moran model, we introduce the PoMoBalance model. This novel approach is designed to disentangle the interplay of mutation, genetic drift, and directional selection (GC-biased gene conversion), along with previously unexplored balancing selection pressures, on ultra-long timescales comparable with species divergence times, by analyzing multi-individual genomic and phylogenetic divergence data. Implemented in the open-source Bayesian framework RevBayes, PoMoBalance offers a versatile tool for inferring phylogenetic trees as well as quantifying various selective pressures. The novelty of our approach to studying balancing selection lies in the ability of polymorphism-aware phylogenetic models to account for ancestral polymorphisms and to incorporate parameters that measure frequency-dependent selection, allowing us to determine the strength of the effect and the exact frequencies under selection. We implemented validation tests and assessed the model on data simulated with SLiM and a custom Moran-model simulator. Analysis of real sequences from Drosophila populations reveals insights into the evolutionary dynamics of regions subject to frequency-dependent balancing selection, particularly in the context of sex-limited color dimorphism in Drosophila erecta.
Subjects
Gene Conversion, Genetic Models, Phylogeny, Genetic Polymorphism, Genetic Selection, Animals, Bayes Theorem, Molecular Evolution, Male, Female
ABSTRACT
Biclustering is a useful method for simultaneously grouping samples and features and has been applied across various biomedical data types. However, most existing biclustering methods cannot integratively analyze multi-modal data such as multi-omics data (genome, transcriptome, and epigenome). Moreover, the potential of leveraging biological knowledge represented by graphs, which has been demonstrated to be beneficial in various statistical tasks such as variable selection and prediction, remains largely untapped in the context of biclustering. To address both gaps, we propose a novel Bayesian biclustering method called Bayesian graph-guided biclustering (BGB). Specifically, we introduce a new hierarchical sparsity-inducing prior that effectively incorporates biological graph information, and we establish a unified framework to model multi-view data. We develop an efficient Markov chain Monte Carlo algorithm for posterior sampling and inference. Extensive simulations and real data analysis show that BGB outperforms other popular biclustering methods. Notably, BGB is robust in utilizing biological knowledge and can reveal biologically meaningful information from heterogeneous multi-modal data.
Subjects
Algorithms, Multiomics, Bayes Theorem, Cluster Analysis, Transcriptome
ABSTRACT
Phylogenies are central to many research areas in biology and are commonly estimated using likelihood-based methods. Unfortunately, any likelihood-based method, including Bayesian inference, can be restrictively slow for large datasets (with many taxa and/or many sites in the sequence alignment) or complex substitution models. The primary limiting factor in probabilistic phylogenetic analyses of large datasets and/or complex models is the likelihood calculation, which dominates the total computation time. To address this bottleneck, we incorporated the high-performance phylogenetic library BEAGLE into RevBayes, enabling multi-threading on multi-core CPUs and GPUs as well as hardware-specific vectorized instructions for faster likelihood calculations. Our new implementation of RevBayes+BEAGLE retains the flexibility and dynamic nature that users expect from vanilla RevBayes. In addition, we implemented native parallelization within RevBayes, without an external library, using the message passing interface (MPI): RevBayes+MPI. We evaluated our new implementation of RevBayes+BEAGLE using multi-threading on CPUs and two different powerful GPUs (NVIDIA Titan V and NVIDIA A100) against our native implementation of RevBayes+MPI. We found good speedups when multiple cores were used: up to 20-fold with multiple CPU cores and over 90-fold with multiple GPU cores. The improvement depended on the data type (DNA or amino acids) and the size of the alignment, but less on the size of the tree. We additionally investigated the cost of rescaling partial likelihoods to avoid numerical underflow and showed that unnecessarily frequent and inefficient rescaling can increase runtimes up to 4-fold. Finally, we presented and compared a new approach that stores partial likelihoods on branches instead of nodes, which can speed up computations up to 1.7-fold but doubles the memory requirements.
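The rescaling issue discussed above can be made concrete with a toy Felsenstein pruning step under the Jukes-Cantor model: partial likelihoods shrink multiplicatively toward numerical underflow, so implementations periodically divide them by a per-node scaler and accumulate the log of that scaler, leaving the final log-likelihood unchanged. This is an illustrative sketch, not RevBayes or BEAGLE code.

```python
import math

def jc_prob(t):
    """Jukes-Cantor transition probability matrix for branch length t."""
    same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def combine(children):
    """Pruning step: parent partials from (child_partials, branch_length) pairs."""
    out = []
    for i in range(4):
        v = 1.0
        for partials, t in children:
            P = jc_prob(t)
            v *= sum(P[i][j] * partials[j] for j in range(4))
        out.append(v)
    return out

def rescale(partials, log_scale):
    """Divide by the largest partial and accumulate its log to avoid underflow."""
    m = max(partials)
    return [p / m for p in partials], log_scale + math.log(m)

# Two leaves observing states A (index 0) and C (index 1); indicator partials.
leafA = [1.0, 0.0, 0.0, 0.0]
leafC = [0.0, 1.0, 0.0, 0.0]
root = combine([(leafA, 0.1), (leafC, 0.1)])
root_s, log_scale = rescale(root, 0.0)

# Site log-likelihood with uniform root frequencies: identical either way.
ll_plain = math.log(sum(0.25 * p for p in root))
ll_scaled = math.log(sum(0.25 * p for p in root_s)) + log_scale
```

On a two-leaf tree the unscaled version is harmless; on deep trees with thousands of sites the partials underflow without the accumulated log-scalers, which is exactly the cost/benefit trade-off the abstract quantifies.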
Subjects
Bayes Theorem, Phylogeny, Software, Classification/methods, Computational Biology/methods
ABSTRACT
This study investigates the impact of spatio-temporal correlation on food security and nutrition in Africa using four spatio-temporal models: the Spatio-Temporal Poisson Linear Trend Model (SPLTM), the Poisson Temporal Model (TMS), the Spatio-Temporal Poisson ANOVA Model (SPAM), and the Spatio-Temporal Poisson Separable Model (STSM). Evaluating model goodness of fit with the Watanabe-Akaike Information Criterion (WAIC) and assessing bias through root mean square error and mean absolute error revealed a consistent monotonic pattern. SPLTM consistently tends to overestimate food security, while TMS exhibits a mixed bias profile, shifting between overestimation and underestimation under varying correlation settings. SPAM proves the most reliable, showing minimal bias and the lowest WAIC across diverse scenarios, while STSM consistently underestimates food security, particularly in regions with low to moderate spatio-temporal correlation. SPAM consistently outperforms the other models, making it a top choice for modeling food security and nutrition dynamics in Africa. This research highlights the impact of spatial and temporal correlations on food security and nutrition patterns and provides guidance for model selection and refinement. Researchers are encouraged to evaluate carefully the bias and goodness-of-fit characteristics of candidate models, ensuring their alignment with the specific attributes of their data and research goals, so that the selected models offer reliability and consistency and the resulting findings remain broadly applicable.
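For reference, the WAIC used above for model comparison can be computed from an S x N matrix of pointwise log-likelihoods (S posterior draws, N observations). A minimal sketch with made-up numbers, on the deviance scale where lower is better:

```python
import math

def waic(loglik):
    """WAIC = -2 * (lppd - p_waic) from pointwise log-likelihood draws.
    loglik[s][i] is the log-likelihood of observation i under draw s."""
    S, N = len(loglik), len(loglik[0])
    lppd, p_waic = 0.0, 0.0
    for i in range(N):
        col = [loglik[s][i] for s in range(S)]
        # log of the posterior-mean likelihood, computed stably (log-sum-exp)
        m = max(col)
        lppd += m + math.log(sum(math.exp(c - m) for c in col) / S)
        # effective number of parameters: posterior variance of the log-likelihood
        mean = sum(col) / S
        p_waic += sum((c - mean) ** 2 for c in col) / (S - 1)
    return -2.0 * (lppd - p_waic)

# Tiny invented example: 3 posterior draws, 2 observations.
ll = [[-1.0, -2.0],
      [-1.1, -2.2],
      [-0.9, -1.8]]
score = waic(ll)
```

In practice the draws come from the fitted spatio-temporal model's posterior, and the model with the lowest WAIC is preferred, which is the criterion behind the abstract's ranking of SPAM over the alternatives.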
Subjects
Food Security, Africa, Food Security/methods, Spatio-Temporal Analysis, Humans, Computer Simulation, Poisson Distribution
ABSTRACT
The incubation period is of paramount importance in infectious disease epidemiology as it informs the transmission potential of a pathogenic organism and helps to plan public health strategies to keep an epidemic outbreak under control. Estimating the incubation period distribution from reported exposure times and symptom onset times is challenging because the underlying data are coarse. We develop a new Bayesian methodology using Laplacian-P-splines that provides a semi-parametric estimate of the incubation density based on a Langevinized Gibbs sampler. A finite mixture density smoother informs a set of parametric distributions via moment matching, and an information criterion arbitrates between competing candidates. The algorithms underlying our method find a natural nest within the EpiLPS package, which has been extended to cover estimation of incubation times. Various simulation scenarios accounting for different levels of data coarseness are considered, with encouraging results. Applications to real data on COVID-19, MERS, and Mpox yield results in alignment with those of recent studies. The proposed flexible approach is an interesting alternative to classic Bayesian parametric methods for estimating the incubation distribution.
ABSTRACT
We propose a nonparametric compound Poisson model for underreported count data that introduces a latent clustering structure for the reporting probabilities. The reporting probabilities are estimated together with the model's parameters, based on expert opinion and a proxy for the reporting process. The proposed model is used to estimate the prevalence of chronic kidney disease in Apulia, Italy, based on a unique statistical database covering m = 258 municipalities, obtained by integrating multisource register information. Accurate prevalence estimates are needed for monitoring, surveillance, and management purposes; yet counts are deemed to be considerably underreported, especially in some areas of Apulia, one of the most deprived and heterogeneous regions in Italy. Our results agree with previous findings and highlight interesting geographical patterns of the disease. We compare our model to existing approaches in the literature using simulated data as well as real data on early neonatal mortality risk in Brazil described in previous research: the proposed approach proves accurate and particularly suitable when partial information about data quality is available.
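The key identity behind underreported-count models of this kind is that independently thinning a Poisson(lambda) count with reporting probability p yields Poisson(lambda * p) observations, so the true rate is recoverable as (mean observed count) / p once p is pinned down (here, by expert opinion and a proxy for the reporting process). A minimal simulation with invented numbers, unrelated to the Apulia data:

```python
import math
import random

def poisson(rng, lam):
    """Knuth's Poisson sampler; adequate for small lambda."""
    L = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= L:
            return k
        k += 1

def simulate_underreported(lam, p, n, seed=7):
    """Draw true Poisson(lam) counts, then report each case with probability p."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        true_count = poisson(rng, lam)
        out.append(sum(1 for _ in range(true_count) if rng.random() < p))
    return out

# True rate 10, reporting probability 0.6: observed counts behave as Poisson(6).
obs = simulate_underreported(lam=10.0, p=0.6, n=5000)
lam_hat = (sum(obs) / len(obs)) / 0.6
```

The paper's contribution is precisely the hard part this sketch assumes away: p is unknown and heterogeneous, hence the latent clustering structure on the reporting probabilities.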
ABSTRACT
Dynamic Bayesian networks (DBNs) can be used to discover gene regulatory networks (GRNs) from time series gene expression data. Here, we suggest a strategy for learning DBNs from gene expression data by employing a Bayesian approach that is scalable to large networks and targeted at learning models with high predictive accuracy. Our framework can be used to learn DBNs for multiple groups of samples and to highlight differences and similarities in their GRNs. We learn these DBN models under different structural and parametric assumptions and select the optimal model based on cross-validated predictive accuracy. We show in simulation studies that our approach is better equipped to prevent overfitting than techniques used in previous studies. We applied the proposed DBN-based approach to two time series transcriptomic datasets from the Gene Expression Omnibus database, each comprising data from distinct phenotypic groups of the same tissue type. In the first case, we used DBNs to characterize responders and non-responders to anti-cancer therapy. In the second case, we compared normal and tumor cells of colorectal tissue. The classification accuracy reached by the DBN-based classifier was higher than previously reported for both datasets. For the colorectal cancer dataset, our analysis suggested that the GRNs of cancer and normal tissues differ substantially, with the differences most pronounced in the neighborhoods of oncogenes and known cancer tissue markers. The identified differences in the gene networks of cancer and normal cells may be used for the discovery of targeted therapies.
Subjects
Computational Biology, Gene Regulatory Networks, Algorithms, Bayes Theorem, Computational Biology/methods, Computer Simulation
ABSTRACT
Metabolic reaction rates (fluxes) play a crucial role in comprehending cellular phenotypes and are essential in areas such as metabolic engineering, biotechnology, and biomedical research. The state-of-the-art technique for estimating fluxes is metabolic flux analysis using isotopic labelling (13C-MFA), which uses a dataset-model combination to determine the fluxes. Bayesian statistical methods are gaining popularity in the life sciences, but 13C-MFA is still dominated by conventional best-fit approaches. The slow take-up of Bayesian approaches is, at least in part, due to the unfamiliarity of Bayesian methods to metabolic engineering researchers. To address this unfamiliarity, we outline similarities and differences between the two approaches and highlight particular advantages of the Bayesian way of flux analysis. With a real-life example, re-analysing a moderately informative labelling dataset of E. coli, we identify situations in which Bayesian methods are advantageous and more informative, pointing to potential pitfalls of current 13C-MFA evaluation approaches. We propose Bayesian model averaging (BMA) for flux inference as a means of overcoming model uncertainty, through its tendency to assign low probabilities both to models that are unsupported by the data and to models that are overly complex. In this capacity, BMA resembles a tempered Ockham's razor. With the tempered razor as a guide, BMA-based 13C-MFA alleviates model-selection uncertainty and could become a game changer for metabolic engineering by uncovering new insights and inspiring novel approaches.
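The "tempered Ockham's razor" behavior of BMA is easy to see with the standard BIC approximation to posterior model probabilities, p(M | data) proportional to exp(-BIC/2) under equal prior model odds: the BIC already penalizes complexity, and the averaging then downweights poorly fitting models. A generic sketch with invented BIC values, not the paper's 13C-MFA machinery:

```python
import math

def bma_weights(bics):
    """Posterior model probabilities from BIC values via the standard
    approximation p(M | data) proportional to exp(-BIC / 2), assuming
    equal prior probabilities across models."""
    m = min(bics)  # subtract the minimum for numerical stability
    w = [math.exp(-(b - m) / 2.0) for b in bics]
    total = sum(w)
    return [x / total for x in w]

# Hypothetical BICs for three candidate model variants.
weights = bma_weights([100.0, 102.0, 110.0])
```

A model 2 BIC units worse gets its weight cut by a factor of e; a model 10 units worse is all but excluded. Averaged flux estimates then inherit uncertainty from the model choice itself, which is the point of BMA-based 13C-MFA.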
Subjects
Bayes Theorem, Carbon Isotopes, Escherichia coli, Carbon Isotopes/metabolism, Escherichia coli/metabolism, Escherichia coli/genetics, Metabolic Flux Analysis/methods, Biological Models, Metabolic Engineering/methods, Isotope Labeling
ABSTRACT
Bayesian phylogenetics is now facing a critical point. Over the last 20 years, Bayesian methods have reshaped phylogenetic inference and gained widespread popularity due to their high accuracy, the ability to quantify the uncertainty of inferences and the possibility of accommodating multiple aspects of evolutionary processes in the models that are used. Unfortunately, Bayesian methods are computationally expensive, and typical applications involve at most a few hundred sequences. This is problematic in the age of rapidly expanding genomic data and increasing scope of evolutionary analyses, forcing researchers to resort to less accurate but faster methods, such as maximum parsimony and maximum likelihood. Does this spell doom for Bayesian methods? Not necessarily. Here, we discuss some recently proposed approaches that could help scale up Bayesian analyses of evolutionary problems considerably. We focus on two particular aspects: online phylogenetics, where new data sequences are added to existing analyses, and alternatives to Markov chain Monte Carlo (MCMC) for scalable Bayesian inference. We identify 5 specific challenges and discuss how they might be overcome. We believe that online phylogenetic approaches and Sequential Monte Carlo hold great promise and could potentially speed up tree inference by orders of magnitude. We call for collaborative efforts to speed up the development of methods for real-time tree expansion through online phylogenetics.
Subjects
Biological Evolution, Genetic Models, Phylogeny, Bayes Theorem, Monte Carlo Method, Markov Chains
ABSTRACT
BACKGROUND: Over the last few years, a series of statistical guidelines have been proposed for investigating the individual-environment interplay and individual differences in response to environmental exposures, as captured by models of environmental sensitivity including Diathesis-Stress, Differential Susceptibility, and Vantage Sensitivity. However, available solutions suffer from computational problems that are especially relevant when the sample size is not sufficiently large, a common condition in observational and clinical studies. METHOD: In the current contribution, we propose a Bayesian solution for estimating interaction parameters via Markov chain Monte Carlo (MCMC), adapting the Nonlinear Least Squares (NLS) approach of Widaman et al. (Psychological Methods, 17, 2012, 615). RESULTS: Findings from an applied exemplification and a simulation study showed that with relatively large samples both MCMC and NLS estimates converged on the same results. Conversely, with small samples and greater residual variance, MCMC clearly outperformed NLS, resolving estimation problems and providing more accurate estimates. CONCLUSIONS: As the body of research exploring the interplay between individual and environmental variables grows, enabling predictions regarding the form of interaction and the extent of effects, the Bayesian approach could emerge as a feasible and readily applicable solution to numerous computational challenges inherent in existing frequentist methods. This approach holds promise for enhancing the trustworthiness of research outcomes, thereby benefiting clinical and applied understanding.
Subjects
Bayes Theorem, Humans, Individuality, Markov Chains, Monte Carlo Method
ABSTRACT
Survival models are used to analyze time-to-event data in a variety of disciplines. Proportional hazards models provide interpretable parameter estimates, but the proportional hazards assumption is not always appropriate. Non-parametric models are more flexible but often lack a clear inferential framework. We propose a Bayesian treed hazards partition model that is both flexible and inferential. Inference is obtained through the posterior tree structure, and flexibility is preserved by modeling the log-hazard function in each partition using a latent Gaussian process. An efficient reversible jump Markov chain Monte Carlo algorithm is achieved by marginalizing the parameters in each partition element via a Laplace approximation. Consistency properties for the estimator are established. The method can be used to help determine subgroups as well as prognostic and/or predictive biomarkers in time-to-event data. The method is compared with existing methods on simulated data and a liver cirrhosis dataset.
Subjects
Algorithms, Proportional Hazards Models, Bayes Theorem, Markov Chains, Monte Carlo Method
ABSTRACT
There is a growing body of literature on knowledge-guided statistical learning methods for the analysis of structured high-dimensional data (such as genomic and transcriptomic data) that can incorporate knowledge of underlying networks derived from functional genomics and functional proteomics. These methods have been shown to improve variable selection and prediction accuracy and to yield more interpretable results. However, they typically use graphs extracted from existing databases or rely on subject matter expertise, both of which are known to be incomplete and may contain false edges. To address this gap, we propose a graph-guided Bayesian modeling framework that accounts for network noise in regression models involving structured high-dimensional predictors. Specifically, we use two sources of network information, namely the noisy graph extracted from existing databases and the graph estimated from the observed predictors in the dataset at hand, to inform the model for the true underlying network via a latent scale modeling framework. This model is coupled with a Bayesian regression model for structured high-dimensional predictors involving an adaptive structured shrinkage prior. We develop an efficient Markov chain Monte Carlo algorithm for posterior sampling. We demonstrate the advantages of our method over existing methods in simulations and through analyses of a genomics dataset and a proteomics dataset for Alzheimer's disease.
Subjects
Alzheimer Disease, Genomics, Humans, Bayes Theorem, Algorithms, Alzheimer Disease/genetics, Factual Databases
ABSTRACT
The crucial impact of the microbiome on human health and disease has gained significant scientific attention. Researchers seek to connect microbiome features with health conditions, aiming to predict diseases and develop personalized medicine strategies. However, the practicality of conventional models is restricted by important aspects of microbiome data. Specifically, the observed data are compositional, as the counts within each sample are bound by a fixed-sum constraint. Moreover, microbiome data often exhibit high dimensionality, with the number of variables surpassing the number of available samples. In addition, microbiome features exhibiting phenotypical similarity usually have similar influence on the response variable. To address the challenges posed by these aspects of the data structure, we propose Bayesian compositional generalized linear models for analyzing microbiome data (BCGLM), with a structured regularized horseshoe prior for the compositional coefficients and a soft sum-to-zero restriction on the coefficients imposed through the prior distribution. We fitted the proposed models using Markov chain Monte Carlo (MCMC) algorithms with the R package rstan. The performance of the proposed method was assessed by extensive simulation studies, which show that our approach outperforms existing methods with more accurate coefficient estimates and lower prediction error. We also applied the proposed method to a microbiome study to identify microorganisms linked to inflammatory bowel disease (IBD). To make this work reproducible, the code and data used in this article are available at https://github.com/Li-Zhang28/BCGLM.
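The fixed-sum constraint mentioned above is commonly handled by moving compositions into log-ratio coordinates. The centered log-ratio (clr) transform, for example, produces coordinates that sum to zero, which mirrors the sum-to-zero restriction that compositional regression coefficients must respect (BCGLM imposes a soft version through the prior). A generic sketch, not code from the BCGLM repository:

```python
import math

def clr(composition):
    """Centered log-ratio transform of a composition (positive parts with a
    fixed sum). Each coordinate is the log of a part divided by the
    geometric mean; the results always sum to zero."""
    logs = [math.log(x) for x in composition]
    g = sum(logs) / len(logs)  # log of the geometric mean
    return [l - g for l in logs]

# Relative abundances of four hypothetical taxa (summing to 1).
z = clr([0.4, 0.3, 0.2, 0.1])
```

Because clr coordinates sum to zero by construction, a regression on them is only identified if the coefficients satisfy the same constraint, which is why compositional models build it in rather than leaving it to chance.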
Subjects
Microbiota, Humans, Linear Models, Bayes Theorem, Computer Simulation, Algorithms
ABSTRACT
Throughout the course of an epidemic, the rate at which disease spreads varies with behavioral changes, the emergence of new disease variants, and the introduction of mitigation policies. Estimating such changes in transmission rates can help us better model and predict the dynamics of an epidemic, and provide insight into the efficacy of control and intervention strategies. We present a method for likelihood-based estimation of parameters in the stochastic susceptible-infected-removed model under a time-inhomogeneous transmission rate composed of piecewise constant components. In doing so, our method simultaneously learns the change points in the transmission rate via a Markov chain Monte Carlo algorithm. The method targets the exact model posterior in a difficult missing-data setting given only partially observed case counts over time. We validate performance on simulated data before applying our approach to data from an Ebola outbreak in Western Africa and a COVID-19 outbreak on a university campus.
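What a piecewise-constant transmission rate does to epidemic dynamics can be seen with a simple deterministic SIR forward simulation. The paper's model is stochastic and fitted by MCMC; this sketch uses invented rates and a single pre-specified change point purely to illustrate the mechanism.

```python
def sir_piecewise(beta_segments, gamma, s0, i0, dt=0.1):
    """Euler-integrated SIR (population fractions) with a piecewise-constant
    transmission rate. beta_segments is a list of (beta, duration_in_days)."""
    s, i, r = s0, i0, 0.0
    traj = [(s, i, r)]
    for beta, duration in beta_segments:
        for _ in range(int(duration / dt)):
            new_inf = beta * s * i * dt  # S -> I
            new_rec = gamma * i * dt     # I -> R
            s -= new_inf
            i += new_inf - new_rec
            r += new_rec
            traj.append((s, i, r))
    return traj

# Hypothetical outbreak: transmission drops at day 30, e.g. a mitigation policy
# (beta falls from 0.3 to 0.1 per day while gamma stays at 0.1 per day).
traj = sir_piecewise([(0.3, 30.0), (0.1, 60.0)], gamma=0.1, s0=0.99, i0=0.01)
final = traj[-1]
```

After the change point the effective reproduction number falls below one and incidence decays, which is the signature the MCMC change-point method detects from partially observed counts.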
Subjects
Epidemics, Ebola Virus Disease, Humans, Likelihood Functions, Markov Chains, Disease Outbreaks, Monte Carlo Method, Bayes Theorem, Stochastic Processes
ABSTRACT
BACKGROUND: The analysis of dental caries has been a major focus of recent work on modeling dental defect data. While dental caries is of major importance in dental research, developmental defects, which can also contribute at an early stage of caries formation, are also of potential interest. This paper proposes a set of methods that address the appearance of different combinations of defects across different tooth regions. In our modeling we assess the linkages between tooth-region development and both the type of defect and associations with etiological predictors of the defects that could be influential at different times during tooth crown development. METHODS: We develop different hierarchical model formulations under the Bayesian paradigm to assess exposures during primary central incisor (PMCI) development and PMCI defects. We evaluate the Bayesian hierarchical models under various simulation scenarios to compare their performance on both simulated dental defect data and real data from a motivating application. RESULTS: The proposed model provides inference for identifying a subset of etiological predictors of an individual defect, accounting for the correlation between tooth regions, and for identifying a subset of etiological predictors for the joint effect of defects. Furthermore, the model provides inference on the correlation between tooth regions as well as between the joint effect of the developmental enamel defects and dental caries. Simulation results show that the proposed model consistently yields stable inferences in identifying etiological biomarkers associated with localized developmental enamel defects and dental caries under varying simulation scenarios, as judged by the small mean square error (MSE) obtained when comparing the simulation results to the real-application results.
CONCLUSION: We evaluate the proposed model under varying simulation scenarios to develop a model for multivariate dental defects and dental caries assuming a flexible covariance structure that can handle regional and joint effects. The proposed model sheds new light on methods for capturing inclusive predictors in different multivariate joint models under the same covariance structure and provides a natural extension to a nested hierarchical model.
Subjects
Dental Caries, Incisor, Child, Humans, Bayes Theorem, Deciduous Tooth, Prevalence, Dental Enamel
ABSTRACT
Greenhouse gas (GHG) emissions from streams and rivers are important contributors to global GHG emissions. The gas transfer coefficient (expressed as K600 at a water temperature of 20 °C), a crucial parameter for estimating GHG emissions, carries considerable uncertainty. This study proposed a new approach for estimating K600 based on high-frequency dissolved oxygen (DO) data and an ecosystem metabolism model, combining a numerical solution method with Markov chain Monte Carlo analysis. The study was conducted in the Chaohu Lake watershed in Southeastern China, using high-frequency data collected from six streams from 2021 to 2023. We found that: (1) the numerical solution of K600 showed distinct dynamic variability in all streams, ranging from 0 to 111.39 cm h-1; (2) streams with higher discharge (>10 m3 s-1) exhibited significant seasonal differences in K600, with monthly average discharge and water temperature the two factors determining its variation; and (3) K600 was a major source of uncertainty in CO2 emission fluxes, with a relative contribution of 53.72%. An integrated K600 model of riverine gas exchange was developed at the watershed scale and validated against observed DO changes. Our study stresses that resolving K600 dynamics can better represent areal change and reduce uncertainty in estimating GHG emissions.
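The logic of inferring gas exchange from diel DO curves can be sketched as follows: simulate the one-station metabolism balance dDO/dt = GPP(t) - ER + K * (DOsat - DO) and recover the gas exchange coefficient K by minimizing the misfit to an observed DO series. A simple grid search stands in for the paper's MCMC, and every rate below is an invented placeholder, not a Chaohu Lake value.

```python
def simulate_do(K, gpp_daily, er_daily, do_sat, do0, hours=48, dt=0.1):
    """One-station metabolism model, dDO/dt = GPP(t) - ER + K*(DOsat - DO).
    GPP (mg O2/L/day) is spread over daylight hours 6-18; K is in 1/h."""
    do, series = do0, []
    for n in range(int(hours / dt)):
        hour = (n * dt) % 24.0
        gpp = gpp_daily / 12.0 if 6.0 <= hour < 18.0 else 0.0  # per hour
        er = er_daily / 24.0                                   # per hour
        do += dt * (gpp - er + K * (do_sat - do))
        series.append(do)
    return series

def fit_k(observed, gpp_daily, er_daily, do_sat, do0):
    """Grid-search K minimizing squared error against the observed DO series."""
    best_k, best_sse = None, float("inf")
    for j in range(1, 101):
        k = j * 0.01
        sim = simulate_do(k, gpp_daily, er_daily, do_sat, do0)
        sse = sum((a - b) ** 2 for a, b in zip(sim, observed))
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k

# Generate a synthetic 48-hour DO series with a known K, then recover it.
obs = simulate_do(K=0.25, gpp_daily=8.0, er_daily=6.0, do_sat=9.0, do0=8.5)
k_hat = fit_k(obs, gpp_daily=8.0, er_daily=6.0, do_sat=9.0, do0=8.5)
```

In the real problem GPP and ER must be estimated jointly with K from noisy data, which is why the paper pairs the numerical solution with MCMC rather than a one-dimensional search.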
ABSTRACT
COVID-19 is a respiratory disease triggered by an RNA virus prone to mutations. Since December 2020, variants of COVID-19 (especially Delta and Omicron) with different characteristics influencing mortality and transmissibility have continuously emerged around the world. To address the novel dynamics of the disease, we propose and analyze a two-strain (native and mutant) transmission model with mutation and imperfect vaccination. We also assume that individuals recovered from the native strain can be infected with the mutant strain through direct contact with infected individuals or with contaminated surfaces or aerosols. We compute the basic reproduction number, R0, which is the maximum of the basic reproduction numbers of the native and mutant strains. Using center manifold theory, we prove the nonexistence of backward bifurcation and the global stability of the disease-free equilibrium when R0 < 1; that is, the vaccine is effective enough to eliminate the native and mutant strains even if it cannot provide full protection. A Hopf bifurcation appears when the endemic equilibrium loses its stability. An intermediate mutation rate ν1 leads to oscillations; when ν1 increases beyond a threshold, the system regains its stability and exhibits an interesting dynamic called an endemic bubble. An analytical expression for vaccine-induced herd immunity is derived; its epidemiological implication is that the disease may effectively be eradicated if the minimum herd immunity threshold is attained in the community. Furthermore, the model is parameterized with Indian data on the cumulative numbers of confirmed COVID-19 cases and deaths from March 1 to September 27, 2021, using an MCMC method. The cumulative cases and deaths can be reduced by increasing the vaccine efficacies against both the native and mutant strains.
We observe that with a vaccine efficacy of 90% against the native strain, both cumulative cases and deaths would be reduced by 0.40%. We conclude that increasing immunity against the mutant strain is more influential in controlling total cases than the vaccine efficacy against it. Our study demonstrates that the COVID-19 pandemic may be worse for certain mutation rates, where oscillations occur and outbreaks recur, but better for larger mutation rates, where the system stabilizes at a lower infection level. We perform sensitivity analysis using Latin hypercube sampling and partial rank correlation coefficients to illustrate the impact of parameters on the basic reproduction number and the numbers of cumulative cases and deaths, which ultimately sheds light on disease mitigation.
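The reproduction-number structure described above, with R0 taken as the maximum over the two strains, together with the classical herd-immunity threshold 1 - 1/R0, can be written down directly. The rates below are illustrative placeholders, not the values fitted to the Indian data.

```python
def r0_two_strain(beta_native, beta_mutant, gamma_native, gamma_mutant):
    """Basic reproduction number of a two-strain model: the maximum of the
    strain-specific reproduction numbers beta/gamma (SIR-type rates)."""
    return max(beta_native / gamma_native, beta_mutant / gamma_mutant)

# Hypothetical rates: the mutant strain transmits faster than the native one.
R0 = r0_two_strain(beta_native=0.25, beta_mutant=0.4,
                   gamma_native=0.1, gamma_mutant=0.1)

# Classical herd-immunity threshold: the minimum immune fraction needed
# to push the effective reproduction number below one.
herd_threshold = 1.0 - 1.0 / R0
```

Because R0 is a maximum, control measures must suppress the worse of the two strains; reducing only the native strain's transmission leaves R0 unchanged here, which parallels the abstract's finding that immunity against the mutant strain matters most.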
Subjects
COVID-19, Vaccines, Humans, SARS-CoV-2/genetics, COVID-19/epidemiology, COVID-19/prevention & control, Pandemics/prevention & control, Mutation, Vaccination
ABSTRACT
The most common type of cancer diagnosed among children is acute lymphocytic leukemia (ALL). In a study conducted by the Tata Translational Cancer Research Center (TTCRC) Kolkata, 236 children diagnosed with ALL were treated for approximately the first two years with two standard drugs (6MP and MTx) and were then followed for nearly the next three years. The goal is to identify the longitudinal biomarkers associated with time-to-relapse and to assess the effectiveness of the drugs. We develop a Bayesian joint model in which a linear mixed model jointly describes three biomarkers (white blood cell count, neutrophil count, and platelet count) and a semi-parametric proportional hazards model describes the time-to-relapse. Our proposed joint model can assess the effects of different covariates on the progression of the biomarkers, and the effects of the biomarkers (and the covariates) on time-to-relapse. In addition, the proposed joint model can efficiently impute missing longitudinal biomarkers. Our analysis shows that the white blood cell (WBC) count is not associated with time-to-relapse, whereas the neutrophil count and the platelet count are significantly associated with it. We also infer that a lower dose of 6MP and a higher dose of MTx jointly result in a lower relapse probability during the follow-up period. Interestingly, we find that relapse probability is lowest for the patients classified into the "high-risk" group at presentation. The effectiveness of the proposed joint model is assessed through extensive simulation studies.
Subjects
Mercaptopurine, Precursor Cell Lymphoblastic Leukemia-Lymphoma, Child, Humans, Mercaptopurine/adverse effects, Bayes Theorem, Methotrexate/therapeutic use, Precursor Cell Lymphoblastic Leukemia-Lymphoma/chemically induced, Precursor Cell Lymphoblastic Leukemia-Lymphoma/drug therapy, Recurrence, Biomarkers, Longitudinal Studies
ABSTRACT
In multiple instance learning (MIL), a bag represents a sample containing a set of instances, each described by a vector of explanatory variables, while the entire bag has only one label/response. Though many methods for MIL have been developed to date, few have paid attention to the interpretability of models and results. The proposed Bayesian regression model stands on two levels of hierarchy, which transparently show how explanatory variables explain, and instances contribute to, bag responses. Moreover, two selection problems are addressed simultaneously: instance selection, to find the instances in each bag responsible for the bag response, and variable selection, to search for the important covariates. To explore the joint discrete space of indicator variables created for the selection of both explanatory variables and instances, the shotgun stochastic search algorithm is modified to fit the MIL context. The proposed model also offers a natural and rigorous way to quantify uncertainty in coefficient estimation and outcome prediction, which many modern MIL applications call for. A simulation study shows that the proposed regression model can select variables and instances with high performance (AUC greater than 0.86), thus predicting responses well. The proposed method is applied to the musk data to predict binding strengths (labels) between molecules (bags) with different conformations (instances) and target receptors. It outperforms all existing methods and can identify variables relevant to modeling responses.