ABSTRACT
Advances in spatially resolved transcriptomics (ST) technologies are helping biologists comprehensively understand organ function and the tissue microenvironment. Accurate spatial domain identification is the foundation for delineating genomic heterogeneity and cellular interactions. Motivated by this, we construct a graph deep learning (GDL)-based spatial clustering approach in this paper. First, a deep graph infomax module embedding a residual gated graph convolutional neural network is leveraged to encode the gene expression profiles and spatial positions in ST data. Then, a Bayesian Gaussian mixture model is applied to the latent embeddings to generate spatial domains. Designed experiments show that the presented method is superior to other state-of-the-art GDL-based techniques on multiple ST datasets. The code and datasets used in this manuscript are available at https://github.com/narutoten520/SCGDL.
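As a rough sketch of the final clustering stage only (the GDL embedding step is not reproduced here), a Bayesian Gaussian mixture can be applied to latent embeddings. The embedding matrix `Z` below is a random stand-in for the network's output, and the component cap is an assumed hyperparameter.

```python
# Rough sketch of the final clustering stage only (the GDL embedding step
# is not reproduced): a Bayesian Gaussian mixture applied to latent
# embeddings. `Z` is a random stand-in for the network output, and the
# component cap of 10 is an assumed hyperparameter.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
Z = rng.normal(size=(3000, 32))          # hypothetical spot embeddings (spots x dims)

bgm = BayesianGaussianMixture(
    n_components=10,                     # upper bound; unused components are pruned
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full", max_iter=500, random_state=0,
)
domains = bgm.fit_predict(Z)             # one spatial-domain label per spot
print("effective domains:", int(np.sum(bgm.weights_ > 1e-2)))
```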
Subjects
Deep Learning, Transcriptome, Bayes Theorem, Gene Expression Profiling, Cell Communication
ABSTRACT
We introduce a mathematical model based on mixture theory intended to describe tumor-immune system interactions within the tumor microenvironment. The equations account for the geometry of the tumor expansion and the displacement of the immune cells, driven by diffusion and chemotactic mechanisms. They also take into account constraints on nutrient and oxygen supply. The numerical investigations analyze the impact of the different modeling assumptions and parameters. Depending on the parameters, the model can reproduce elimination, equilibrium, or escape phases, and it identifies a critical role of oxygen/nutrient supply in shaping tumor growth. In addition, antitumor immune cells are key factors in controlling tumor growth and maintaining an equilibrium, while protumor cells favor escape and tumor expansion.
Subjects
Neoplasms, Humans, Neoplasms/pathology, Immune System, Mathematics, Oxygen, Tumor Microenvironment
ABSTRACT
We consider the Bayesian estimation of the parameters of a finite mixture model from independent order statistics arising from imperfect ranked set sampling (RSS) designs. As a cost-effective method, ranked set sampling enables us to incorporate easily attainable characteristics, such as ranking information, into data collection and Bayesian estimation. To handle the special structure of the ranked set samples, we develop a Bayesian estimation approach that exploits the Expectation-Maximization (EM) algorithm to estimate the ranking parameters and Metropolis-within-Gibbs sampling to estimate the parameters of the underlying mixture model. Our findings show that the proposed RSS-based Bayesian estimation method outperforms the commonly used Bayesian counterpart based on simple random sampling. The developed method is finally applied to estimate the bone disorder status of women aged 50 and older.
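The sampler's skeleton can be sketched as follows for a toy two-component normal mixture with known unit variances. The imperfect-ranking likelihood of RSS and the EM step for the ranking parameters are omitted, so this is only an illustration of Metropolis-within-Gibbs, not the authors' full method.

```python
# Toy Metropolis-within-Gibbs skeleton for a two-component normal mixture
# with known unit variances. The imperfect-ranking likelihood of RSS and
# the EM step for the ranking parameters are NOT implemented here.
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 150), rng.normal(4, 1, 50)])  # toy sample

def log_post(theta):
    w, mu1, mu2 = theta
    if not 0.0 < w < 1.0:
        return -np.inf                               # uniform prior on (0, 1) for w
    dens = (w * np.exp(-0.5 * (x - mu1) ** 2) +
            (1 - w) * np.exp(-0.5 * (x - mu2) ** 2)) / np.sqrt(2 * np.pi)
    return np.sum(np.log(dens)) - (mu1 ** 2 + mu2 ** 2) / 200  # N(0, 10^2) priors

theta, chain = np.array([0.5, -1.0, 1.0]), []
for _ in range(5000):
    for j in range(3):                               # one Metropolis step per block
        prop = theta.copy()
        prop[j] += rng.normal(0, 0.1)
        if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
            theta = prop
    chain.append(theta.copy())
print("posterior means (w, mu1, mu2):", np.mean(chain[1000:], axis=0))
```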
Subjects
Algorithms, Bayes Theorem, Statistical Models, Humans, Female, Middle Aged, Aged, Computer Simulation, Monte Carlo Method, Likelihood Functions, Markov Chains
ABSTRACT
Investigations into household structure in low- and middle-income countries (LMICs) provide important insight into how families manage domestic life in response to resource allocation and caregiving needs during periods of rapid sociopolitical and health-related challenges. Recent evidence on household structure in many LMICs contrasts with long-standing viewpoints of worldwide convergence toward a Western nuclearized household model. Here, we adopt a household-centered theoretical and methodological framework to investigate longitudinal patterns and dynamics of household structure in a rural South African setting during a period of high AIDS-related mortality and socioeconomic change. Data come from the Agincourt Health and Socio-Demographic Surveillance System (2003-2015). Using latent transition models, we derived six distinct household types by examining the conditional interdependency between household heads' characteristics, members' age composition, and migration status. More than half of households were characterized by complex and multigenerational profiles, with considerable within-typology variation in household size and dependency structure. Transition analyses showed stability of household types under female headship, while higher proportions of nuclearized household types dissolved over time. Household dissolution was closely linked to prior mortality experiences, particularly following the death of a male head. Our findings highlight the need to better conceptualize and contextualize household changes across populations and over time.
Subjects
Family Characteristics, Rural Population, Humans, Male, Female, Socioeconomic Factors, Longitudinal Studies, South Africa/epidemiology
ABSTRACT
BACKGROUND: Overweight and obesity are among the leading chronic diseases worldwide. Environmental phenols are recognized as endocrine disruptors that contribute to weight changes; however, the effects of exposure to mixed phenols on obesity are not well established. METHODS: Using data from adults in the National Health and Nutrition Examination Survey (NHANES), this study examined the individual and combined effects of four phenols on obesity. Traditional logistic regression was combined with two mixture-analysis models (weighted quantile sum (WQS) regression and Bayesian kernel machine regression (BKMR)) to assess the role of phenols in the development of obesity. The potential mediation of these effects by cholesterol was analyzed through a parallel mediation model. RESULTS: The results demonstrated that each phenol except triclosan was inversely associated with obesity (P-value < 0.05). The WQS index was also negatively correlated with general obesity (β: 0.770, 95% CI: 0.644-0.919, P-value = 0.004) and abdominal obesity (β: 0.781, 95% CI: 0.658-0.928, P-value = 0.004). Consistently, the BKMR model demonstrated significant joint negative effects of phenols on obesity. The parallel mediation analysis revealed that high-density lipoprotein mediated the effects of all four single phenols on obesity, whereas low-density lipoprotein mediated only the association between benzophenone-3 and obesity. Moreover, cholesterol acted as a mediator of the association between mixed phenols and obesity. Exposure to single and mixed phenols was significantly and negatively correlated with obesity, and cholesterol mediated these associations. CONCLUSIONS: Assessing the potential public health risks of mixed phenols helps to incorporate this information into practical health advice and guidance.
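For intuition, a stripped-down WQS fit might look like the following sketch: exposures are quartile-scored, the mixture weights are kept on the simplex via a softmax reparameterization, and the index slope is estimated by maximizing a logistic likelihood. The data are simulated placeholders, and the bootstrap and train/validation split of a full WQS analysis are omitted.

```python
# Stripped-down sketch of a WQS fit for a binary outcome: exposures are
# quartile-scored, weights are kept on the simplex via softmax, and the
# index slope is estimated by maximizing the logistic likelihood. Data
# are simulated; bootstrap and train/validation split are omitted.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, softmax

rng = np.random.default_rng(2)
X = rng.lognormal(size=(500, 4))                    # hypothetical urinary phenols

q = np.column_stack([                               # quartile scores 0-3
    np.searchsorted(np.quantile(X[:, j], [0.25, 0.5, 0.75]), X[:, j])
    for j in range(X.shape[1])
])
w_true = np.array([0.4, 0.3, 0.2, 0.1])
y = rng.binomial(1, expit(0.5 - 0.3 * (q @ w_true)))  # true inverse association

def negloglik(params):
    b0, b1 = params[:2]
    w = softmax(params[2:])                         # non-negative, sums to 1
    p = expit(b0 + b1 * (q @ w))
    return -np.sum(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

res = minimize(negloglik, x0=np.zeros(2 + X.shape[1]), method="Nelder-Mead",
               options={"maxiter": 5000})
print("OR per quartile of the WQS index:", np.exp(res.x[1]))
print("estimated weights:", softmax(res.x[2:]).round(2))
```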
Subjects
Isoflavones, Obesity, Phenols, Humans, Phenols/urine, Male, Adult, Female, Middle Aged, Cholesterol/blood, Benzhydryl Compounds/urine, Triclosan/adverse effects, Nutrition Surveys, Bayes Theorem, Endocrine Disruptors/urine, Chlorophenols/urine
ABSTRACT
In the realm of road safety and the evolution toward automated driving, Advanced Driver Assistance and Automated Driving (ADAS/AD) systems play a pivotal role. As the complexity of these systems grows, comprehensive testing becomes imperative, with virtual test environments becoming crucial, especially for handling diverse and challenging scenarios. Radar sensors are integral to ADAS/AD units and are known for their robust performance even in adverse conditions. However, accurately modeling the radar's perception, particularly the radar cross-section (RCS), proves challenging. This paper adopts a data-driven approach, using Gaussian mixture models (GMMs) to model the radar's perception for various vehicles and aspect angles. A Bayesian variational approach automatically infers the model complexity. The model is expanded into a comprehensive radar sensor model based on object lists, incorporating occlusion effects and RCS-based detectability decisions. The model's effectiveness is demonstrated through accurate reproduction of the RCS behavior and scatter point distribution, and its full capabilities are demonstrated in different scenarios. The flexible and modular framework has proven apt for modeling specific aspects and allows for easy model extension. Alongside model extension, more extensive validation is proposed to refine accuracy and broaden the model's applicability.
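A minimal sketch of the variational GMM idea, with a synthetic (aspect angle, RCS) generator standing in for measured radar data: the Dirichlet-process prior drives the weights of superfluous components toward zero, which is the automatic model-complexity inference described above.

```python
# Minimal sketch of the variational GMM idea on synthetic (aspect angle,
# RCS) pairs: a Dirichlet-process prior drives the weights of superfluous
# components toward zero, inferring model complexity automatically. The
# angle/RCS generator below is an invented stand-in for measured data.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(3)
angle = rng.uniform(0, 360, 2000)                              # aspect angle (deg)
rcs = 10 + 5 * np.cos(np.radians(2 * angle)) + rng.normal(0, 1.5, 2000)  # toy dBsm
samples = np.column_stack([angle, rcs])

gmm = BayesianGaussianMixture(
    n_components=20,                                 # generous upper bound
    weight_concentration_prior_type="dirichlet_process",
    max_iter=1000, random_state=0,
).fit(samples)
print("components retained:", int(np.sum(gmm.weights_ > 0.01)))

synthetic_rcs, _ = gmm.sample(500)                   # draws for a virtual scenario
```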
ABSTRACT
Clustering methods are increasingly used in social science research. Generally, researchers use them to infer the existence of qualitatively different types of individuals within a larger population, thus unveiling previously "hidden" heterogeneity. Depending on the clustering technique, however, valid inference requires certain conditions and assumptions. Common risks include not only failing to detect existing clusters due to a lack of power but also revealing clusters that do not exist in the population. Simple data simulations suggest that, under the conditions of sample size and of indicator number, correlation, and skewness frequently encountered in applied psychological research, commonly used clustering methods are at high risk of detecting clusters that are not there. Generally, this is due to violations of assumptions that are not usually considered critical in psychology. The present article presents a simple R tutorial and a Shiny app (for those who are not familiar with R) that allow researchers to quantify a priori inferential risks when performing clustering methods on their own data. Doing so is suggested as a much-needed preliminary sanity check, because conditions that inflate the number of detected clusters are very common in applied psychological research scenarios.
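The kind of a priori risk check advocated here can be sketched in a few lines (in Python rather than the article's R tools): simulate a single homogeneous population with correlated, skewed indicators and count how often an information criterion prefers more than one cluster. All sizes and parameter values below are illustrative.

```python
# Sketch of an a priori risk check: simulate a homogeneous population
# with correlated, skewed indicators and count how often BIC-based model
# selection prefers more than one cluster. All values are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
n, p, rho, reps = 200, 6, 0.3, 100
cov = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)   # equicorrelated indicators
L = np.linalg.cholesky(cov)

spurious = 0
for _ in range(reps):
    z = rng.normal(size=(n, p)) @ L.T
    x = np.exp(0.8 * z)                                # skewed, one true cluster
    bics = [GaussianMixture(k, random_state=0).fit(x).bic(x) for k in (1, 2, 3)]
    spurious += int(np.argmin(bics) > 0)               # best k > 1 is a false alarm
print(f"spurious clusters detected in {spurious}/{reps} null datasets")
```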
ABSTRACT
It is now common to have a modest to large number of features on individuals with complex diseases. Unsupervised analyses, such as clustering with and without preprocessing by Principal Component Analysis (PCA), are widely used in practice to uncover subgroups in a sample. However, in many modern studies features are often highly correlated and noisy (e.g., SNPs, omics, quantitative imaging markers, and electronic health record data). The practical performance of clustering approaches in these settings remains unclear. Through extensive simulations and empirical examples applying Gaussian mixture models and related clustering methods, we show these approaches (including variants of k-means, VarSelLCM, HDClassifier, and Fisher-EM) can have very poor performance in many settings. We also show the poor performance is often driven by an explicit or implicit assumption by the clustering algorithm that high-variance features are relevant while lower-variance features are irrelevant, which we call the variance-as-relevance assumption. We develop practical pre-processing approaches that improve analysis performance in some cases. This work offers practical guidance on the strengths and limitations of unsupervised clustering approaches in modern data analysis applications.
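A toy demonstration of the variance-as-relevance failure mode, with simulated data: the cluster signal lives in two low-variance features, a single high-variance noise feature dominates the distances, and k-means recovers the truth only when the noise feature is excluded.

```python
# Toy demonstration of the variance-as-relevance failure mode: the
# cluster signal lives in two low-variance features, one high-variance
# noise feature dominates the distances, and k-means recovers the truth
# only when the noise feature is excluded. All data are simulated.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(5)
labels = rng.integers(0, 2, 400)                               # two true subgroups
signal = labels[:, None] * 1.5 + rng.normal(0, 0.5, (400, 2))  # informative, sd 0.5
noise = rng.normal(0, 10, (400, 1))                            # irrelevant, sd 10
X = np.hstack([signal, noise])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
print("ARI, noise included:", adjusted_rand_score(labels, km.fit_predict(X)))
print("ARI, informative only:", adjusted_rand_score(labels, km.fit_predict(X[:, :2])))
```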
ABSTRACT
BACKGROUND: One strategy for identifying targets of a regulatory factor is to perturb the factor and use high-throughput RNA sequencing to examine the consequences. However, distinguishing direct targets from secondary effects and experimental noise can be challenging when confounding signal is present in the background at varying levels. RESULTS: Here, we present a statistical modeling strategy to identify microRNAs that are primary substrates of target-directed miRNA degradation (TDMD) mediated by ZSWIM8. This method uses a bi-beta-uniform mixture (BBUM) model to separate primary from background signal components, leveraging the expectation that primary signal is restricted to upregulation and not downregulation upon loss of ZSWIM8. The BBUM model strategy retained the apparent sensitivity and specificity of the previous ad hoc approach but was more robust against outliers, achieved a more consistent stringency, and could be performed using a single false discovery rate (FDR) cutoff. CONCLUSIONS: We developed the BBUM model, a robust statistical modeling strategy to account for background secondary signal in differential expression data. It performed well for identifying primary substrates of TDMD and should be useful for other applications in which the primary regulatory targets are only upregulated or only downregulated. The BBUM model, FDR-correction algorithm, and significance-testing methods are available as an R package at https://github.com/wyppeter/bbum.
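For orientation, the classic beta-uniform mixture (BUM) that BBUM builds on can be fit by maximum likelihood as sketched below; the full BBUM adds a second beta component for the background signal and the upregulation-only restriction, neither of which is implemented here. The p-values are simulated.

```python
# Maximum-likelihood fit of the classic beta-uniform mixture (BUM) that
# BBUM extends: f(p) = lam + (1 - lam) * a * p**(a - 1). The second beta
# component and the upregulation-only restriction of the full BBUM model
# are not implemented here; the p-values are simulated.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(6)
pvals = np.concatenate([rng.uniform(size=800),       # background (uniform)
                        rng.beta(0.3, 1.0, 200)])    # signal concentrated near zero

def nll(theta):
    lam, a = expit(theta[0]), np.exp(theta[1])       # map to (0,1) and (0,inf)
    f = lam + (1.0 - lam) * a * pvals ** (a - 1.0)
    return -np.sum(np.log(f + 1e-12))

res = minimize(nll, x0=[0.0, 0.0], method="Nelder-Mead")
lam, a = expit(res.x[0]), np.exp(res.x[1])
print(f"uniform fraction = {lam:.2f}, beta shape a = {a:.2f}")
```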
Subjects
MicroRNAs, MicroRNAs/genetics, Algorithms, Base Sequence, Statistical Models, RNA Sequence Analysis/methods, High-Throughput Nucleotide Sequencing/methods
ABSTRACT
Serological assays used to estimate the prevalence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) often rely on manufacturers' cutoffs established on the basis of severe cases. We conducted a household-based serosurvey of 4,677 individuals in Chennai, India, from January to May 2021. Samples were tested for SARS-CoV-2 immunoglobulin G (IgG) antibodies to the spike (S) and nucleocapsid (N) proteins. We calculated seroprevalence, defining seropositivity using manufacturer cutoffs and, alternatively, using a mixture model based on measured IgG levels. Using manufacturer cutoffs, the seroprevalence estimates from the two assays differed 5-fold. This difference was largely reconciled using the mixture model, with estimated anti-S and anti-N IgG seroprevalence of 64.9% (95% credible interval (CrI): 63.8, 66.0) and 51.5% (95% CrI: 50.2, 52.9), respectively. Age and socioeconomic factors showed inconsistent relationships with anti-S and anti-N IgG seropositivity using manufacturer cutoffs. In the mixture model, age was not associated with seropositivity, and improved household ventilation was associated with lower odds of seropositivity. With global vaccine scale-up, the utility of the more stable anti-S IgG assay may be limited because the S protein is included in several vaccines. Estimates of SARS-CoV-2 seroprevalence using alternative targets must consider heterogeneity in seroresponse to ensure that seroprevalence is not underestimated and correlates are not misinterpreted.
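The mixture-model idea can be sketched as follows with simulated log IgG levels: fit a two-component mixture and read seroprevalence off the weight of the higher-mean component rather than thresholding at a manufacturer cutoff. The paper's Bayesian model and credible intervals are not reproduced here.

```python
# Sketch of cutoff-free seroprevalence estimation: fit a two-component
# mixture to simulated log IgG levels and take the weight of the
# higher-mean component as the seroprevalence. The paper's Bayesian model
# and credible intervals are not reproduced here.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
log_igg = np.concatenate([rng.normal(-1.0, 0.6, 1800),   # seronegative stand-ins
                          rng.normal(1.5, 0.8, 2877)])   # seropositive stand-ins

gm = GaussianMixture(n_components=2, random_state=0).fit(log_igg.reshape(-1, 1))
pos = int(np.argmax(gm.means_.ravel()))                  # higher-mean component
print(f"estimated seroprevalence: {gm.weights_[pos]:.1%}")

# Per-person seropositivity probabilities, usable in regression analyses:
prob_pos = gm.predict_proba(log_igg.reshape(-1, 1))[:, pos]
```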
Subjects
COVID-19, Humans, India/epidemiology, COVID-19/epidemiology, SARS-CoV-2, Seroepidemiologic Studies, Immunoglobulin G
ABSTRACT
Population-scale biobanks that combine genetic data and high-dimensional phenotyping for a large number of participants provide an exciting opportunity to perform genome-wide association studies (GWAS) to identify genetic variants associated with diverse quantitative traits and diseases. A major challenge for GWAS in population biobanks is ascertaining disease cases from heterogeneous data sources such as hospital records, digital questionnaire responses, or interviews. In this study, we use genetic parameters, including genetic correlation, to evaluate whether GWAS performed using UK Biobank cases ascertained from hospital records, questionnaire responses, and family history of disease implicate similar disease genetics across a range of effect sizes. We find that hospital-record and questionnaire GWAS largely identify similar genetic effects for many complex phenotypes and that combining the two phenotyping methods improves the power to detect genetic associations. We also show that GWAS using cases ascertained on family history of disease agrees with the combined hospital-record and questionnaire GWAS, and that family history GWAS has better power to detect genetic associations for some phenotypes. Overall, this work demonstrates that digital phenotyping and unstructured phenotype data can be combined with structured data such as hospital records to identify cases for GWAS in biobanks and improve the ability of such studies to identify genetic associations.
Subjects
Disease/genetics, Genome-Wide Association Study, Phenotype, Asthma/genetics, Factual Databases, Female, Medical Genetics, Genotype, Humans, Male, Neoplasms/genetics, United Kingdom
ABSTRACT
Computer simulations are increasingly used to access the thermo-kinetic information underlying structural transformations of protein kinases. Such information is necessary to probe their roles in disease progression and their interactions with drug targets. However, these investigations are frequently challenged by prohibitively high computational expense and by the lack of standard protocols for designing low-dimensional physical descriptors that encode the system features important for transitions. Here, we consider the demarcating characteristics of the different states of Abelson tyrosine kinase associated with distinct catalytic activity to construct a set of physically meaningful, orthogonal collective variables that preserve the slow modes of the system. Independent sampling of each metastable state is followed by estimation of the global partition function along the appropriate physical descriptors using the modified Expectation Maximized Molecular Dynamics method. The resulting free energy barriers are in excellent agreement with experimentally known rate-limiting dynamics and with activation energies computed using conventional enhanced sampling methods. We discuss possible directions for further development and applications.
Subjects
Molecular Dynamics Simulation, Protein-Tyrosine Kinases, Entropy, Catalysis, Kinetics
ABSTRACT
Scanning electron microscopy (SEM) is a powerful technique for investigating the structural and chemical properties of multiphase materials at the micro- and nanoscale due to its high-resolution capabilities. One of the main outcomes of SEM-based analysis is the calculation of the fractions of the material components constituting a multiphase material by means of segmentation of its backscattered electron SEM images. To segment multiphase images, Gaussian mixture models (GMMs) are commonly used, based on the deconvolution of the image pixel histogram. Despite their extensive use, the accuracy of GMM predictions has not yet been validated. In this paper, we present a systematic study of the accuracy and limitations of the GMM method when applied to the segmentation of a four-phase material. To this end, we first build a modelling framework and propose an index to quantify the accuracy of GMM predictions for all phases. We then apply this framework to calculate the impact of collective parameters of the image histogram on the accuracy of GMM predictions. Finally, we conclude some rules of thumb to guide SEM users on the suitability of GMM for segmenting their SEM images based only on inspection of the image histogram: a histogram is suitable for GMM when its number of peaks equals the number of Gaussian components; otherwise, its kurtosis and skewness should be smaller than 2.35 and 0.1, respectively.
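A minimal sketch of the histogram-deconvolution workflow on a synthetic four-phase image: the rule-of-thumb moments are checked first, then a four-component GMM is fit to the pixel intensities and the phase fractions are read off the mixture weights. The grey levels, noise level and fractions below are invented for illustration.

```python
# Minimal sketch of histogram-based GMM segmentation for a synthetic
# four-phase BSE image: check the rule-of-thumb moments, fit a
# four-component GMM to pixel intensities, and read the phase fractions
# off the mixture weights. Grey levels, noise and fractions are invented.
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(8)
true_fractions = [0.4, 0.3, 0.2, 0.1]
phases = rng.choice(4, size=512 * 512, p=true_fractions)
grey = rng.normal(np.array([50.0, 110.0, 170.0, 220.0])[phases], 12.0)

print("skewness:", stats.skew(grey).round(2),
      "| excess kurtosis:", stats.kurtosis(grey).round(2))

gmm = GaussianMixture(n_components=4, random_state=0).fit(grey.reshape(-1, 1))
order = np.argsort(gmm.means_.ravel())
print("estimated phase fractions:", gmm.weights_[order].round(3))  # vs true values
labels = gmm.predict(grey.reshape(-1, 1))                          # per-pixel phases
```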
Subjects
Algorithms, Computer-Assisted Image Processing, Computer-Assisted Image Processing/methods, Normal Distribution, Scanning Electron Microscopy
ABSTRACT
High-throughput spatial transcriptomics (HST) is a rapidly emerging class of experimental technologies that allow for profiling gene expression in tissue samples at or near single-cell resolution while retaining the spatial location of each sequencing unit within the tissue sample. Through analyzing HST data, we seek to identify sub-populations of cells within a tissue sample that may inform biological phenomena. Existing computational methods either ignore the spatial heterogeneity in gene expression profiles, fail to account for important statistical features such as skewness, or are heuristic-based network clustering methods that lack the inferential benefits of statistical modeling. To address this gap, we develop SPRUCE: a Bayesian spatial multivariate finite mixture model based on multivariate skew-normal distributions, which is capable of identifying distinct cellular sub-populations in HST data. We further implement a novel combination of Pólya-Gamma data augmentation and spatial random effects to infer spatially correlated mixture component membership probabilities without relying on approximate inference techniques. Via a simulation study, we demonstrate the detrimental inferential effects of ignoring skewness or spatial correlation in HST data. Using publicly available human brain HST data, SPRUCE outperforms existing methods in recovering expertly annotated brain layers. Finally, our application of SPRUCE to human breast cancer HST data indicates that SPRUCE can distinguish distinct cell populations within the tumor microenvironment. An R package spruce for fitting the proposed models is available through the Comprehensive R Archive Network.
Subjects
Statistical Models, Transcriptome, Humans, Bayes Theorem, Computer Simulation, Gene Expression Profiling
ABSTRACT
Missing outcomes are commonly encountered in randomized controlled trials (RCTs) involving human subjects and present a risk of substantial bias in the results of a complete case analysis. While response rates for RCTs are typically high, there is no agreed-upon universal threshold below which the amount of missing data is deemed not to be a threat to inference. We focus here on binary outcomes that are possibly missing not at random, that is, where the value of the outcome influences its probability of being observed. Salient information that can assist in addressing these missing outcomes is the anticipated response rate in each study arm; these rates can often be anticipated based on prior research in similar populations using similar designs and outcomes. Further, in some areas of human subjects research, we are often confident, or we suspect, that response rates among RCT participants with successful treatment outcomes will be at least as great as those among participants without successful treatment outcomes. In other settings we may suspect the opposite relationship. This direction of the differential response between those with successful and unsuccessful outcomes can further aid in addressing the missing outcomes. We present simple Bayesian pattern-mixture models that incorporate this information on response rates to analyze the relationship between a binary outcome and an intervention while addressing the missing outcomes. We assess the performance of this method in simulation studies and apply it to the results of an RCT of a smoking abstinence intervention.
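A minimal Monte Carlo sketch of the pattern-mixture idea for a single arm, under assumed numbers: the observed outcomes get a Beta posterior, and the success probability among non-responders is shifted by a sensitivity parameter `delta` encoding the suspected direction of differential response. The paper's full model, which ties `delta` to anticipated response rates, is not reproduced.

```python
# Monte Carlo sketch of a one-arm Bayesian pattern-mixture analysis for a
# binary outcome that may be missing not at random. All numbers below are
# hypothetical, and `delta` is an assumed sensitivity parameter.
import numpy as np

rng = np.random.default_rng(9)
n, n_obs, successes = 100, 80, 48        # 20 participants missing the outcome
delta = -0.10                            # assume non-responders fare worse

p_obs = rng.beta(1 + successes, 1 + n_obs - successes, size=10_000)  # posterior draws
p_mis = np.clip(p_obs + delta, 0.0, 1.0)            # shifted success prob. if missing
p_all = (n_obs * p_obs + (n - n_obs) * p_mis) / n   # marginal success probability
print("posterior mean:", p_all.mean().round(3),
      "95% CrI:", np.quantile(p_all, [0.025, 0.975]).round(3))
```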
ABSTRACT
In many healthcare and social science applications, information about units is dispersed across multiple data files. Linking records across files is necessary to estimate the associations of interest. Common record linkage algorithms rely only on similarities between linking variables that appear in all the files. Moreover, analysis of linked files often ignores errors that may arise from incorrect or missed links. Bayesian record linkage methods allow for natural propagation of linkage error by jointly sampling the linkage structure and the model parameters. We extend an existing Bayesian record linkage method to integrate associations between variables exclusive to each file being linked. We show analytically, and using simulations, that the proposed method can improve the linking process and can result in accurate inferences. We apply the method to link Meals on Wheels recipients to Medicare enrollment records.
Subjects
Medical Record Linkage, Medicare, Aged, Humans, United States, Bayes Theorem, Medical Record Linkage/methods, Algorithms
ABSTRACT
Multivariate longitudinal data are used in a variety of research areas, not only because they allow the analysis of time trajectories of multiple indicators, but also because they make it possible to determine how these trajectories are influenced by other covariates. In this article, we propose a mixture of longitudinal factor analyzers. This model can be used to extract latent factors representing multiple noisy longitudinal indicators in heterogeneous longitudinal data and to study the impact of one or several covariates on these latent factors. One advantage of this model is that it allows for measurement non-invariance, which arises in practice when the factor structure varies between groups of individuals due to cultural or physiological differences. This is achieved by estimating different factor models for different latent classes. The proposed model can also be used to extract latent classes with different latent factor trajectories over time. Other advantages of the model include its ability to account for heteroscedasticity of errors in the factor analysis model by estimating different error variances for different latent classes. We first define the mixture of longitudinal factor analyzers and its parameters. Then, we propose an EM algorithm to estimate these parameters. We propose a Bayesian information criterion to identify both the number of components in the mixture and the number of latent factors. We then discuss the comparability of the latent factors obtained between subjects in different latent groups. Finally, we apply the model to simulated and real data from patients with chronic postoperative pain.
Subjects
Chronic Pain, Humans, Chronic Pain/diagnosis, Bayes Theorem, Algorithms, Longitudinal Studies
ABSTRACT
Modeling longitudinal trajectories and identifying latent classes of trajectories is of great interest in biomedical research, and software to do so is readily available for latent class trajectory analysis (LCTA), growth mixture modeling (GMM), and covariance pattern mixture models (CPMM). In biomedical applications, the level of within-person correlation is often non-negligible, which can impact model choice and interpretation. LCTA does not incorporate this correlation. GMM does so through random effects, while CPMM specifies a model for the within-class marginal covariance matrix. Previous work has investigated the impact of constraining covariance structures, both within and across classes, in GMMs, an approach often used to solve convergence problems. Using simulation, we focused specifically on how misspecification of the temporal correlation structure and strength, with correct variances, impacts class enumeration and parameter estimation under LCTA and CPMM. We found that (1) even in the presence of weak correlation, LCTA often does not reproduce the original classes, (2) CPMM performs well in class enumeration when the correct correlation structure is selected, and (3) regardless of misspecification of the correlation structure, both LCTA and CPMM give unbiased estimates of the class trajectory parameters when the within-individual correlation is weak and the number of classes is correctly specified. However, the bias increases markedly when the correlation is moderate for LCTA and when the incorrect correlation structure is used for CPMM. This work highlights the importance of correlation alone in obtaining appropriate model interpretations and provides insight into model choice.
Subjects
Biomedical Research, Software, Humans, Computer Simulation, Latent Class Analysis, Bias
ABSTRACT
There is increasing interest in modelling longitudinal dietary data and classifying individuals into subgroups (latent classes) who follow similar trajectories over time. These trajectories could identify population groups and time points amenable to dietary interventions. This paper aimed to provide a comparison and overview of two latent class methods: group-based trajectory modelling (GBTM) and growth mixture modelling (GMM). Data from 2963 mother-child dyads from the longitudinal Southampton Women's Survey were analysed. Continuous diet quality indices (DQI) were derived using principal component analysis from interviewer-administered FFQ collected in mothers pre-pregnancy and at 11 and 34 weeks' gestation, and in offspring at 6 and 12 months and 3, 6-7 and 8-9 years. A forward modelling approach from 1 to 6 classes was used to identify the optimal number of DQI latent classes. Models were assessed using the Akaike and Bayesian information criteria, probability of class assignment, ratio of the odds of correct classification, group membership and entropy. Both methods suggested that five classes were optimal, with a strong correlation (Spearman's ρ = 0·98) between the class assignments of the two methods. The dietary trajectories were stable, following approximately horizontal lines, and were labelled poor (GMM = 4 %, GBTM = 5 %), poor-medium (23 %, 23 %), medium (39 %, 39 %), medium-better (27 %, 28 %) and best (7 %, 6 %). Both GBTM and GMM are suitable for identifying dietary trajectories. GBTM is recommended as it is computationally less intensive, but results could be confirmed using GMM. The stability of the diet quality trajectories from pre-pregnancy underlines the importance of promoting dietary improvements from preconception onwards.
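Several of the assessment criteria above are simple functions of the posterior class-membership matrix. A sketch of the relative-entropy diagnostic follows, with a random matrix `post` standing in for the posteriors that a fitted GBTM or GMM would produce.

```python
# Sketch of the relative-entropy diagnostic for a fitted latent class
# solution: values near 1 indicate crisp class assignment. `post` is a
# random stand-in for the N x K posterior class-membership matrix from a
# GBTM or GMM fit (here N = 2963 dyads, K = 5 classes).
import numpy as np

rng = np.random.default_rng(10)
post = rng.dirichlet(alpha=[8, 1, 1, 1, 1], size=2963)

def relative_entropy(post):
    n, k = post.shape
    h = -np.sum(post * np.log(np.clip(post, 1e-12, None)))
    return 1.0 - h / (n * np.log(k))

print("relative entropy:", round(relative_entropy(post), 3))
```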
Subjects
Diet, Mothers, Pregnancy, Humans, Female, Longitudinal Studies, Bayes Theorem, Surveys and Questionnaires
ABSTRACT
The present study demonstrates the potential of synchronous fluorescence spectroscopy and multivariate data analysis for the authentication of COVID-19 vaccines from various manufacturers. Synchronous scanning fluorescence spectra were recorded for DNA-based and mRNA-based vaccines obtained through the NHS Central Liverpool Primary Care Network. Fluorescence spectra of DNA and DNA-based vaccines, as well as of RNA and RNA-based vaccines, were identical to one another. The application of principal component analysis (PCA), PCA-Gaussian mixture models (PCA-GMM) and self-organising maps (SOM) to the fluorescence spectra of the vaccines is discussed. PCA is applied to extract the characteristic variables of the fluorescence spectra by analysing their major attributes. The results indicated that the first three principal components (PCs) account for 99.5% of the total variance in the data. The PC scores plot showed two distinct clusters corresponding to the DNA-based vaccines and the mRNA-based vaccines, respectively. PCA-GMM clustering complemented the PCA clusters by further classifying the mRNA-based vaccines, and the GMM clusters revealed three mRNA-based vaccines that were not clustered with the other vaccines. SOM complemented both PCA and PCA-GMM and proved effective with multivariate data without the need for dimensionality reduction. The findings showed that fluorescence spectroscopy combined with machine learning algorithms (PCA, PCA-GMM and SOM) is a useful technique for vaccine verification, with the benefits of simplicity, speed and reliability.
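A condensed sketch of the PCA-GMM step under stated assumptions (random matrices stand in for the recorded synchronous fluorescence scans; component counts are illustrative): project the spectra onto the leading principal components, then cluster the PC scores with a Gaussian mixture.

```python
# Condensed sketch of the PCA-GMM step: project spectra onto the leading
# principal components, then cluster the PC scores with a Gaussian
# mixture. The spectra are random stand-ins for the recorded synchronous
# fluorescence scans, and the component counts are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(11)
spectra = rng.normal(size=(60, 400))      # hypothetical: 60 scans x 400 wavelengths

pca = PCA(n_components=3)
scores = pca.fit_transform(spectra)       # first three PCs
print("variance explained:", pca.explained_variance_ratio_.sum().round(3))

gmm = GaussianMixture(n_components=2, random_state=0).fit(scores)
clusters = gmm.predict(scores)            # e.g. DNA-based vs mRNA-based groups
print("cluster sizes:", np.bincount(clusters))
```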