Pesquisa | Biblioteca Virtual em Saúde

1.

Evaluation of Calibration Approaches for Indoor Deployments of PurpleAir Monitors.

Koehler, Kirsten; Wilks, Megan; Green, Tim; Rule, Ana M; Zamora, Misti L; Buehler, Colby; Datta, Abhirup; Gentner, Drew R; Putcha, Nirupama; Hansel, Nadia N; Kirk, Gregory D; Raju, Sarath; McCormack, Meredith.

Atmos Environ (1994) ; 3102023 Oct 01.

Artigo em Inglês | MEDLINE | ID: mdl-37901719

RESUMO

Low-cost air quality monitors are growing in popularity among both researchers and community members to understand variability in pollutant concentrations. Several studies have produced calibration approaches for these sensors for ambient air. These calibrations have been shown to depend primarily on relative humidity, particle size distribution, and particle composition, which may be different in indoor environments. However, despite the fact that most people spend the majority of their time indoors, little is known about the accuracy of commonly used devices indoors. This stems from the fact that calibration data for sensors operating in indoor environments are rare. In this study, we sought to evaluate the accuracy of the raw data from PurpleAir fine particulate matter monitors and for published calibration approaches that vary in complexity, ranging from simply applying linear corrections to those requiring co-locating a filter sample for correction with a gravimetric concentration during a baseline visit. Our data includes PurpleAir devices that were co-located in each home with a gravimetric sample for 1-week periods (265 samples from 151 homes). Weekly-averaged gravimetric concentrations ranged between the limit of detection (3 µg/m3) and 330 µg/m3. We found a strong correlation between the PurpleAir monitor and the gravimetric concentration (R>0.91) using internal calibrations provided by the manufacturer. However, the PurpleAir data substantially overestimated indoor concentrations compared to the gravimetric concentration (mean bias error ≥ 23.6 µg/m3 using internal calibrations provided by the manufacturer). Calibrations based on ambient air data maintained high correlations (R ≥ 0.92) and substantially reduced bias (e.g. mean bias error = 10.1 µg/m3 using a US-wide calibration approach). Using a gravimetric sample from a baseline visit to calibrate data for later visits led to an improvement over the internal calibrations, but performed worse than the simpler calibration approaches based on ambient air pollution data. Furthermore, calibrations based on ambient air pollution data performed best when weekly-averaged concentrations did not exceed 30 µg/m3, likely because the majority of the data used to train these models were below this concentration.

2.

Modeling Multivariate Spatial Dependencies Using Graphical Models.

Dey, Debangan; Datta, Abhirup; Banerjee, Sudipto.

N Engl J Stat Data Sci ; 1(2): 283-295, 2023 Sep.

Artigo em Inglês | MEDLINE | ID: mdl-37817840

RESUMO

Graphical models have witnessed significant growth and usage in spatial data science for modeling data referenced over a massive number of spatial-temporal coordinates. Much of this literature has focused on a single or relatively few spatially dependent outcomes. Recent attention has focused upon addressing modeling and inference for substantially large number of outcomes. While spatial factor models and multivariate basis expansions occupy a prominent place in this domain, this article elucidates a recent approach, graphical Gaussian Processes, that exploits the notion of conditional independence among a very large number of spatial processes to build scalable graphical models for fully model-based Bayesian analysis of multivariate spatial data.

3.

Improved fMRI-based pain prediction using Bayesian group-wise functional registration.

Wang, Guoqing; Datta, Abhirup; Lindquist, Martin A.

Biostatistics ; 2023 Oct 06.

Artigo em Inglês | MEDLINE | ID: mdl-37805937

RESUMO

In recent years, the field of neuroimaging has undergone a paradigm shift, moving away from the traditional brain mapping approach towards the development of integrated, multivariate brain models that can predict categories of mental events. However, large interindividual differences in both brain anatomy and functional localization after standard anatomical alignment remain a major limitation in performing this type of analysis, as it leads to feature misalignment across subjects in subsequent predictive models. This article addresses this problem by developing and validating a new computational technique for reducing misalignment across individuals in functional brain systems by spatially transforming each subject's functional data to a common latent template map. Our proposed Bayesian functional group-wise registration approach allows us to assess differences in brain function across subjects and individual differences in activation topology. We achieve the probabilistic registration with inverse-consistency by utilizing the generalized Bayes framework with a loss function for the symmetric group-wise registration. It models the latent template with a Gaussian process, which helps capture spatial features in the template, producing a more precise estimation. We evaluate the method in simulation studies and apply it to data from an fMRI study of thermal pain, with the goal of using functional brain activity to predict physical pain. We find that the proposed approach allows for improved prediction of reported pain scores over conventional approaches. Received on 2 January 2017. Editorial decision on 8 June 2021.

4.

nnSVG for the scalable identification of spatially variable genes using nearest-neighbor Gaussian processes.

Weber, Lukas M; Saha, Arkajyoti; Datta, Abhirup; Hansen, Kasper D; Hicks, Stephanie C.

Nat Commun ; 14(1): 4059, 2023 07 10.

Artigo em Inglês | MEDLINE | ID: mdl-37429865

RESUMO

Feature selection to identify spatially variable genes or other biologically informative genes is a key step during analyses of spatially-resolved transcriptomics data. Here, we propose nnSVG, a scalable approach to identify spatially variable genes based on nearest-neighbor Gaussian processes. Our method (i) identifies genes that vary in expression continuously across the entire tissue or within a priori defined spatial domains, (ii) uses gene-specific estimates of length scale parameters within the Gaussian process models, and (iii) scales linearly with the number of spatial locations. We demonstrate the performance of our method using experimental data from several technological platforms and simulations. A software implementation is available at https://bioconductor.org/packages/nnSVG .

Assuntos

Perfilação da Expressão Gênica , Software , Análise por Conglomerados , Distribuição Normal

5.

Identifying optimal co-location calibration periods for low-cost sensors.

Zamora, Misti Levy; Buehler, Colby; Datta, Abhirup; Gentner, Drew R; Koehler, Kirsten.

Atmos Meas Tech ; 16(1): 169-179, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-37323467

RESUMO

Low-cost sensors are often co-located with reference instruments to assess their performance and establish calibration equations, but limited discussion has focused on whether the duration of this calibration period can be optimized. We placed a multipollutant monitor that contained sensors that measure particulate matter smaller than 2.5 µm (PM2.5), carbon monoxide (CO), nitrogen dioxide (NO2), ozone (O3), and nitric oxide (NO) at a reference field site for one year. We developed calibration equations using randomly selected co-location subsets spanning 1 to 180 consecutive days out of the 1-year period and compared the potential root mean square errors (RMSE) and Pearson correlation coefficients (r). The co-located calibration period required to obtain consistent results varied by sensor type, and several factors increased the co-location duration required for accurate calibration, including the response of a sensor to environmental factors, such as temperature or relative humidity (RH), or cross-sensitivities to other pollutants. Using measurements from Baltimore, MD, where a broad range of environmental conditions may be observed over a given year, we found diminishing improvements in the median RMSE for calibration periods longer than about six weeks for all the sensors. The best performing calibration periods were the ones that contained a range of environmental conditions similar to those encountered during the evaluation period (i.e., all other days of the year not used in the calibration). With optimal, varying conditions it was possible to obtain an accurate calibration in as little as one week for all sensors, suggesting that co-location can be minimized if the period is strategically selected and monitored so that the calibration period is representative of the desired measurement setting.

6.

An illustration of model agnostic explainability methods applied to environmental data.

Wikle, Christopher K; Datta, Abhirup; Hari, Bhava Vyasa; Boone, Edward L; Sahoo, Indranil; Kavila, Indulekha; Castruccio, Stefano; Simmons, Susan J; Burr, Wesley S; Chang, Won.

Environmetrics ; 34(1)2023 Feb.

Artigo em Inglês | MEDLINE | ID: mdl-37200542

RESUMO

Historically, two primary criticisms statisticians have of machine learning and deep neural models is their lack of uncertainty quantification and the inability to do inference (i.e., to explain what inputs are important). Explainable AI has developed in the last few years as a sub-discipline of computer science and machine learning to mitigate these concerns (as well as concerns of fairness and transparency in deep modeling). In this article, our focus is on explaining which inputs are important in models for predicting environmental data. In particular, we focus on three general methods for explainability that are model agnostic and thus applicable across a breadth of models without internal explainability: "feature shuffling", "interpretable local surrogates", and "occlusion analysis". We describe particular implementations of each of these and illustrate their use with a variety of models, all applied to the problem of long-lead forecasting monthly soil moisture in the North American corn belt given sea surface temperature anomalies in the Pacific Ocean.

7.

Laboratory and field evaluation of a low-cost methane sensor and key environmental factors for sensor calibration.

Lin, Joyce J Y; Buehler, Colby; Datta, Abhirup; Gentner, Drew R; Koehler, Kirsten; Zamora, Misti Levy.

Environ Sci Atmos ; 3(4): 683-694, 2023 Apr 13.

Artigo em Inglês | MEDLINE | ID: mdl-37063944

RESUMO

Low-cost sensors enable finer-scale spatiotemporal measurements within the existing methane (CH4) monitoring infrastructure and could help cities mitigate CH4 emissions to meet their climate goals. While initial studies of low-cost CH4 sensors have shown potential for effective CH4 measurement at ambient concentrations, sensor deployment remains limited due to questions about interferences and calibration across environments and seasons. This study evaluates sensor performance across seasons with specific attention paid to the sensor's understudied carbon monoxide (CO) interferences and environmental dependencies through long-term ambient co-location in an urban environment. The sensor was first evaluated in a laboratory using chamber calibration and co-location experiments, and then in the field through two 8 week co-locations with a reference CH4 instrument. In the laboratory, the sensor was sensitive to CH4 concentrations below ambient background concentrations. Different sensor units responded similarly to changing CH4, CO, temperature, and humidity conditions but required individual calibrations to account for differences in sensor response factors. When deployed in-field, co-located with a reference instrument near Baltimore, MD, the sensor captured diurnal trends in hourly CH4 concentration after corrections for temperature, absolute humidity, CO concentration, and hour of day. Variable performance was observed across seasons with the sensor performing well (R 2 = 0.65; percent bias 3.12%; RMSE 0.10 ppm) in the winter validation period and less accurately (R 2 = 0.12; percent bias 3.01%; RMSE 0.08 ppm) in the summer validation period where there was less dynamic range in CH4 concentrations. The results highlight the utility of sensor deployment in more variable ambient CH4 conditions and demonstrate the importance of accounting for temperature and humidity dependencies as well as co-located CO concentrations with low-cost CH4 measurements. We show this can be addressed via Multiple Linear Regression (MLR) models accounting for key covariates to enable urban measurements in areas with CH4 enhancement. Together with individualized calibration prior to deployment, the sensor shows promise for use in low-cost sensor networks and represents a valuable supplement to existing monitoring strategies to identify CH4 hotspots.

8.

Multi-Cause Calibration of Verbal Autopsy-Based Cause-Specific Mortality Estimates of Children and Neonates in Mozambique.

Gilbert, Brian; Fiksel, Jacob; Wilson, Emily; Kalter, Henry; Kante, Almamy; Akum, Aveika; Blau, Dianna; Bassat, Quique; Macicame, Ivalda; Samo Gudo, Eduardo; Black, Robert; Zeger, Scott; Amouzou, Agbessi; Datta, Abhirup.

Am J Trop Med Hyg ; 108(5_Suppl): 78-89, 2023 05 02.

Artigo em Inglês | MEDLINE | ID: mdl-37037430

RESUMO

The Countrywide Mortality Surveillance for Action platform is collecting verbal autopsy (VA) records from a nationally representative sample in Mozambique. These records are used to estimate the national and subnational cause-specific mortality fractions (CSMFs) for children (1-59 months) and neonates (1-28 days). Cross-tabulation of VA-based cause-of-death (COD) determination against that from the minimally invasive tissue sampling (MITS) from the Child Health and Mortality Prevention project revealed important misclassification errors for all the VA algorithms, which if not accounted for will lead to bias in the estimates of CSMF from VA. A recently proposed Bayesian VA-calibration method is used that accounts for this misclassification bias and produces calibrated estimates of CSMF. Both the VA-COD and the MITS-COD can be multi-cause (i.e., suggest more than one probable COD for some of the records). To fully use this probabilistic COD data, we use the multi-cause VA calibration. Two different computer-coded VA algorithms are considered-InSilicoVA and EAVA-and the final CSMF estimates are obtained using an ensemble calibration that uses data from both the algorithms. The calibrated estimates consistently offer a better fit to the data and reveal important changes in the CSMF for both children and neonates in Mozambique after accounting for VA misclassification bias.

Assuntos

Morte , Recém-Nascido , Humanos , Criança , Autopsia , Causas de Morte , Moçambique/epidemiologia , Teorema de Bayes , Calibragem

9.

Correcting for Verbal Autopsy Misclassification Bias in Cause-Specific Mortality Estimates.

Fiksel, Jacob; Gilbert, Brian; Wilson, Emily; Kalter, Henry; Kante, Almamy; Akum, Aveika; Blau, Dianna; Bassat, Quique; Macicame, Ivalda; Samo Gudo, Eduardo; Black, Robert; Zeger, Scott; Amouzou, Agbessi; Datta, Abhirup.

Am J Trop Med Hyg ; 108(5_Suppl): 66-77, 2023 05 02.

Artigo em Inglês | MEDLINE | ID: mdl-37037438

RESUMO

Verbal autopsies (VAs) are extensively used to determine cause of death (COD) in many low- and middle-income countries. However, COD determination from VA can be inaccurate. Computer coded verbal autopsy (CCVA) algorithms used for this task are imperfect and misclassify COD for a large proportion of deaths. If not accounted for, this misclassification leads to biased estimates of cause-specific mortality fractions (CSMFs), a critical piece in health-policy making. Recent work has demonstrated that the knowledge of the CCVA misclassification rates can be used to calibrate raw VA-based CSMF estimates to account for the misclassification bias. In this manuscript, we review the current practices and issues with raw COD predictions from CCVA algorithms and provide a complete primer on how to use the VA calibration approach with the calibratedVA software to correct for verbal autopsy misclassification bias in cause-specific mortality estimates. We use calibratedVA to obtain CSMFs for child (1-59 months) and neonatal deaths using VA data from the Countrywide Mortality Surveillance for Action project in Mozambique.

Assuntos

Algoritmos , Software , Criança , Recém-Nascido , Humanos , Autopsia , Causas de Morte , Moçambique , Mortalidade

10.

Countrywide Mortality Surveillance for Action in Mozambique: Results from a National Sample-Based Vital Statistics System for Mortality and Cause of Death.

Macicame, Ivalda; Kante, Almamy M; Wilson, Emily; Gilbert, Brian; Koffi, Alain; Nhachungue, Sheila; Monjane, Celso; Duce, Pedro; Adriano, Antonio; Chicumbe, Sergio; Jani, Ilesh; Kalter, Henry D; Datta, Abhirup; Zeger, Scott; Black, Robert E; Gudo, Eduardo Samo; Amouzou, Agbessi.

Am J Trop Med Hyg ; 108(5_Suppl): 5-16, 2023 05 02.

Artigo em Inglês | MEDLINE | ID: mdl-37037442

RESUMO

Sub-Saharan Africa lacks timely, reliable, and accurate national data on mortality and causes of death (CODs). In 2018 Mozambique launched a sample registration system (Countrywide Mortality Surveillance for Action [COMSA]-Mozambique), which collects continuous birth, death, and COD data from 700 randomly selected clusters, a nationally representative population of 828,663 persons. Verbal and social autopsy interviews are conducted for COD determination. We analyzed data collected in 2019-2020 to report mortality rates and cause-specific fractions. Cause-specific results were generated using computer-coded verbal autopsy (CCVA) algorithms for deaths among those age 5 years and older. For under-five deaths, the accuracy of CCVA results was increased through calibration with data from minimally invasive tissue sampling. Neonatal and under-five mortality rates were, respectively, 23 (95% CI: 18-28) and 80 (95% CI: 69-91) deaths per 1,000 live births. Mortality rates per 1,000 were 18 (95% CI: 14-21) among age 5-14 years, 26 (95% CI: 20-31) among age 15-24 years, 258 (95% CI: 230-287) among age 25-59 years, and 531 (95% CI: 490-572) among age 60+ years. Urban areas had lower mortality rates than rural areas among children under 15 but not among adults. Deaths due to infections were substantial across all ages. Other predominant causes by age group were prematurity and intrapartum-related events among neonates; diarrhea, malaria, and lower respiratory infections among children 1-59 months; injury, malaria, and diarrhea among children 5-14 years; HIV, injury, and cancer among those age 15-59 years; and cancer and cardiovascular disease at age 60+ years. The COMSA-Mozambique platform offers a rich and unique system for mortality and COD determination and monitoring and an opportunity to build a comprehensive surveillance system.

Assuntos

Doenças Cardiovasculares , Neoplasias , Criança , Recém-Nascido , Adulto , Humanos , Lactente , Pessoa de Meia-Idade , Pré-Escolar , Adolescente , Adulto Jovem , Causas de Morte , Moçambique/epidemiologia , Diarreia , Mortalidade

11.

A DYNAMIC SPATIAL FILTERING APPROACH TO MITIGATE UNDERESTIMATION BIAS IN FIELD CALIBRATED LOW-COST SENSOR AIR POLLUTION DATA.

Heffernan, Claire; PenG, Roger; Gentner, Drew R; Koehler, Kirsten; Datta, Abhirup.

Ann Appl Stat ; 17(4): 3056-3087, 2023 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-38646662

RESUMO

Low-cost air pollution sensors, offering hyper-local characterization of pollutant concentrations, are becoming increasingly prevalent in environmental and public health research. However, low-cost air pollution data can be noisy, biased by environmental conditions, and usually need to be field-calibrated by collocating low-cost sensors with reference-grade instruments. We show, theoretically and empirically, that the common procedure of regression-based calibration using collocated data systematically underestimates high air pollution concentrations, which are critical to diagnose from a health perspective. Current calibration practices also often fail to utilize the spatial correlation in pollutant concentrations. We propose a novel spatial filtering approach to collocation-based calibration of low-cost networks that mitigates the underestimation issue by using an inverse regression. The inverse-regression also allows for incorporating spatial correlations by a second-stage model for the true pollutant concentrations using a conditional Gaussian Process. Our approach works with one or more collocated sites in the network and is dynamic, leveraging spatial correlation with the latest available reference data. Through extensive simulations, we demonstrate how the spatial filtering substantially improves estimation of pollutant concentrations, and measures peak concentrations with greater accuracy. We apply the methodology for calibration of a low-cost PM2.5 network in Baltimore, Maryland, and diagnose air pollution peaks that are missed by the regression-calibration.

12.

Non-linear probabilistic calibration of low-cost environmental air pollution sensor networks for neighborhood level spatiotemporal exposure assessment.

Patton, Andrew; Datta, Abhirup; Zamora, Misti Levy; Buehler, Colby; Xiong, Fulizi; Gentner, Drew R; Koehler, Kirsten.

J Expo Sci Environ Epidemiol ; 32(6): 908-916, 2022 11.

Artigo em Inglês | MEDLINE | ID: mdl-36352094

RESUMO

BACKGROUND: Low-cost sensor networks for monitoring air pollution are an effective tool for expanding spatial resolution beyond the capabilities of existing state and federal reference monitoring stations. However, low-cost sensor data commonly exhibit non-linear biases with respect to environmental conditions that cannot be captured by linear models, therefore requiring extensive lab calibration. Further, these calibration models traditionally produce point estimates or uniform variance predictions which limits their downstream in exposure assessment. OBJECTIVE: Build direct field-calibration models using probabilistic gradient boosted decision trees (GBDT) that eliminate the need for resource-intensive lab calibration and that can be used to conduct probabilistic exposure assessments on the neighborhood level. METHODS: Using data from Plantower A003 particulate matter (PM) sensors deployed in Baltimore, MD from November 2018 through November 2019, a fully probabilistic NGBoost GBDT was trained on raw data from sensors co-located with a federal reference monitoring station and compared against linear regression trained on lab calibrated sensor data. The NGBoost predictions were then used in a Monte Carlo interpolation process to generate high spatial resolution probabilistic exposure gradients across Baltimore. RESULTS: We demonstrate that direct field-calibration of the raw PM2.5 sensor data using a probabilistic GBDT has improved point and distribution accuracies compared to the linear model, particularly at reference measurements exceeding 25 µg/m3, and also on monitors not included in the training set. SIGNIFICANCE: We provide a framework for utilizing the GBDT to conduct probabilistic spatial assessments of human exposure with inverse distance weighting that predicts the probability of a given location exceeding an exposure threshold and provides percentiles of exposure. These probabilistic spatial exposure assessments can be scaled by time and space with minimal modifications. Here, we used the probabilistic exposure assessment methodology to create high quality spatial-temporal PM2.5 maps on the neighborhood-scale in Baltimore, MD. IMPACT STATEMENT: We demonstrate how the use of open-source probabilistic machine learning models for in-place sensor calibration outperforms traditional linear models and does not require an initial laboratory calibration step. Further, these probabilistic models can create uniquely probabilistic spatial exposure assessments following a Monte Carlo interpolation process.

Assuntos

Poluição do Ar , Humanos , Baltimore

13.

Evaluating the Performance of Using Low-Cost Sensors to Calibrate for Cross-Sensitivities in a Multipollutant Network.

Zamora, Misti Levy; Buehler, Colby; Lei, Hao; Datta, Abhirup; Xiong, Fulizi; Gentner, Drew R; Koehler, Kirsten.

ACS ES T Eng ; 2(5): 780-793, 2022 May 13.

Artigo em Inglês | MEDLINE | ID: mdl-35937506

RESUMO

As part of our low-cost sensor network, we colocated multipollutant monitors containing sensors for particulate matter, carbon monoxide, ozone, nitrogen dioxide, and nitrogen monoxide at a reference field site in Baltimore, MD, for 1 year. The first 6 months were used for training multiple regression models, and the second 6 months were used to evaluate the models. The models produced accurate hourly concentrations for all sensors except ozone, which likely requires nonlinear methods to capture peak summer concentrations. The models for all five pollutants produced high Pearson correlation coefficients (r > 0.85), and the hourly averaged calibrated sensor and reference concentrations from the evaluation period were within 3-12%. Each sensor required a distinct set of predictors to achieve the lowest possible root-mean-square error (RMSE). All five sensors responded to environmental factors, and three sensors exhibited cross-sensitives to another air pollutant. We compared the RMSE from models (NO2, O3, and NO) that used colocated regulatory instruments and colocated sensors as predictors to address the cross-sensitivities to another gas, and the corresponding model RMSEs for the three gas models were all within 0.5 ppb. This indicates that low-cost sensor networks can yield useable data if the monitoring package is designed to comeasure key predictors. This is key for the utilization of low-cost sensors by diverse audiences since this does not require continual access to regulatory grade instruments.

14.

Hierarchical multivariate directed acyclic graph autoregressive models for spatial diseases mapping.

Gao, Leiwen; Datta, Abhirup; Banerjee, Sudipto.

Stat Med ; 41(16): 3057-3075, 2022 07 20.

Artigo em Inglês | MEDLINE | ID: mdl-35708210

RESUMO

Disease mapping is an important statistical tool used by epidemiologists to assess geographic variation in disease rates and identify lurking environmental risk factors from spatial patterns. Such maps rely upon spatial models for regionally aggregated data, where neighboring regions tend to exhibit similar outcomes than those farther apart. We contribute to the literature on multivariate disease mapping, which deals with measurements on multiple (two or more) diseases in each region. We aim to disentangle associations among the multiple diseases from spatial autocorrelation in each disease. We develop multivariate directed acyclic graphical autoregression models to accommodate spatial and inter-disease dependence. The hierarchical construction imparts flexibility and richness, interpretability of spatial autocorrelation and inter-disease relationships, and computational ease, but depends upon the order in which the cancers are modeled. To obviate this, we demonstrate how Bayesian model selection and averaging across orders are easily achieved using bridge sampling. We compare our method with a competitor using simulation studies and present an application to multiple cancer mapping using data from the Surveillance, Epidemiology, and End Results program.

Assuntos

Neoplasias , Teorema de Bayes , Simulação por Computador , Humanos , Modelos Estatísticos , Neoplasias/epidemiologia , Análise Espacial

15.

Efficient estimation of SNP heritability using Gaussian predictive process in large scale cohort studies.

Seal, Souvik; Datta, Abhirup; Basu, Saonli.

PLoS Genet ; 18(4): e1010151, 2022 04.

Artigo em Inglês | MEDLINE | ID: mdl-35442943

RESUMO

With the advent of high throughput genetic data, there have been attempts to estimate heritability from genome-wide SNP data on a cohort of distantly related individuals using linear mixed model (LMM). Fitting such an LMM in a large scale cohort study, however, is tremendously challenging due to its high dimensional linear algebraic operations. In this paper, we propose a new method named PredLMM approximating the aforementioned LMM motivated by the concepts of genetic coalescence and Gaussian predictive process. PredLMM has substantially better computational complexity than most of the existing LMM based methods and thus, provides a fast alternative for estimating heritability in large scale cohort studies. Theoretically, we show that under a model of genetic coalescence, the limiting form of our approximation is the celebrated predictive process approximation of large Gaussian process likelihoods that has well-established accuracy standards. We illustrate our approach with extensive simulation studies and use it to estimate the heritability of multiple quantitative traits from the UK Biobank cohort.

Assuntos

Estudo de Associação Genômica Ampla , Modelos Genéticos , Estudos de Coortes , Estudo de Associação Genômica Ampla/métodos , Humanos , Modelos Lineares , Distribuição Normal , Fenótipo , Polimorfismo de Nucleotídeo Único/genética

16.

A transformation-free linear regression for compositional outcomes and predictors.

Fiksel, Jacob; Zeger, Scott; Datta, Abhirup.

Biometrics ; 78(3): 974-987, 2022 09.

Artigo em Inglês | MEDLINE | ID: mdl-33788259

RESUMO

Compositional data are common in many fields, both as outcomes and predictor variables. The inventory of models for the case when both the outcome and predictor variables are compositional is limited, and the existing models are often difficult to interpret in the compositional space, due to their use of complex log-ratio transformations. We develop a transformation-free linear regression model where the expected value of the compositional outcome is expressed as a single Markov transition from the compositional predictor. Our approach is based on estimating equations thereby not requiring complete specification of data likelihood and is robust to different data-generating mechanisms. Our model is simple to interpret, allows for 0s and 1s in both the compositional outcome and covariates, and subsumes several interesting subcases of interest. We also develop permutation tests for linear independence and equality of effect sizes of two components of the predictor. Finally, we show that despite its simplicity, our model accurately captures the relationship between compositional data using two datasets from education and medical research.

Assuntos

Modelos Lineares

17.

RandomForestsGLS: An R package for Random Forests for dependent data.

Saha, Arkajyoti; Basu, Sumanta; Datta, Abhirup.

J Open Source Softw ; 7(71)2022.

Artigo em Inglês | MEDLINE | ID: mdl-37077317

RESUMO

With the modern advances in geographical information systems, remote sensing technologies, and low-cost sensors, we are increasingly encountering datasets where we need to account for spatial or serial dependence. Dependent observations (y 1, y 2, , yn ) with covariates (x1, ..., x n ) can be modeled non-parametrically as yi = m(x i ) + Ïµi , where m(x i ) is mean component and ∈i accounts for the dependency in data. We assume that dependence is captured through a covariance function of the correlated stochastic process ∈i (second order dependence). The correlation is typically a function of "spatial distance" or "time-lag" between two observations. Unlike linear regression, non-linear Machine Learning (ML) methods for estimating the regression function m can capture complex interactions among the variables. However, they often fail to account for the dependence structure, resulting in sub-optimal estimation. On the other hand, specialized software for spatial/temporal data properly models data correlation but lacks flexibility in modeling the mean function m by only focusing on linear models. RandomForestsGLS bridges the gap through a novel rendition of Random Forests (RF) - namely, RF-GLS - by explicitly modeling the spatial/serial data correlation in the RF fitting procedure to substantially improve the estimation of the mean function. Additionally, RandomForestsGLS leverages kriging to perform predictions at new locations for geo-spatial data.

18.

BAYESIAN FUNCTIONAL REGISTRATION OF FMRI ACTIVATION MAPS.

Wang, Guoqing; Datta, Abhirup; Lindquist, Martin A.

Ann Appl Stat ; 16(3): 1676-1699, 2022 Sep.

Artigo em Inglês | MEDLINE | ID: mdl-37396344

RESUMO

Functional magnetic resonance imaging (fMRI) has provided invaluable insight into our understanding of human behavior. However, large inter-individual differences in both brain anatomy and functional localization after anatomical alignment remain a major limitation in conducting group analyses and performing population level inference. This paper addresses this problem by developing and validating a new computational technique for reducing misalignment across individuals in functional brain systems by spatially transforming each subjects functional data to a common reference map. Our proposed Bayesian functional registration approach allows us to assess differences in brain function across subjects and individual differences in activation topology. It combines intensity-based and feature-based information into an integrated framework, and allows inference to be performed on the transformation via the posterior samples. We evaluate the method in a simulation study and apply it to data from a study of thermal pain. We find that the proposed approach provides increased sensitivity for group-level inference.

19.

Graphical Gaussian Process Models for Highly Multivariate Spatial Data.

Dey, Debangan; Datta, Abhirup; Banerjee, Sudipto.

Biometrika ; 109(4): 993-1014, 2022 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-36643962

RESUMO

For multivariate spatial Gaussian process (GP) models, customary specifications of cross-covariance functions do not exploit relational inter-variable graphs to ensure process-level conditional independence among the variables. This is undesirable, especially for highly multivariate settings, where popular cross-covariance functions such as the multivariate Matérn suffer from a "curse of dimensionality" as the number of parameters and floating point operations scale up in quadratic and cubic order, respectively, in the number of variables. We propose a class of multivariate "Graphical Gaussian Processes" using a general construction called "stitching" that crafts cross-covariance functions from graphs and ensures process-level conditional independence among variables. For the Matérn family of functions, stitching yields a multivariate GP whose univariate components are Matérn GPs, and conforms to process-level conditional independence as specified by the graphical model. For highly multivariate settings and decomposable graphical models, stitching offers massive computational gains and parameter dimension reduction. We demonstrate the utility of the graphical Matérn GP to jointly model highly multivariate spatial data using simulation examples and an application to air-pollution modelling.

20.

Scalable Predictions for Spatial Probit Linear Mixed Models Using Nearest Neighbor Gaussian Processes.

Saha, Arkajyoti; Datta, Abhirup; Banerjee, Sudipto.

J Data Sci ; 20(4): 533-544, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-37786782

RESUMO

Spatial probit generalized linear mixed models (spGLMM) with a linear fixed effect and a spatial random effect, endowed with a Gaussian Process prior, are widely used for analysis of binary spatial data. However, the canonical Bayesian implementation of this hierarchical mixed model can involve protracted Markov Chain Monte Carlo sampling. Alternate approaches have been proposed that circumvent this by directly representing the marginal likelihood from spGLMM in terms of multivariate normal cummulative distribution functions (cdf). We present a direct and fast rendition of this latter approach for predictions from a spatial probit linear mixed model. We show that the covariance matrix of the cdf characterizing the marginal cdf of binary spatial data from spGLMM is amenable to approximation using Nearest Neighbor Gaussian Processes (NNGP). This facilitates a scalable prediction algorithm for spGLMM using NNGP that only involves sparse or small matrix computations and can be deployed in an embarrassingly parallel manner. We demonstrate the accuracy and scalability of the algorithm via numerous simulation experiments and an analysis of species presence-absence data.

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA