Search | VHL Search Portal

1.

Efficient estimation of SNP heritability using Gaussian predictive process in large scale cohort studies.

Seal, Souvik; Datta, Abhirup; Basu, Saonli.

PLoS Genet ; 18(4): e1010151, 2022 04.

Article in English | MEDLINE | ID: mdl-35442943

ABSTRACT

With the advent of high throughput genetic data, there have been attempts to estimate heritability from genome-wide SNP data on a cohort of distantly related individuals using linear mixed model (LMM). Fitting such an LMM in a large scale cohort study, however, is tremendously challenging due to its high dimensional linear algebraic operations. In this paper, we propose a new method named PredLMM approximating the aforementioned LMM motivated by the concepts of genetic coalescence and Gaussian predictive process. PredLMM has substantially better computational complexity than most of the existing LMM based methods and thus, provides a fast alternative for estimating heritability in large scale cohort studies. Theoretically, we show that under a model of genetic coalescence, the limiting form of our approximation is the celebrated predictive process approximation of large Gaussian process likelihoods that has well-established accuracy standards. We illustrate our approach with extensive simulation studies and use it to estimate the heritability of multiple quantitative traits from the UK Biobank cohort.

Subject(s)

Genome-Wide Association Study , Models, Genetic , Cohort Studies , Genome-Wide Association Study/methods , Humans , Linear Models , Normal Distribution , Phenotype , Polymorphism, Single Nucleotide/genetics

2.

A causal machine-learning framework for studying policy impact on air pollution: A case-study in COVID-19 lockdowns.

Heffernan, Claire; Koehler, Kirsten; Zamora, Misti Levy; Buehler, Colby; Gentner, Drew R; Peng, Roger D; Datta, Abhirup.

Am J Epidemiol ; 2024 Jul 03.

Article in English | MEDLINE | ID: mdl-38960671

ABSTRACT

When studying the impact of policy interventions or natural experiments on air pollution, such as new environmental policies and opening or closing an industrial facility, careful statistical analysis is needed to separate causal changes from other confounding factors. Using COVID-19 lockdowns as a case-study, we present a comprehensive framework for estimating and validating causal changes from such perturbations. We propose using flexible machine learning-based comparative interrupted time series (CITS) models for estimating such a causal effect. We outline the assumptions required to identify causal effects, showing that many common methods rely on strong assumptions that are relaxed by machine learning models. For empirical validation, we also propose a simple diagnostic criterion, guarding against false effects in baseline years when there was no intervention. The framework is applied to study the impact of COVID-19 lockdowns on NO2 in the eastern US. The machine learning approaches guard against false effects better than common methods and suggest decreases in NO2 in Boston, New York City, Baltimore, and Washington D.C. The study showcases the importance of our validation framework in selecting a suitable method and the utility of a machine learning based CITS model for studying causal changes in air pollution time series.

3.

Improved fMRI-based pain prediction using Bayesian group-wise functional registration.

Wang, Guoqing; Datta, Abhirup; Lindquist, Martin A.

Biostatistics ; 2023 Oct 06.

Article in English | MEDLINE | ID: mdl-37805937

ABSTRACT

In recent years, the field of neuroimaging has undergone a paradigm shift, moving away from the traditional brain mapping approach towards the development of integrated, multivariate brain models that can predict categories of mental events. However, large interindividual differences in both brain anatomy and functional localization after standard anatomical alignment remain a major limitation in performing this type of analysis, as it leads to feature misalignment across subjects in subsequent predictive models. This article addresses this problem by developing and validating a new computational technique for reducing misalignment across individuals in functional brain systems by spatially transforming each subject's functional data to a common latent template map. Our proposed Bayesian functional group-wise registration approach allows us to assess differences in brain function across subjects and individual differences in activation topology. We achieve the probabilistic registration with inverse-consistency by utilizing the generalized Bayes framework with a loss function for the symmetric group-wise registration. It models the latent template with a Gaussian process, which helps capture spatial features in the template, producing a more precise estimation. We evaluate the method in simulation studies and apply it to data from an fMRI study of thermal pain, with the goal of using functional brain activity to predict physical pain. We find that the proposed approach allows for improved prediction of reported pain scores over conventional approaches. Received on 2 January 2017. Editorial decision on 8 June 2021.

4.

Evaluation of Calibration Approaches for Indoor Deployments of PurpleAir Monitors.

Koehler, Kirsten; Wilks, Megan; Green, Tim; Rule, Ana M; Zamora, Misti L; Buehler, Colby; Datta, Abhirup; Gentner, Drew R; Putcha, Nirupama; Hansel, Nadia N; Kirk, Gregory D; Raju, Sarath; McCormack, Meredith.

Atmos Environ (1994) ; 3102023 Oct 01.

Article in English | MEDLINE | ID: mdl-37901719

ABSTRACT

Low-cost air quality monitors are growing in popularity among both researchers and community members to understand variability in pollutant concentrations. Several studies have produced calibration approaches for these sensors for ambient air. These calibrations have been shown to depend primarily on relative humidity, particle size distribution, and particle composition, which may be different in indoor environments. However, despite the fact that most people spend the majority of their time indoors, little is known about the accuracy of commonly used devices indoors. This stems from the fact that calibration data for sensors operating in indoor environments are rare. In this study, we sought to evaluate the accuracy of the raw data from PurpleAir fine particulate matter monitors and for published calibration approaches that vary in complexity, ranging from simply applying linear corrections to those requiring co-locating a filter sample for correction with a gravimetric concentration during a baseline visit. Our data includes PurpleAir devices that were co-located in each home with a gravimetric sample for 1-week periods (265 samples from 151 homes). Weekly-averaged gravimetric concentrations ranged between the limit of detection (3 µg/m3) and 330 µg/m3. We found a strong correlation between the PurpleAir monitor and the gravimetric concentration (R>0.91) using internal calibrations provided by the manufacturer. However, the PurpleAir data substantially overestimated indoor concentrations compared to the gravimetric concentration (mean bias error ≥ 23.6 µg/m3 using internal calibrations provided by the manufacturer). Calibrations based on ambient air data maintained high correlations (R ≥ 0.92) and substantially reduced bias (e.g. mean bias error = 10.1 µg/m3 using a US-wide calibration approach). Using a gravimetric sample from a baseline visit to calibrate data for later visits led to an improvement over the internal calibrations, but performed worse than the simpler calibration approaches based on ambient air pollution data. Furthermore, calibrations based on ambient air pollution data performed best when weekly-averaged concentrations did not exceed 30 µg/m3, likely because the majority of the data used to train these models were below this concentration.

5.

An illustration of model agnostic explainability methods applied to environmental data.

Wikle, Christopher K; Datta, Abhirup; Hari, Bhava Vyasa; Boone, Edward L; Sahoo, Indranil; Kavila, Indulekha; Castruccio, Stefano; Simmons, Susan J; Burr, Wesley S; Chang, Won.

Environmetrics ; 34(1)2023 Feb.

Article in English | MEDLINE | ID: mdl-37200542

ABSTRACT

Historically, two primary criticisms statisticians have of machine learning and deep neural models is their lack of uncertainty quantification and the inability to do inference (i.e., to explain what inputs are important). Explainable AI has developed in the last few years as a sub-discipline of computer science and machine learning to mitigate these concerns (as well as concerns of fairness and transparency in deep modeling). In this article, our focus is on explaining which inputs are important in models for predicting environmental data. In particular, we focus on three general methods for explainability that are model agnostic and thus applicable across a breadth of models without internal explainability: "feature shuffling", "interpretable local surrogates", and "occlusion analysis". We describe particular implementations of each of these and illustrate their use with a variety of models, all applied to the problem of long-lead forecasting monthly soil moisture in the North American corn belt given sea surface temperature anomalies in the Pacific Ocean.

6.

Regularized Bayesian transfer learning for population-level etiological distributions.

Datta, Abhirup; Fiksel, Jacob; Amouzou, Agbessi; Zeger, Scott L.

Biostatistics ; 22(4): 836-857, 2021 10 13.

Article in English | MEDLINE | ID: mdl-32040180

ABSTRACT

Computer-coded verbal autopsy (CCVA) algorithms predict cause of death from high-dimensional family questionnaire data (verbal autopsy) of a deceased individual, which are then aggregated to generate national and regional estimates of cause-specific mortality fractions. These estimates may be inaccurate if CCVA is trained on non-local training data different from the local population of interest. This problem is a special case of transfer learning, i.e., improving classification within a target domain (e.g., a particular population) with the classifier trained in a source-domain. Most transfer learning approaches concern individual-level (e.g., a person's) classification. Social and health scientists such as epidemiologists are often more interested with understanding etiological distributions at the population-level. The sample sizes of their data sets are typically orders of magnitude smaller than those used for common transfer learning applications like image classification, document identification, etc. We present a parsimonious hierarchical Bayesian transfer learning framework to directly estimate population-level class probabilities in a target domain, using any baseline classifier trained on source-domain, and a small labeled target-domain dataset. To address small sample sizes, we introduce a novel shrinkage prior for the transfer error rates guaranteeing that, in absence of any labeled target-domain data or when the baseline classifier is perfectly accurate, our transfer learning agrees with direct aggregation of predictions from the baseline classifier, thereby subsuming the default practice as a special case. We then extend our approach to use an ensemble of baseline classifiers producing an unified estimate. Theoretical and empirical results demonstrate how the ensemble model favors the most accurate baseline classifier. We present data analyses demonstrating the utility of our approach.

Subject(s)

Algorithms , Machine Learning , Bayes Theorem , Causality , Humans

7.

A transformation-free linear regression for compositional outcomes and predictors.

Fiksel, Jacob; Zeger, Scott; Datta, Abhirup.

Biometrics ; 78(3): 974-987, 2022 09.

Article in English | MEDLINE | ID: mdl-33788259

ABSTRACT

Compositional data are common in many fields, both as outcomes and predictor variables. The inventory of models for the case when both the outcome and predictor variables are compositional is limited, and the existing models are often difficult to interpret in the compositional space, due to their use of complex log-ratio transformations. We develop a transformation-free linear regression model where the expected value of the compositional outcome is expressed as a single Markov transition from the compositional predictor. Our approach is based on estimating equations thereby not requiring complete specification of data likelihood and is robust to different data-generating mechanisms. Our model is simple to interpret, allows for 0s and 1s in both the compositional outcome and covariates, and subsumes several interesting subcases of interest. We also develop permutation tests for linear independence and equality of effect sizes of two components of the predictor. Finally, we show that despite its simplicity, our model accurately captures the relationship between compositional data using two datasets from education and medical research.

Subject(s)

Linear Models

8.

Hierarchical multivariate directed acyclic graph autoregressive models for spatial diseases mapping.

Gao, Leiwen; Datta, Abhirup; Banerjee, Sudipto.

Stat Med ; 41(16): 3057-3075, 2022 07 20.

Article in English | MEDLINE | ID: mdl-35708210

ABSTRACT

Disease mapping is an important statistical tool used by epidemiologists to assess geographic variation in disease rates and identify lurking environmental risk factors from spatial patterns. Such maps rely upon spatial models for regionally aggregated data, where neighboring regions tend to exhibit similar outcomes than those farther apart. We contribute to the literature on multivariate disease mapping, which deals with measurements on multiple (two or more) diseases in each region. We aim to disentangle associations among the multiple diseases from spatial autocorrelation in each disease. We develop multivariate directed acyclic graphical autoregression models to accommodate spatial and inter-disease dependence. The hierarchical construction imparts flexibility and richness, interpretability of spatial autocorrelation and inter-disease relationships, and computational ease, but depends upon the order in which the cancers are modeled. To obviate this, we demonstrate how Bayesian model selection and averaging across orders are easily achieved using bridge sampling. We compare our method with a competitor using simulation studies and present an application to multiple cancer mapping using data from the Surveillance, Epidemiology, and End Results program.

Subject(s)

Neoplasms , Bayes Theorem , Computer Simulation , Humans , Models, Statistical , Neoplasms/epidemiology , Spatial Analysis

9.

Peering into the dark (ages) with low-frequency space interferometers: Using the 21-cm signal of neutral hydrogen from the infant universe to probe fundamental (Astro)physics.

Koopmans, Léon V E; Barkana, Rennan; Bentum, Mark; Bernardi, Gianni; Boonstra, Albert-Jan; Bowman, Judd; Burns, Jack; Chen, Xuelei; Datta, Abhirup; Falcke, Heino; Fialkov, Anastasia; Gehlot, Bharat; Gurvits, Leonid; Jelic, Vibor; Klein-Wolt, Marc; Lazio, Joseph; Meerburg, Daan; Mellema, Garrelt; Mertens, Florent; Mesinger, Andrei; Offringa, André; Pritchard, Jonathan; Semelin, Benoit; Subrahmanyan, Ravi; Silk, Joseph; Trott, Cathryn; Vedantham, Harish; Verde, Licia; Zaroubi, Saleem; Zarka, Philippe.

Exp Astron (Dordr) ; 51(3): 1641-1676, 2021.

Article in English | MEDLINE | ID: mdl-34511720

ABSTRACT

The Dark Ages and Cosmic Dawn are largely unexplored windows on the infant Universe (z ~ 200-10). Observations of the redshifted 21-cm line of neutral hydrogen can provide valuable new insight into fundamental physics and astrophysics during these eras that no other probe can provide, and drives the design of many future ground-based instruments such as the Square Kilometre Array (SKA) and the Hydrogen Epoch of Reionization Array (HERA). We review progress in the field of high-redshift 21-cm Cosmology, in particular focussing on what questions can be addressed by probing the Dark Ages at z > 30. We conclude that only a space- or lunar-based radio telescope, shielded from the Earth's radio-frequency interference (RFI) signals and its ionosphere, enable the 21-cm signal from the Dark Ages to be detected. We suggest a generic mission design concept, CoDEX, that will enable this in the coming decades.

10.

Statistical field calibration of a low-cost PM_2.5 monitoring network in Baltimore.

Datta, Abhirup; Saha, Arkajyoti; Zamora, Misti Levy; Buehler, Colby; Hao, Lei; Xiong, Fulizi; Gentner, Drew R; Koehler, Kirsten.

Atmos Environ (1994) ; 2422020 Dec 01.

Article in English | MEDLINE | ID: mdl-32922146

ABSTRACT

Low-cost air pollution monitors are increasingly being deployed to enrich knowledge about ambient air-pollution at high spatial and temporal resolutions. However, unlike regulatory-grade (FEM or FRM) instruments, universal quality standards for low-cost sensors are yet to be established and their data quality varies widely. This mandates thorough evaluation and calibration before any responsible use of such data. This study presents evaluation and field-calibration of the PM2.5 data from a network of low-cost monitors currently operating in Baltimore, MD, which has only one regulatory PM2.5 monitoring site within city limits. Co-location analysis at this regulatory site in Oldtown, Baltimore revealed high variability and significant overestimation of PM2.5 levels by the raw data from these monitors. Universal laboratory corrections reduced the bias in the data, but only partially mitigated the high variability. Eight months of field co-location data at Oldtown were used to develop a gain-offset calibration model, recast as a multiple linear regression. The statistical model offered substantial improvement in prediction quality over the raw or lab-corrected data. The results were robust to the choice of the low-cost monitor used for field-calibration, as well as to different seasonal choices of training period. The raw, lab-corrected and statistically-calibrated data were evaluated for a period of two months following the training period. The statistical model had the highest agreement with the reference data, producing a 24-hour average root-mean-square-error (RMSE) of around 2 µg m -3. To assess transferability of the calibration equations to other monitors in the network, a cross-site evaluation was conducted at a second co-location site in suburban Essex, MD. The statistically calibrated data once again produced the lowest RMSE. The calibrated PM2.5 readings from the monitors in the low-cost network provided insights into the intra-urban spatiotemporal variations of PM2.5 in Baltimore.

11.

Mapping local and global variability in plant trait distributions.

Butler, Ethan E; Datta, Abhirup; Flores-Moreno, Habacuc; Chen, Ming; Wythers, Kirk R; Fazayeli, Farideh; Banerjee, Arindam; Atkin, Owen K; Kattge, Jens; Amiaud, Bernard; Blonder, Benjamin; Boenisch, Gerhard; Bond-Lamberty, Ben; Brown, Kerry A; Byun, Chaeho; Campetella, Giandiego; Cerabolini, Bruno E L; Cornelissen, Johannes H C; Craine, Joseph M; Craven, Dylan; de Vries, Franciska T; Díaz, Sandra; Domingues, Tomas F; Forey, Estelle; González-Melo, Andrés; Gross, Nicolas; Han, Wenxuan; Hattingh, Wesley N; Hickler, Thomas; Jansen, Steven; Kramer, Koen; Kraft, Nathan J B; Kurokawa, Hiroko; Laughlin, Daniel C; Meir, Patrick; Minden, Vanessa; Niinemets, Ülo; Onoda, Yusuke; Peñuelas, Josep; Read, Quentin; Sack, Lawren; Schamp, Brandon; Soudzilovskaia, Nadejda A; Spasojevic, Marko J; Sosinski, Enio; Thornton, Peter E; Valladares, Fernando; van Bodegom, Peter M; Williams, Mathew; Wirth, Christian.

Proc Natl Acad Sci U S A ; 114(51): E10937-E10946, 2017 12 19.

Article in English | MEDLINE | ID: mdl-29196525

ABSTRACT

Our ability to understand and predict the response of ecosystems to a changing environment depends on quantifying vegetation functional diversity. However, representing this diversity at the global scale is challenging. Typically, in Earth system models, characterization of plant diversity has been limited to grouping related species into plant functional types (PFTs), with all trait variation in a PFT collapsed into a single mean value that is applied globally. Using the largest global plant trait database and state of the art Bayesian modeling, we created fine-grained global maps of plant trait distributions that can be applied to Earth system models. Focusing on a set of plant traits closely coupled to photosynthesis and foliar respiration-specific leaf area (SLA) and dry mass-based concentrations of leaf nitrogen ([Formula: see text]) and phosphorus ([Formula: see text]), we characterize how traits vary within and among over 50,000 [Formula: see text]-km cells across the entire vegetated land surface. We do this in several ways-without defining the PFT of each grid cell and using 4 or 14 PFTs; each model's predictions are evaluated against out-of-sample data. This endeavor advances prior trait mapping by generating global maps that preserve variability across scales by using modern Bayesian spatial statistical modeling in combination with a database over three times larger than that in previous analyses. Our maps reveal that the most diverse grid cells possess trait variability close to the range of global PFT means.

Subject(s)

Ecosystem , Plants , Quantitative Trait, Heritable , Environment , Geography , Models, Statistical , Plant Dispersal , Spatial Analysis

12.

Spatial Factor Models for High-Dimensional and Large Spatial Data: An Application in Forest Variable Mapping.

Taylor-Rodriguez, Daniel; Finley, Andrew O; Datta, Abhirup; Babcock, Chad; Andersen, Hans-Erik; Cook, Bruce D; Morton, Douglas C; Banerjee, Sudipto.

Stat Sin ; 29: 1155-1180, 2019.

Article in English | MEDLINE | ID: mdl-33311955

ABSTRACT

Gathering information about forest variables is an expensive and arduous activity. As such, directly collecting the data required to produce high-resolution maps over large spatial domains is infeasible. Next generation collection initiatives of remotely sensed Light Detection and Ranging (LiDAR) data are specifically aimed at producing complete-coverage maps over large spatial domains. Given that LiDAR data and forest characteristics are often strongly correlated, it is possible to make use of the former to model, predict, and map forest variables over regions of interest. This entails dealing with the high-dimensional (~102) spatially dependent LiDAR outcomes over a large number of locations (~105-106). With this in mind, we develop the Spatial Factor Nearest Neighbor Gaussian Process (SF-NNGP) model, and embed it in a two-stage approach that connects the spatial structure found in LiDAR signals with forest variables. We provide a simulation experiment that demonstrates inferential and predictive performance of the SF-NNGP, and use the two-stage modeling strategy to generate complete-coverage maps of forest variables with associated uncertainty over a large region of boreal forests in interior Alaska.

13.

Estimating Sizes of Key Populations at the National Level: Considerations for Study Design and Analysis.

Edwards, Jessie K; Hileman, Sarah; Donastorg, Yeycy; Zadrozny, Sabrina; Baral, Stefan; Hargreaves, James R; Fearon, Elizabeth; Zhao, Jinkou; Datta, Abhirup; Weir, Sharon S.

Epidemiology ; 29(6): 795-803, 2018 11.

Article in English | MEDLINE | ID: mdl-30119057

ABSTRACT

BACKGROUND: National estimates of the sizes of key populations, including female sex workers, men who have sex with men, and transgender women are critical to inform national and international responses to the HIV pandemic. However, epidemiologic studies typically provide size estimates for only limited high priority geographic areas. This article illustrates a two-stage approach to obtain a national key population size estimate in the Dominican Republic using available estimates and publicly available contextual information. METHODS: Available estimates of key population size in priority areas were augmented with targeted additional data collection in other areas. To combine information from data collected at each stage, we used statistical methods for handling missing data, including inverse probability weights, multiple imputation, and augmented inverse probability weights. RESULTS: Using the augmented inverse probability weighting approach, which provides some protection against parametric model misspecification, we estimated that 3.7% (95% CI = 2.9, 4.7) of the total population of women in the Dominican Republic between the ages of 15 and 49 years were engaged in sex work, 1.2% (95% CI = 1.1, 1.3) of men aged 15-49 had sex with other men, and 0.19% (95% CI = 0.17, 0.21) of people assigned the male sex at birth were transgender. CONCLUSIONS: Viewing the size estimation of key populations as a missing data problem provides a framework for articulating and evaluating the assumptions necessary to obtain a national size estimate. In addition, this paradigm allows use of methods for missing data familiar to epidemiologists.

Subject(s)

Demography/methods , Population Density , Adolescent , Adult , Data Interpretation, Statistical , Dominican Republic/epidemiology , Epidemiologic Measurements , Female , Homosexuality, Male/statistics & numerical data , Humans , Male , Middle Aged , Research Design , Sex Work/statistics & numerical data , Transgender Persons/statistics & numerical data , Young Adult

14.

Modeling Multivariate Spatial Dependencies Using Graphical Models.

Dey, Debangan; Datta, Abhirup; Banerjee, Sudipto.

N Engl J Stat Data Sci ; 1(2): 283-295, 2023 Sep.

Article in English | MEDLINE | ID: mdl-37817840

ABSTRACT

Graphical models have witnessed significant growth and usage in spatial data science for modeling data referenced over a massive number of spatial-temporal coordinates. Much of this literature has focused on a single or relatively few spatially dependent outcomes. Recent attention has focused upon addressing modeling and inference for substantially large number of outcomes. While spatial factor models and multivariate basis expansions occupy a prominent place in this domain, this article elucidates a recent approach, graphical Gaussian Processes, that exploits the notion of conditional independence among a very large number of spatial processes to build scalable graphical models for fully model-based Bayesian analysis of multivariate spatial data.

15.

Identifying optimal co-location calibration periods for low-cost sensors.

Zamora, Misti Levy; Buehler, Colby; Datta, Abhirup; Gentner, Drew R; Koehler, Kirsten.

Atmos Meas Tech ; 16(1): 169-179, 2023.

Article in English | MEDLINE | ID: mdl-37323467

ABSTRACT

Low-cost sensors are often co-located with reference instruments to assess their performance and establish calibration equations, but limited discussion has focused on whether the duration of this calibration period can be optimized. We placed a multipollutant monitor that contained sensors that measure particulate matter smaller than 2.5 µm (PM2.5), carbon monoxide (CO), nitrogen dioxide (NO2), ozone (O3), and nitric oxide (NO) at a reference field site for one year. We developed calibration equations using randomly selected co-location subsets spanning 1 to 180 consecutive days out of the 1-year period and compared the potential root mean square errors (RMSE) and Pearson correlation coefficients (r). The co-located calibration period required to obtain consistent results varied by sensor type, and several factors increased the co-location duration required for accurate calibration, including the response of a sensor to environmental factors, such as temperature or relative humidity (RH), or cross-sensitivities to other pollutants. Using measurements from Baltimore, MD, where a broad range of environmental conditions may be observed over a given year, we found diminishing improvements in the median RMSE for calibration periods longer than about six weeks for all the sensors. The best performing calibration periods were the ones that contained a range of environmental conditions similar to those encountered during the evaluation period (i.e., all other days of the year not used in the calibration). With optimal, varying conditions it was possible to obtain an accurate calibration in as little as one week for all sensors, suggesting that co-location can be minimized if the period is strategically selected and monitored so that the calibration period is representative of the desired measurement setting.

16.

A DYNAMIC SPATIAL FILTERING APPROACH TO MITIGATE UNDERESTIMATION BIAS IN FIELD CALIBRATED LOW-COST SENSOR AIR POLLUTION DATA.

Heffernan, Claire; PenG, Roger; Gentner, Drew R; Koehler, Kirsten; Datta, Abhirup.

Ann Appl Stat ; 17(4): 3056-3087, 2023 Dec.

Article in English | MEDLINE | ID: mdl-38646662

ABSTRACT

Low-cost air pollution sensors, offering hyper-local characterization of pollutant concentrations, are becoming increasingly prevalent in environmental and public health research. However, low-cost air pollution data can be noisy, biased by environmental conditions, and usually need to be field-calibrated by collocating low-cost sensors with reference-grade instruments. We show, theoretically and empirically, that the common procedure of regression-based calibration using collocated data systematically underestimates high air pollution concentrations, which are critical to diagnose from a health perspective. Current calibration practices also often fail to utilize the spatial correlation in pollutant concentrations. We propose a novel spatial filtering approach to collocation-based calibration of low-cost networks that mitigates the underestimation issue by using an inverse regression. The inverse-regression also allows for incorporating spatial correlations by a second-stage model for the true pollutant concentrations using a conditional Gaussian Process. Our approach works with one or more collocated sites in the network and is dynamic, leveraging spatial correlation with the latest available reference data. Through extensive simulations, we demonstrate how the spatial filtering substantially improves estimation of pollutant concentrations, and measures peak concentrations with greater accuracy. We apply the methodology for calibration of a low-cost PM2.5 network in Baltimore, Maryland, and diagnose air pollution peaks that are missed by the regression-calibration.

17.

nnSVG for the scalable identification of spatially variable genes using nearest-neighbor Gaussian processes.

Weber, Lukas M; Saha, Arkajyoti; Datta, Abhirup; Hansen, Kasper D; Hicks, Stephanie C.

Nat Commun ; 14(1): 4059, 2023 07 10.

Article in English | MEDLINE | ID: mdl-37429865

ABSTRACT

Feature selection to identify spatially variable genes or other biologically informative genes is a key step during analyses of spatially-resolved transcriptomics data. Here, we propose nnSVG, a scalable approach to identify spatially variable genes based on nearest-neighbor Gaussian processes. Our method (i) identifies genes that vary in expression continuously across the entire tissue or within a priori defined spatial domains, (ii) uses gene-specific estimates of length scale parameters within the Gaussian process models, and (iii) scales linearly with the number of spatial locations. We demonstrate the performance of our method using experimental data from several technological platforms and simulations. A software implementation is available at https://bioconductor.org/packages/nnSVG .

Subject(s)

Gene Expression Profiling , Software , Cluster Analysis , Normal Distribution

18.

Laboratory and field evaluation of a low-cost methane sensor and key environmental factors for sensor calibration.

Lin, Joyce J Y; Buehler, Colby; Datta, Abhirup; Gentner, Drew R; Koehler, Kirsten; Zamora, Misti Levy.

Environ Sci Atmos ; 3(4): 683-694, 2023 Apr 13.

Article in English | MEDLINE | ID: mdl-37063944

ABSTRACT

Low-cost sensors enable finer-scale spatiotemporal measurements within the existing methane (CH4) monitoring infrastructure and could help cities mitigate CH4 emissions to meet their climate goals. While initial studies of low-cost CH4 sensors have shown potential for effective CH4 measurement at ambient concentrations, sensor deployment remains limited due to questions about interferences and calibration across environments and seasons. This study evaluates sensor performance across seasons with specific attention paid to the sensor's understudied carbon monoxide (CO) interferences and environmental dependencies through long-term ambient co-location in an urban environment. The sensor was first evaluated in a laboratory using chamber calibration and co-location experiments, and then in the field through two 8 week co-locations with a reference CH4 instrument. In the laboratory, the sensor was sensitive to CH4 concentrations below ambient background concentrations. Different sensor units responded similarly to changing CH4, CO, temperature, and humidity conditions but required individual calibrations to account for differences in sensor response factors. When deployed in-field, co-located with a reference instrument near Baltimore, MD, the sensor captured diurnal trends in hourly CH4 concentration after corrections for temperature, absolute humidity, CO concentration, and hour of day. Variable performance was observed across seasons with the sensor performing well (R 2 = 0.65; percent bias 3.12%; RMSE 0.10 ppm) in the winter validation period and less accurately (R 2 = 0.12; percent bias 3.01%; RMSE 0.08 ppm) in the summer validation period where there was less dynamic range in CH4 concentrations. The results highlight the utility of sensor deployment in more variable ambient CH4 conditions and demonstrate the importance of accounting for temperature and humidity dependencies as well as co-located CO concentrations with low-cost CH4 measurements. We show this can be addressed via Multiple Linear Regression (MLR) models accounting for key covariates to enable urban measurements in areas with CH4 enhancement. Together with individualized calibration prior to deployment, the sensor shows promise for use in low-cost sensor networks and represents a valuable supplement to existing monitoring strategies to identify CH4 hotspots.

19.

Multi-Cause Calibration of Verbal Autopsy-Based Cause-Specific Mortality Estimates of Children and Neonates in Mozambique.

Gilbert, Brian; Fiksel, Jacob; Wilson, Emily; Kalter, Henry; Kante, Almamy; Akum, Aveika; Blau, Dianna; Bassat, Quique; Macicame, Ivalda; Samo Gudo, Eduardo; Black, Robert; Zeger, Scott; Amouzou, Agbessi; Datta, Abhirup.

Am J Trop Med Hyg ; 108(5_Suppl): 78-89, 2023 05 02.

Article in English | MEDLINE | ID: mdl-37037430

ABSTRACT

The Countrywide Mortality Surveillance for Action platform is collecting verbal autopsy (VA) records from a nationally representative sample in Mozambique. These records are used to estimate the national and subnational cause-specific mortality fractions (CSMFs) for children (1-59 months) and neonates (1-28 days). Cross-tabulation of VA-based cause-of-death (COD) determination against that from the minimally invasive tissue sampling (MITS) from the Child Health and Mortality Prevention project revealed important misclassification errors for all the VA algorithms, which if not accounted for will lead to bias in the estimates of CSMF from VA. A recently proposed Bayesian VA-calibration method is used that accounts for this misclassification bias and produces calibrated estimates of CSMF. Both the VA-COD and the MITS-COD can be multi-cause (i.e., suggest more than one probable COD for some of the records). To fully use this probabilistic COD data, we use the multi-cause VA calibration. Two different computer-coded VA algorithms are considered-InSilicoVA and EAVA-and the final CSMF estimates are obtained using an ensemble calibration that uses data from both the algorithms. The calibrated estimates consistently offer a better fit to the data and reveal important changes in the CSMF for both children and neonates in Mozambique after accounting for VA misclassification bias.

Subject(s)

Death , Infant, Newborn , Humans , Child , Autopsy , Cause of Death , Mozambique/epidemiology , Bayes Theorem , Calibration

20.

Correcting for Verbal Autopsy Misclassification Bias in Cause-Specific Mortality Estimates.

Fiksel, Jacob; Gilbert, Brian; Wilson, Emily; Kalter, Henry; Kante, Almamy; Akum, Aveika; Blau, Dianna; Bassat, Quique; Macicame, Ivalda; Samo Gudo, Eduardo; Black, Robert; Zeger, Scott; Amouzou, Agbessi; Datta, Abhirup.

Am J Trop Med Hyg ; 108(5_Suppl): 66-77, 2023 05 02.

Article in English | MEDLINE | ID: mdl-37037438

ABSTRACT

Verbal autopsies (VAs) are extensively used to determine cause of death (COD) in many low- and middle-income countries. However, COD determination from VA can be inaccurate. Computer coded verbal autopsy (CCVA) algorithms used for this task are imperfect and misclassify COD for a large proportion of deaths. If not accounted for, this misclassification leads to biased estimates of cause-specific mortality fractions (CSMFs), a critical piece in health-policy making. Recent work has demonstrated that the knowledge of the CCVA misclassification rates can be used to calibrate raw VA-based CSMF estimates to account for the misclassification bias. In this manuscript, we review the current practices and issues with raw COD predictions from CCVA algorithms and provide a complete primer on how to use the VA calibration approach with the calibratedVA software to correct for verbal autopsy misclassification bias in cause-specific mortality estimates. We use calibratedVA to obtain CSMFs for child (1-59 months) and neonatal deaths using VA data from the Countrywide Mortality Surveillance for Action project in Mozambique.

Subject(s)

Algorithms , Software , Child , Infant, Newborn , Humans , Autopsy , Cause of Death , Mozambique , Mortality

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL