Results 1 - 19 of 19
1.
Entropy (Basel) ; 25(4)2023 Apr 14.
Article in English | MEDLINE | ID: mdl-37190449

ABSTRACT

We propose definitions of fairness in machine learning and artificial intelligence systems that are informed by the framework of intersectionality, a critical lens from the legal, social science, and humanities literature which analyzes how interlocking systems of power and oppression affect individuals along overlapping dimensions including gender, race, sexual orientation, class, and disability. We show that our criteria behave sensibly for any subset of the set of protected attributes, and we prove economic, privacy, and generalization guarantees. Our theoretical results show that our criteria meaningfully operationalize AI fairness in terms of real-world harms, making the measurements interpretable in a manner analogous to differential privacy. We provide a simple learning algorithm using deterministic gradient methods, which respects our intersectional fairness criteria. The measurement of fairness becomes statistically challenging in the minibatch setting due to data sparsity, which increases rapidly with the number of protected attributes and the number of values per attribute. To address this, we further develop a practical learning algorithm using stochastic gradient methods which incorporates stochastic estimation of the intersectional fairness criteria on minibatches to scale up to big data. Case studies on census data, the COMPAS criminal recidivism dataset, the HHP hospitalization data, and a loan application dataset from HMDA demonstrate the utility of our methods.
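The abstract does not spell out the criteria's exact form; as a hedged illustration, a ratio-style intersectional measure in the spirit of differential fairness can be estimated from predictions. The function name and smoothing choice below are illustrative, not the paper's definition:

```python
import numpy as np

def fairness_epsilon(y_pred, groups, smoothing=1.0):
    """Worst-case log-ratio of positive-prediction rates between any pair
    of intersectional groups (smaller is fairer). A smoothing term keeps
    sparse intersectional subgroups from blowing the ratio up."""
    rates = []
    for g in np.unique(groups):
        mask = groups == g
        # smoothed positive-prediction rate for this subgroup
        rates.append((y_pred[mask].sum() + smoothing) / (mask.sum() + 2 * smoothing))
    log_r = np.log(np.array(rates))
    return float(np.max(log_r) - np.min(log_r))

# toy example: two binary protected attributes -> 4 intersectional groups
y = np.array([1, 1, 0, 0, 1, 0, 1, 1])
g = np.array([0, 0, 1, 1, 2, 2, 3, 3])
eps = fairness_epsilon(y, g)
```

The data-sparsity issue the abstract mentions shows up here directly: as groups multiply, each `mask.sum()` shrinks and the smoothed rates dominate the estimate.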

2.
Entropy (Basel) ; 25(5)2023 May 22.
Article in English | MEDLINE | ID: mdl-37238580

ABSTRACT

Large corporations, government entities and institutions such as hospitals and census bureaus routinely collect our personal and sensitive information for providing services. A key technological challenge is designing algorithms for these services that provide useful results, while simultaneously maintaining the privacy of the individuals whose data are being shared. Differential privacy (DP) is a cryptographically motivated and mathematically rigorous approach for addressing this challenge. Under DP, a randomized algorithm provides privacy guarantees by approximating the desired functionality, leading to a privacy-utility trade-off. Strong (pure DP) privacy guarantees are often costly in terms of utility. Motivated by the need for a more efficient mechanism with better privacy-utility trade-off, we propose Gaussian FM, an improvement to the functional mechanism (FM) that offers higher utility at the expense of a weakened (approximate) DP guarantee. We analytically show that the proposed Gaussian FM algorithm can offer orders of magnitude smaller noise compared to the existing FM algorithms. We further extend our Gaussian FM algorithm to decentralized-data settings by incorporating the CAPE protocol and propose capeFM. Our method can offer the same level of utility as its centralized counterparts for a range of parameter choices. We empirically show that our proposed algorithms outperform existing state-of-the-art approaches on synthetic and real datasets.
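For context on the privacy-utility trade-off described above, the classical noise calibrations for the pure-DP Laplace mechanism and the approximate-DP Gaussian mechanism can be compared directly. This is the textbook bound, not the paper's tighter Gaussian FM analysis; function names are illustrative:

```python
import math

def laplace_scale(sensitivity_l1, eps):
    # pure eps-DP Laplace mechanism: scale b = delta_1 / eps
    return sensitivity_l1 / eps

def gaussian_sigma(sensitivity_l2, eps, delta):
    # classical (eps, delta)-DP Gaussian mechanism:
    # sigma = delta_2 * sqrt(2 ln(1.25/delta)) / eps
    return sensitivity_l2 * math.sqrt(2 * math.log(1.25 / delta)) / eps

# a d-dimensional sum query over bounded rows:
# L1 sensitivity grows like d, L2 sensitivity like sqrt(d)
d = 100
b = laplace_scale(sensitivity_l1=float(d), eps=1.0)
sigma = gaussian_sigma(sensitivity_l2=math.sqrt(d), eps=1.0, delta=1e-5)
```

For high-dimensional queries the per-coordinate Gaussian noise is much smaller than the Laplace noise, which is the kind of gap a weakened (approximate) DP guarantee buys.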

3.
Hum Brain Mapp ; 43(7): 2289-2310, 2022 05.
Article in English | MEDLINE | ID: mdl-35243723

ABSTRACT

Privacy concerns for rare disease data, institutional or IRB policies, and limited access to local computational or storage resources or download capabilities are among the reasons that may preclude analyses that pool data at a single site. A growing number of multisite projects and consortia were formed to function in the federated environment to conduct productive research under constraints of this kind. In this scenario, a quality control tool that visualizes decentralized data in its entirety via global aggregation of local computations is especially important, as it would allow the screening of samples that cannot be jointly evaluated otherwise. To address this, we present two algorithms: decentralized data stochastic neighbor embedding, dSNE, and its differentially private counterpart, DP-dSNE. We leverage publicly available datasets to simultaneously map data samples located at different sites according to their similarities. Even though the data never leaves the individual sites, dSNE does not provide any formal privacy guarantees. To overcome this, we rely on differential privacy: a formal mathematical guarantee that protects individuals from being identified as contributors to a dataset. We implement DP-dSNE with AdaCliP, a method recently proposed to add less noise to the gradients per iteration. We introduce metrics for measuring the embedding quality and validate our algorithms on these metrics against their centralized counterpart on two toy datasets. Our validation on six multisite neuroimaging datasets shows promising results for the quality control tasks of visualization and outlier detection, highlighting the potential of our private, decentralized visualization approach.
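DP-dSNE relies on AdaCliP for its noisy gradients; as a simplified, hedged stand-in, the core clip-then-noise aggregation step looks like the following. AdaCliP's adaptive per-coordinate scaling is omitted, and the helper name is illustrative:

```python
import numpy as np

def private_gradient(per_sample_grads, clip_norm, sigma, rng):
    """One DP-SGD-style aggregation step: clip each per-sample gradient to
    L2 norm `clip_norm`, sum, add Gaussian noise, and average. (AdaCliP,
    used by DP-dSNE, additionally rescales coordinates adaptively; that
    refinement is omitted here.)"""
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, sigma * clip_norm, size=total.shape)
    return (total + noise) / len(per_sample_grads)

rng = np.random.default_rng(0)
grads = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]  # norms 5.0 and 0.5
g_priv = private_gradient(grads, clip_norm=1.0, sigma=0.0, rng=rng)
```

With `sigma=0.0` the result is just the clipped average, which makes the clipping effect easy to check before turning noise on.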


Subjects
Algorithms, Privacy, Humans, Neuroimaging, Quality Control, Research Design
4.
IEEE Trans Signal Process ; 69: 6355-6370, 2021.
Article in English | MEDLINE | ID: mdl-35755147

ABSTRACT

Blind source separation algorithms such as independent component analysis (ICA) are widely used in the analysis of neuroimaging data. To leverage larger sample sizes, different data holders/sites may wish to collaboratively learn feature representations. However, such datasets are often privacy-sensitive, precluding centralized analyses that pool the data at one site. In this work, we propose a differentially private algorithm for performing ICA in a decentralized data setting. Due to the high dimension and small sample size, conventional approaches to decentralized differentially private algorithms suffer in terms of utility. When centralizing the data is not possible, we investigate the benefit of enabling limited collaboration in the form of generating jointly distributed random noise. We show that such (anti) correlated noise improves the privacy-utility trade-off, and can reach the same level of utility as the corresponding non-private algorithm for certain parameter choices. We validate this benefit using synthetic and real neuroimaging datasets. We conclude that it is possible to achieve meaningful utility while preserving privacy, even in complex signal processing systems.
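A hedged sketch of the jointly generated noise idea: sites draw correlated terms that cancel in the aggregate, plus a small independent term, so the combined result is far less noisy than independent per-site noise would be. Parameter names and the calibration are illustrative, not the paper's protocol:

```python
import numpy as np

def correlated_site_noise(num_sites, big_sigma, small_sigma, rng):
    """Each site i releases its value plus e_i + f_i, where the e_i are
    jointly drawn to sum to zero (masking individual site releases) and
    f_i is a small independent term. The e_i cancel in the aggregate."""
    e = rng.normal(0.0, big_sigma, num_sites)
    e -= e.mean()                       # enforce sum(e) == 0
    f = rng.normal(0.0, small_sigma, num_sites)
    return e + f

rng = np.random.default_rng(1)
local_values = np.array([1.0, 2.0, 3.0, 4.0])
noise = correlated_site_noise(4, big_sigma=10.0, small_sigma=0.01, rng=rng)
noisy_sum = (local_values + noise).sum()
# aggregate error is on the order of small_sigma, not big_sigma
```

Each individual site's release is masked by noise of scale `big_sigma`, yet the aggregate error scales only with `small_sigma`, which is the privacy-utility improvement the abstract describes.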

5.
Neuroimage ; 186: 557-569, 2019 02 01.
Article in English | MEDLINE | ID: mdl-30408598

ABSTRACT

The field of neuroimaging has recently witnessed a strong shift towards data sharing; however, current collaborative research projects may be unable to leverage institutional architectures that collect and store data in local, centralized data centers. Additionally, though research groups are willing to grant access for collaborations, they often wish to maintain control of their data locally. These concerns may stem from research culture as well as privacy and accountability concerns. In order to leverage the potential of these aggregated larger data sets, we require tools that perform joint analyses without transmitting the data. Ideally, these tools would have similar performance and ease of use as their current centralized counterparts. In this paper, we propose and evaluate a new algorithm, decentralized joint independent component analysis (djICA), which meets these technical requirements. djICA shares only intermediate statistics about the data, plausibly retaining the privacy of the raw information at local sites, thus making it amenable to further privacy protections, for example via differential privacy. We validate our method on real functional magnetic resonance imaging (fMRI) data and show that it enables collaborative large-scale temporal ICA of fMRI, a rich vein of analysis as yet largely unexplored, which can benefit from the larger-N studies enabled by a decentralized approach. We show that djICA is robust to different distributions of data over sites, and that the temporal components estimated with djICA show activations similar to the temporal functional modes analyzed in previous work, thus solidifying djICA as a new, decentralized method oriented toward the frontiers of temporal independent component analysis.


Subjects
Algorithms, Brain/physiology, Functional Neuroimaging/methods, Magnetic Resonance Imaging/methods, Models, Theoretical, Adult, Brain/diagnostic imaging, Humans, Image Processing, Computer-Assisted/methods
6.
Med Care ; 51(8 Suppl 3): S58-65, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23774511

ABSTRACT

OBJECTIVE: Effective data sharing is critical for comparative effectiveness research (CER), but there are significant concerns about inappropriate disclosure of patient data. These concerns have spurred the development of new technologies for privacy-preserving data sharing and data mining. Our goal is to review existing and emerging techniques that may be appropriate for data sharing related to CER. MATERIALS AND METHODS: We adapted a systematic review methodology to comprehensively search the research literature. We searched 7 databases and applied 3 stages of filtering based on titles, abstracts, and full text to identify those works most relevant to CER. RESULTS: On the basis of agreement and using the arbitration of a third-party expert, we selected 97 articles for meta-analysis. Our findings are organized along major types of data sharing in CER applications (ie, institution-to-institution, institution hosted, and public release). We made recommendations based on specific scenarios. LIMITATION: We limited the scope of our study to methods that demonstrated practical impact, eliminating many theoretical studies of privacy that have been surveyed elsewhere. We further limited our study to data sharing for data tables, rather than complex genomic, set valued, time series, text, image, or network data. CONCLUSION: State-of-the-art privacy-preserving technologies can guide the development of practical tools that will scale up the CER studies of the future. However, many challenges remain in this fast-moving field in terms of practical evaluations and applications to a wider range of data types.


Subjects
Comparative Effectiveness Research/organization & administration, Confidentiality, Software, Cooperative Behavior, Databases, Bibliographic, Humans
7.
Front Neuroinform ; 17: 1207721, 2023.
Article in English | MEDLINE | ID: mdl-37404336

ABSTRACT

Collaborative neuroimaging research is often hindered by technological, policy, administrative, and methodological barriers, despite the abundance of available data. COINSTAC (The Collaborative Informatics and Neuroimaging Suite Toolkit for Anonymous Computation) is a platform that successfully tackles these challenges through federated analysis, allowing researchers to analyze datasets without publicly sharing their data. This paper presents a significant enhancement to the COINSTAC platform: COINSTAC Vaults (CVs). CVs are designed to further reduce barriers by hosting standardized, persistent, and highly available datasets, while seamlessly integrating with COINSTAC's federated analysis capabilities. CVs offer a user-friendly interface for self-service analysis, streamlining collaboration and eliminating the need for manual coordination with data owners. Importantly, CVs can also be used in conjunction with open data, by simply creating a CV hosting the open data one would like to include in the analysis, thus filling an important gap in the data sharing ecosystem. We demonstrate the impact of CVs through several functional and structural neuroimaging studies utilizing federated analysis, showcasing their potential to improve the reproducibility of research and increase sample sizes in neuroimaging studies.

8.
bioRxiv ; 2023 May 11.
Article in English | MEDLINE | ID: mdl-37214791

ABSTRACT

Collaborative neuroimaging research is often hindered by technological, policy, administrative, and methodological barriers, despite the abundance of available data. COINSTAC is a platform that successfully tackles these challenges through federated analysis, allowing researchers to analyze datasets without publicly sharing their data. This paper presents a significant enhancement to the COINSTAC platform: COINSTAC Vaults (CVs). CVs are designed to further reduce barriers by hosting standardized, persistent, and highly available datasets, while seamlessly integrating with COINSTAC's federated analysis capabilities. CVs offer a user-friendly interface for self-service analysis, streamlining collaboration and eliminating the need for manual coordination with data owners. Importantly, CVs can also be used in conjunction with open data, by simply creating a CV hosting the open data one would like to include in the analysis, thus filling an important gap in the data sharing ecosystem. We demonstrate the impact of CVs through several functional and structural neuroimaging studies utilizing federated analysis, showcasing their potential to improve the reproducibility of research and increase sample sizes in neuroimaging studies.

9.
Neuroinformatics ; 21(2): 287-301, 2023 04.
Article in English | MEDLINE | ID: mdl-36434478

ABSTRACT

With the growth of decentralized/federated analysis approaches in neuroimaging, the opportunities to study brain disorders using data from multiple sites have grown multi-fold. One such initiative is Neuromark, a fully automated, spatially constrained independent component analysis (ICA) pipeline that is used to link brain network abnormalities among different datasets, studies, and disorders while leveraging subject-specific networks. In this study, we implement the Neuromark pipeline in COINSTAC, an open-source neuroimaging framework for collaborative/decentralized analysis. Decentralized exploratory analysis of nearly 2000 resting-state functional magnetic resonance imaging datasets collected at different sites across two cohorts and co-located in different countries was performed to study the resting brain functional network connectivity changes in adolescents who smoke and consume alcohol. Results showed hypoconnectivity across the majority of networks, including sensory, default mode, and subcortical domains, more pronounced for alcohol than for smoking, as well as decreased low-frequency power. These findings suggest that globally reduced synchronization is associated with both tobacco and alcohol use. This proof-of-concept work demonstrates the utility and incentives associated with large-scale decentralized collaborations spanning multiple sites.


Subjects
Brain, Magnetic Resonance Imaging, Humans, Adolescent, Neural Pathways/diagnostic imaging, Brain/diagnostic imaging, Alcohol Drinking, Ethanol, Smoking, Brain Mapping
10.
Neuroinformatics ; 20(4): 981-990, 2022 10.
Article in English | MEDLINE | ID: mdl-35380365

ABSTRACT

Recent studies have demonstrated that neuroimaging data can be used to estimate biological brain age, as it captures information about the neuroanatomical and functional changes the brain undergoes during development and the aging process. However, researchers often have limited access to neuroimaging data because of its challenging and expensive acquisition process, thereby limiting the effectiveness of the predictive model. Decentralized models provide a way to build more accurate and generalizable prediction models, bypassing the traditional data-sharing methodology. In this work, we propose a decentralized method for biological brain age estimation using support vector regression models and evaluate it on three different feature sets, including both volumetric and voxelwise structural MRI data as well as resting functional MRI data. The results demonstrate that our decentralized brain age regression models can achieve similar performance compared to the models trained with all the data in one location.


Subjects
Brain, Magnetic Resonance Imaging, Magnetic Resonance Imaging/methods, Brain/diagnostic imaging, Neuroimaging/methods
11.
Neuroinformatics ; 19(4): 553-566, 2021 10.
Article in English | MEDLINE | ID: mdl-33462781

ABSTRACT

There has been an upward trend in developing frameworks that enable neuroimaging researchers to address challenging questions by leveraging data across multiple sites all over the world. One such open-source framework is the Collaborative Informatics and Neuroimaging Suite Toolkit for Anonymous Computation (COINSTAC), which works on Windows, macOS, and Linux operating systems and leverages containerized analysis pipelines to analyze neuroimaging data stored locally across multiple physical locations without the need for pooling the data at any point during the analysis. In this paper, the COINSTAC team partnered with a data collection consortium to implement the first-ever decentralized voxelwise analysis of brain imaging data performed outside the COINSTAC development group. Decentralized voxel-based morphometry analysis of over 2000 structural magnetic resonance imaging data sets collected at 14 different sites across two cohorts and co-located in different countries was performed to study the structural changes in brain gray matter linked to age, body mass index (BMI), and smoking. Results produced by the decentralized analysis were consistent with and extended previous findings in the literature. In particular, we observed a widespread cortical gray matter reduction (resembling a 'default mode network' pattern) and a hippocampal increase with age, bilateral increases in the hypothalamus and basal ganglia with BMI, and cingulate and thalamic decreases with smoking. This work provides a critical real-world test of the COINSTAC framework in a "Large-N" study. It showcases the potential benefits of performing multivoxel and multivariate analyses of large-scale neuroimaging data located at multiple sites.


Subjects
Age Factors, Body Mass Index, Gray Matter, Neuroimaging, Smoking, Adolescent, Brain/diagnostic imaging, Gray Matter/diagnostic imaging, Humans, Magnetic Resonance Imaging
13.
IEEE J Sel Top Signal Process ; 12(6): 1449-1464, 2018 Dec.
Article in English | MEDLINE | ID: mdl-31595179

ABSTRACT

In many signal processing and machine learning applications, datasets containing private information are held at different locations, requiring the development of distributed privacy-preserving algorithms. Tensor and matrix factorizations are key components of many processing pipelines. In the distributed setting, differentially private algorithms suffer because they introduce noise to guarantee privacy. This paper designs new and improved distributed and differentially private algorithms for two popular matrix and tensor factorization methods: principal component analysis (PCA) and orthogonal tensor decomposition (OTD). The new algorithms employ a correlated noise design scheme to alleviate the effects of noise and can achieve the same noise level as the centralized scenario. Experiments on synthetic and real data illustrate the regimes in which the correlated noise allows performance matching with the centralized setting, outperforming previous methods and demonstrating that meaningful utility is possible while guaranteeing differential privacy.
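As a hedged illustration of the centralized baseline such methods build on, differentially private PCA can be sketched by perturbing the sample covariance with a symmetric Gaussian matrix before eigendecomposition. The calibration of `sigma` to an (epsilon, delta) target and a per-row norm bound is omitted, and this is not the paper's distributed correlated-noise algorithm:

```python
import numpy as np

def dp_pca(X, sigma, k, rng):
    """Differentially private PCA sketch: add a symmetric Gaussian matrix
    to the sample covariance, then take the top-k eigenvectors. The
    eigendecomposition is post-processing, so it costs no extra privacy.
    Calibrating sigma to (eps, delta) is omitted here."""
    n, d = X.shape
    cov = X.T @ X / n
    noise = rng.normal(0.0, sigma, (d, d))
    noise = (noise + noise.T) / 2.0     # symmetrize so eigenvalues stay real
    vals, vecs = np.linalg.eigh(cov + noise)
    return vecs[:, np.argsort(vals)[::-1][:k]]  # top-k eigenvectors

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
U = dp_pca(X, sigma=0.01, k=2, rng=rng)
```

In the distributed setting each site would perturb its own statistics, which is where the noise accumulates and where correlated noise designs recover centralized-level utility.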

14.
Article in English | MEDLINE | ID: mdl-28210428

ABSTRACT

Health data derived from electronic health records are increasingly utilized in large-scale population health analyses. Going hand in hand with this increase in data is an increasing number of data breaches. Ensuring privacy and security of these data is a shared responsibility between the public health researcher, collaborators, and their institutions. In this article, we review the requirements of data privacy and security and discuss epidemiologic implications of emerging technologies from the computer science community that can be used for health data. In order to ensure that our needs as researchers are captured in these technologies, we must engage in the dialogue surrounding the development of these tools.

15.
Front Neurosci ; 10: 365, 2016.
Article in English | MEDLINE | ID: mdl-27594820

ABSTRACT

The field of neuroimaging has embraced the need for sharing and collaboration. Data sharing mandates from public funding agencies and major journal publishers have spurred the development of data repositories and neuroinformatics consortia. However, efficient and effective data sharing still faces several hurdles. For example, open data sharing is on the rise but is not suitable for sensitive data that are not easily shared, such as genetics. Current approaches can be cumbersome (such as negotiating multiple data sharing agreements). There are also significant data transfer, organization and computational challenges. Centralized repositories only partially address the issues. We propose a dynamic, decentralized platform for large scale analyses called the Collaborative Informatics and Neuroimaging Suite Toolkit for Anonymous Computation (COINSTAC). The COINSTAC solution can include data missing from central repositories, allows pooling of both open and "closed" repositories by developing privacy-preserving versions of widely-used algorithms, and incorporates the tools within an easy-to-use platform enabling distributed computation. We present an initial prototype system which we demonstrate on two multi-site data sets, without aggregating the data. In addition, by iterating across sites, the COINSTAC model enables meta-analytic solutions to converge to "pooled-data" solutions (i.e., as if the entire data were in hand). More advanced approaches such as feature generation, matrix factorization models, and preprocessing can be incorporated into such a model. In sum, COINSTAC enables access to the many currently unavailable data sets, a user friendly privacy enabled interface for decentralized analysis, and a powerful solution that complements existing data sharing solutions.
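The decentralized pattern described above, sharing only intermediate statistics yet matching the pooled-data answer, is easiest to see for a closed-form statistic such as the mean. A minimal sketch, not COINSTAC's actual implementation:

```python
import numpy as np

def decentralized_mean(site_data):
    """Each site shares only its (sum, count) pair; the aggregator
    combines them. No raw sample ever leaves a site, yet the result
    equals the pooled-data mean exactly."""
    sums = [x.sum() for x in site_data]
    counts = [len(x) for x in site_data]
    return sum(sums) / sum(counts)

sites = [np.array([1.0, 2.0]), np.array([3.0]), np.array([4.0, 5.0, 6.0])]
m = decentralized_mean(sites)   # equals np.concatenate(sites).mean()
```

Iterative statistics (regression coefficients, ICA mixing matrices) follow the same pattern but need multiple rounds of exchanged summaries to converge to the pooled solution.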

16.
JMLR Workshop Conf Proc ; 2015: 894-902, 2015 Feb.
Article in English | MEDLINE | ID: mdl-26705435

ABSTRACT

We consider learning from data of variable quality that may be obtained from different heterogeneous sources. Addressing learning from heterogeneous data in its full generality is a challenging problem. In this paper, we adopt instead a model in which data is observed through heterogeneous noise, where the noise level reflects the quality of the data source. We study how to use stochastic gradient algorithms to learn in this model. Our study is motivated by two concrete examples where this problem arises naturally: learning with local differential privacy based on data from multiple sources with different privacy requirements, and learning from data with labels of variable quality. The main contribution of this paper is to identify how heterogeneous noise impacts performance. We show that given two datasets with heterogeneous noise, the order in which to use them in standard SGD depends on the learning rate. We propose a method for changing the learning rate as a function of the heterogeneity, and prove new regret bounds for our method in two cases of interest. Experiments on real data show that our method performs better than using a single learning rate and using only the less noisy of the two datasets when the noise level is low to moderate.
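A hedged sketch of the setting: SGD over two data sources with different noise levels, using a per-source learning rate so that smaller steps damp the noisier source. The schedule below is illustrative only; the paper derives the appropriate schedule and the ordering analysis:

```python
import numpy as np

def sgd_two_phase(clean, noisy, w0, lr_clean, lr_noisy):
    """Least-squares SGD over two data sources with per-source learning
    rates. Each element of `clean`/`noisy` is an (x, y) pair for the
    scalar model y ~ w * x."""
    w = w0
    for lr, data in ((lr_clean, clean), (lr_noisy, noisy)):
        for x, y in data:
            w -= lr * (w * x - y) * x   # gradient of 0.5 * (w*x - y)^2
    return w

rng = np.random.default_rng(3)
xs = rng.normal(size=200)
clean = [(x, 2.0 * x + rng.normal(0, 0.01)) for x in xs[:100]]   # low noise
noisy = [(x, 2.0 * x + rng.normal(0, 1.0)) for x in xs[100:]]    # high noise
w = sgd_two_phase(clean, noisy, w0=0.0, lr_clean=0.1, lr_noisy=0.01)
```

The small step size on the noisy source keeps its label noise from undoing the progress made on the clean source, which is the intuition behind adapting the learning rate to the heterogeneity.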

17.
Front Neuroinform ; 8: 35, 2014.
Article in English | MEDLINE | ID: mdl-24778614

ABSTRACT

The growth of data sharing initiatives for neuroimaging and genomics represents an exciting opportunity to confront the "small N" problem that plagues contemporary neuroimaging studies while further understanding the role genetic markers play in the function of the brain. When it is possible, open data sharing provides the most benefits. However, some data cannot be shared at all due to privacy concerns and/or risk of re-identification. Sharing other data sets is hampered by the proliferation of complex data use agreements (DUAs) which preclude truly automated data mining. These DUAs arise because of concerns about the privacy and confidentiality for subjects; though many do permit direct access to data, they often require a cumbersome approval process that can take months. An alternative approach is to only share data derivatives such as statistical summaries: the challenges here are to reformulate computational methods to quantify the privacy risks associated with sharing the results of those computations. For example, a derived map of gray matter is often as identifiable as a fingerprint. Thus, alternative approaches to accessing data are needed. This paper reviews the relevant literature on differential privacy, a framework for measuring and tracking privacy loss in these settings, and demonstrates the feasibility of using this framework to calculate statistics on data distributed at many sites while still providing privacy.
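The site-level pattern the review describes, each site releasing a differentially private summary that is then combined, can be sketched with the Laplace mechanism on a bounded mean. A minimal illustration; the bounds, epsilon value, and function name are assumptions:

```python
import numpy as np

def dp_bounded_mean(x, lo, hi, eps, rng):
    """eps-DP release of the mean of values known to lie in [lo, hi].
    Changing one record moves the mean by at most (hi - lo) / n, which
    is the sensitivity the Laplace scale is calibrated to."""
    x = np.clip(x, lo, hi)
    sens = (hi - lo) / len(x)
    return x.mean() + rng.laplace(0.0, sens / eps)

rng = np.random.default_rng(4)
# three sites each release one private summary of their local data
site_means = [dp_bounded_mean(rng.uniform(0, 1, 500), 0.0, 1.0, eps=1.0, rng=rng)
              for _ in range(3)]
global_estimate = float(np.mean(site_means))
```

Because each site's release is already private, combining the three summaries costs no additional privacy: aggregation is post-processing.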

18.
J Am Med Inform Assoc ; 19(5): 750-7, 2012.
Article in English | MEDLINE | ID: mdl-22511018

ABSTRACT

OBJECTIVE: Today's clinical research institutions provide tools for researchers to query their data warehouses for counts of patients. To protect patient privacy, counts are perturbed before reporting; this sacrifices utility for increased privacy. The goal of this study is to extend current query-answering systems to guarantee a quantifiable level of privacy and allow users to tailor perturbations to maximize the usefulness according to their needs. METHODS: A perturbation mechanism was designed in which users are given options with respect to scale and direction of the perturbation. The mechanism translates the true count, user preferences, and a privacy level within administrator-specified bounds into a probability distribution from which the perturbed count is drawn. RESULTS: Users can significantly impact the scale and direction of the count perturbation and can receive more accurate final cohort estimates. Strong and semantically meaningful differential privacy is guaranteed, providing for a unified privacy accounting system that can support role-based trust levels. This study provides an open source web-enabled tool to investigate visually and numerically the interaction between system parameters, including required privacy level and user preference settings. CONCLUSIONS: Quantifying privacy allows system administrators to provide users with a privacy budget and to monitor its expenditure, enabling users to control the inevitable loss of utility. While current measures of privacy are conservative, this system can take advantage of future advances in privacy measurement. The system provides new ways of trading off privacy and utility that are not provided in current study design systems.
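The paper specifies its own translation from user preferences to a distribution; as a hedged sketch, one simple way to bias the perturbation direction without affecting the DP guarantee is a data-independent shift of a Laplace draw, with rounding and clamping as post-processing. The offset choice and names are illustrative, not the study's mechanism:

```python
import numpy as np

def perturbed_count(true_count, eps, direction, rng):
    """eps-DP count release with a user-chosen bias direction:
    direction = +1 biases upward (e.g. to avoid underestimating cohort
    feasibility), -1 biases downward, 0 is unbiased. The shift is
    data-independent, so the privacy guarantee is unchanged; rounding
    and clamping at zero are post-processing."""
    shift = direction * (1.0 / eps)     # illustrative offset choice
    noisy = true_count + rng.laplace(0.0, 1.0 / eps) + shift
    return max(0, int(round(noisy)))

rng = np.random.default_rng(5)
over = perturbed_count(120, eps=1.0, direction=+1, rng=rng)
under = perturbed_count(120, eps=1.0, direction=-1, rng=rng)
```

Larger eps gives both less noise and a smaller directional offset, mirroring the scale/direction trade-offs the study exposes to users.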


Subjects
Biomedical Research, Confidentiality, Information Storage and Retrieval/methods, Medical Records Systems, Computerized/statistics & numerical data, Humans, Models, Statistical, Research Design, Software, User-Computer Interface
19.
J Mach Learn Res ; 12: 1069-1109, 2011 Mar.
Article in English | MEDLINE | ID: mdl-21892342

ABSTRACT

Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the ε-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006), to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance.
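A hedged sketch of objective perturbation for regularized logistic regression: a random linear term is added to the objective before optimization. Drawing the noise vector `b` from the distribution the privacy proof requires (and verifying the convexity/differentiability conditions) is omitted, so this toy draw does not by itself guarantee epsilon-DP:

```python
import numpy as np

def objective_perturbation_logreg(X, y, lam, b, lr=0.1, steps=500):
    """Minimize (1/n) sum log(1 + exp(-y_i w.x_i)) + (lam/2) ||w||^2
    + (1/n) b.w by plain gradient descent. The random vector b is the
    objective perturbation; its calibration to a target epsilon is
    what the privacy analysis supplies and is omitted here."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        # gradient of the average logistic loss term
        grad_loss = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w -= lr * (grad_loss + lam * w + b / n)
    return w

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 3))
y = np.sign(X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=300))
b = rng.normal(0.0, 1.0, size=3)    # stands in for the calibrated noise
w = objective_perturbation_logreg(X, y, lam=0.1, b=b)
```

Because the perturbation enters the objective rather than the output, the noise is partially absorbed by the optimization, which is the intuition behind its better privacy-utility trade-off over output perturbation.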
