Results 1 - 20 of 77
1.
Proc Natl Acad Sci U S A ; 120(30): e2302028120, 2023 Jul 25.
Article in English | MEDLINE | ID: mdl-37463204

ABSTRACT

How do statistical dependencies in measurement noise influence high-dimensional inference? To answer this, we study the paradigmatic spiked matrix model of principal components analysis (PCA), where a rank-one matrix is corrupted by additive noise. We go beyond the usual independence assumption on the noise entries by drawing the noise from a low-order polynomial orthogonal matrix ensemble. The resulting noise correlations make the setting relevant for applications but analytically challenging. We provide a characterization of the Bayes-optimal limits of inference in this model. If the spike is rotation invariant, we show that standard spectral PCA is optimal. However, for more general priors, both PCA and the existing approximate message-passing algorithm (AMP) fall short of achieving the information-theoretic limits, which we compute using the replica method from statistical physics. We thus propose an AMP, inspired by the theory of adaptive Thouless-Anderson-Palmer equations, which is empirically observed to saturate the conjectured theoretical limit. This AMP comes with a rigorous state evolution analysis tracking its performance. Although we focus on specific noise distributions, our methodology can be generalized to a wide class of trace matrix ensembles at the cost of more involved expressions. Finally, despite the seemingly strong assumption of rotation-invariant noise, our theory empirically predicts algorithmic performance on real data, pointing at strong universality properties.
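Illustrative sketch (not from the article): the spiked matrix model and spectral PCA mentioned above can be tried out in a few lines. The sketch assumes the classical baseline of i.i.d. Gaussian noise, which is exactly the assumption the article relaxes, and a hypothetical normalization `y = (snr / n) * x x^T + z`; the function names and the `snr` parameter are placeholders chosen for this example.

```python
import math
import random


def spiked_matrix(n, snr, rng):
    """Return (y, x) with y = (snr / n) * x x^T + z for a +/-1 spike x.

    z is a symmetric Gaussian matrix with entry variance 1/n. This i.i.d.
    noise is the classical baseline only (an assumption of this sketch);
    the abstract studies correlated, rotation-invariant noise instead.
    """
    x = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    y = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            noise = rng.gauss(0.0, 1.0 / math.sqrt(n))
            y[i][j] = y[j][i] = (snr / n) * x[i] * x[j] + noise
    return y, x


def top_eigenvector(y, rng, iters=200):
    """Power iteration: unit-norm estimate of the largest-magnitude eigenvector."""
    n = len(y)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(row[k] * v[k] for k in range(n)) for row in y]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


def overlap(v, x):
    """Absolute normalized inner product between the estimate v and the spike x."""
    norm_v = math.sqrt(sum(c * c for c in v))
    norm_x = math.sqrt(sum(c * c for c in x))
    return abs(sum(a * b for a, b in zip(v, x))) / (norm_v * norm_x)
```

A call such as `overlap(top_eigenvector(y, rng), x)` after `y, x = spiked_matrix(60, 4.0, random.Random(0))` measures how well plain spectral PCA recovers the spike; for a strong spike the overlap should be well above zero.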

2.
Cancer ; 2024 Apr 25.
Article in English | MEDLINE | ID: mdl-38662502

ABSTRACT

INTRODUCTION: Structured data capture requires defined languages such as minimal Common Oncology Data Elements (mCODE). This pilot assessed the feasibility of capturing 5 mCODE categories (stage, disease status, performance status (PS), intent of therapy, and intent to change therapy). METHODS: A tool (SmartPhrase) using existing and custom structured data elements was built to capture 4 data categories (disease status, PS, intent of therapy, and intent to change therapy) typically documented as free text within notes. Existing functionality for stage was supported by the build. Participant survey data, presence of data (per encounter), and time in chart were collected prior to go-live and at repeat timepoints. The anticipated outcome was capture of >50% sustained over time without undue burden. RESULTS: Pre-intervention (5 weeks before go-live), participants had 1390 encounters (1207 patients). The median percent capture across all participants was 32% for stage; no structured data were available for other categories pre-intervention. During a 6-month pilot with 14 participants across three sites, 4995 encounters (3071 patients) occurred. The median percent capture across all participants and all post-intervention months increased to 64% for stage and 81%-82% for the other data categories. No increase in participant time in chart was noted. Participants reported that data were meaningful to capture. CONCLUSIONS: Using note-based tools, structured data can be captured (1) in real time, (2) in a manner sustained over time, and (3) without undue provider burden. Our system is expanding the pilot, with integration of these data into clinical decision support, practice dashboards, and potential for clinical trial matching.

3.
Molecules ; 29(13)2024 Jul 01.
Article in English | MEDLINE | ID: mdl-38999091

ABSTRACT

In the organic laboratory, the 13C nuclear magnetic resonance (NMR) spectrum of a newly synthesized compound remains an essential step in elucidating its structure. For the chemist, the interpretation of such a spectrum, which is a set of chemical-shift values, is made easier if he/she has a tool capable of predicting with sufficient accuracy the carbon-shift values from the structure he/she intends to prepare. As there are few open-source methods for accurately estimating this property, we applied our graph-machine approach to build models capable of predicting the chemical shifts of carbons. For this study, we focused on benzene compounds, building an optimized model derived from training a database of 10,577 chemical shifts originating from 2026 structures that contain up to ten types of non-carbon atoms, namely H, O, N, S, P, Si, and halogens. It provides a training root-mean-squared relative error (RMSRE) of 0.5%, i.e., a root-mean-squared error (RMSE) of 0.6 ppm, and a mean absolute error (MAE) of 0.4 ppm for estimating the chemical shifts of the 10k carbons. The predictive capability of the graph-machine model is also compared with that of three commercial packages on a dataset of 171 original benzenic structures (1012 chemical shifts). The graph-machine model proves to be very efficient in predicting chemical shifts, with an RMSE of 0.9 ppm, and compares favorably with the RMSEs of 3.4, 1.8, and 1.9 ppm computed with the ChemDraw v. 23.1.1.3, ACD v. 11.01, and MestReNova v. 15.0.1-35756 packages respectively. Finally, a Docker-based tool is proposed to predict the carbon chemical shifts of benzenic compounds solely from their SMILES codes.
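Illustrative sketch (not from the article): the three error metrics quoted above can be computed as follows. The definitions are the usual textbook ones and are an assumption here, since the abstract does not spell them out; in particular, RMSRE is taken as the root mean square of `(predicted - observed) / observed`. The sample values in the usage note are made up.

```python
import math


def rmse(observed, predicted):
    """Root-mean-squared error, in the same unit as the data (ppm for shifts)."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error, in the same unit as the data."""
    n = len(observed)
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / n


def rmsre(observed, predicted):
    """Root-mean-squared relative error (dimensionless; multiply by 100 for %).

    Assumes every observed value is non-zero.
    """
    n = len(observed)
    return math.sqrt(sum(((p - o) / o) ** 2 for o, p in zip(observed, predicted)) / n)
```

For example, with hypothetical observed shifts `[100.0, 200.0]` and predictions `[101.0, 198.0]`, `mae` gives 1.5 ppm and `rmsre` gives 0.01, i.e. 1%.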

4.
Med Ref Serv Q ; 43(2): 196-202, 2024.
Article in English | MEDLINE | ID: mdl-38722609

ABSTRACT

Named entity recognition (NER), in use since the early 1990s, is a powerful computational technique that employs various computing strategies to extract information from raw text input. With rapid advances in AI and computing, NER models have gained significant attention and serve as foundational tools across numerous professional domains to organize unstructured data for research and practical applications. This is particularly evident in the medical and healthcare fields, where NER models are essential for efficiently extracting critical information from complex documents that are challenging to review manually. Despite its successes, NER has limitations in fully comprehending natural language nuances. However, the development of more advanced and user-friendly models promises to significantly improve the work experience of professional users.


Subjects
Information Storage and Retrieval, Natural Language Processing, Information Storage and Retrieval/methods, Humans, Artificial Intelligence
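Illustrative sketch (not from the article): the simplest form of NER is a dictionary lookup that tags known terms in raw text. The lexicon, labels, and function name below are hypothetical; modern NER models are statistical or neural rather than lookup-based, so this only shows the input and output shape of the task.

```python
import re

# Hypothetical toy lexicon; a real system would rely on a trained model
# or a curated vocabulary instead of a hand-written list.
LEXICON = {
    "DRUG": ["aspirin", "metformin"],
    "CONDITION": ["type 2 diabetes", "hypertension"],
}


def find_entities(text):
    """Return (start, end, label, surface) tuples sorted by position in text."""
    found = []
    for label, terms in LEXICON.items():
        for term in terms:
            # Whole-word, case-insensitive match of the lexicon term.
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((match.start(), match.end(), label, match.group(0)))
    return sorted(found)
```

Calling `find_entities("Patient with Type 2 diabetes was started on metformin.")` tags one CONDITION span and one DRUG span. A lookup like this cannot handle misspellings, abbreviations, or context, which is one way to see the limitations the abstract mentions.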
5.
Regul Toxicol Pharmacol ; 142: 105426, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37277057

ABSTRACT

In the European Union, the Chemicals Strategy for Sustainability (CSS) highlights the need to enhance the identification and assessment of substances of concern while reducing animal testing, thus fostering the development and use of New Approach Methodologies (NAMs) such as in silico, in vitro and in chemico. In the United States, the Tox21 strategy aims at shifting toxicological assessments away from traditional animal studies towards target-specific, mechanism-based and biological observations mainly obtained by using NAMs. Many other jurisdictions around the world are also increasing the use of NAMs. Hence, the provision of dedicated non-animal toxicological data and reporting formats as a basis for chemical risk assessment is necessary. Harmonising data reporting is crucial when aiming at re-using and sharing data for chemical risk assessment across jurisdictions. The OECD has developed a series of OECD Harmonised Templates (OHT), which are standard data formats designed for reporting information used for the risk assessment of chemicals relevant to their intrinsic properties, including effects on human health (e.g., toxicokinetics, skin sensitisation, repeated dose toxicity) and the environment (e.g., toxicity to test species and wildlife, biodegradation in soil, metabolism of residues in crops). The objective of this paper is to demonstrate the applicability of the OHT standard format for reporting information under various chemical risk assessment regimes, and to provide users with practical guidance on the use of OHT 201, in particular to report test results on intermediate effects and mechanistic information.


Subjects
Organisation for Economic Co-Operation and Development, Skin, Humans, Risk Assessment/methods
6.
Stat Modelling ; 23(3): 203-227, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37334164

ABSTRACT

Canonical correlation analysis (CCA) is a technique for measuring the association between two multivariate data matrices. A regularized modification of canonical correlation analysis (RCCA) which imposes an ℓ2 penalty on the CCA coefficients is widely used in applications with high-dimensional data. One limitation of such regularization is that it ignores any data structure, treating all the features equally, which can be ill-suited for some applications. In this article we introduce several approaches to regularizing CCA that take the underlying data structure into account. In particular, the proposed group regularized canonical correlation analysis (GRCCA) is useful when the variables are correlated in groups. We illustrate some computational strategies to avoid excessive computations with regularized CCA in high dimensions. We demonstrate the application of these methods in our motivating application from neuroscience, as well as in a small simulation example.
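Illustrative note (not from the article): in one common formulation, the ℓ2-regularized variant described above adds ridge terms to the within-set covariance matrices. The notation below is an assumption of this note: $\hat\Sigma_{XX}$, $\hat\Sigma_{YY}$, and $\hat\Sigma_{XY}$ denote sample covariance matrices, $\alpha$ and $\beta$ the CCA coefficient vectors, and $\lambda_1, \lambda_2 \ge 0$ the penalty parameters.

```latex
\rho_{\mathrm{RCCA}}
  = \max_{\alpha,\,\beta}\;
    \frac{\alpha^{\top} \hat\Sigma_{XY}\, \beta}
         {\sqrt{\alpha^{\top} \bigl(\hat\Sigma_{XX} + \lambda_1 I\bigr) \alpha}\;
          \sqrt{\beta^{\top} \bigl(\hat\Sigma_{YY} + \lambda_2 I\bigr) \beta}}
```

Setting $\lambda_1 = \lambda_2 = 0$ recovers ordinary CCA. Because the identity matrix penalizes every coefficient equally, this form treats all features alike, which is the limitation that the structured variants in the abstract aim to address.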

7.
Molecules ; 28(19)2023 Sep 26.
Article in English | MEDLINE | ID: mdl-37836648

ABSTRACT

The refractive index (RI) of liquids is a key physical property of molecular compounds and materials. In addition to its ubiquitous role in physics, it is also exploited to impart specific optical properties (transparency, opacity, and gloss) to materials and various end-use products. Since few methods exist to accurately estimate this property, we have designed a graph machine model (GMM) capable of predicting the RI of liquid organic compounds containing up to 16 different types of atoms and effective in discriminating between stereoisomers. Using 8267 carefully checked RI values from the literature and the corresponding 2D organic structures, the GMM provides a training root mean square relative error of less than 0.5%, i.e., an RMSE of 0.004 for the estimation of the refractive index of the 8267 compounds. The GMM predictive ability is also compared to that obtained by several fragment-based approaches. Finally, a Docker-based tool is proposed to predict the RI of organic compounds solely from their SMILES code. The GMM developed is easy to apply, as shown by the video tutorials provided on YouTube.

8.
Biostatistics ; 21(2): 219-235, 2020 04 01.
Article in English | MEDLINE | ID: mdl-30192903

ABSTRACT

We consider high-dimensional regression over subgroups of observations. Our work is motivated by biomedical problems, where subsets of samples, representing for example disease subtypes, may differ with respect to underlying regression models. In the high-dimensional setting, estimating a different model for each subgroup is challenging due to limited sample sizes. Focusing on the case in which subgroup-specific models may be expected to be similar but not necessarily identical, we treat subgroups as related problem instances and jointly estimate subgroup-specific regression coefficients. This is done in a penalized framework, combining an $\ell_1$ term with an additional term that penalizes differences between subgroup-specific coefficients. This gives solutions that are globally sparse but that allow information-sharing between the subgroups. We present algorithms for estimation and empirical results on simulated data and using Alzheimer's disease, amyotrophic lateral sclerosis, and cancer datasets. These examples demonstrate the gains joint estimation can offer in prediction as well as in providing subgroup-specific sparsity patterns.


Subjects
Algorithms, Biomedical Research/methods, Biostatistics/methods, Prognosis, Regression Analysis, Alzheimer Disease/diagnosis, Amyotrophic Lateral Sclerosis/diagnosis, Computer Simulation, Humans, Neoplasms/drug therapy, Research Design
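Illustrative sketch (not from the article): a joint objective of the kind described above can be written as a sum of per-subgroup squared errors, an ℓ1 sparsity term, and a term penalizing differences between subgroup coefficient vectors. The exact loss scaling and the form of the difference penalty are not given in the abstract; the absolute-difference (ℓ1) form, the parameter names `lam` and `gamma`, and the function name are assumptions of this sketch.

```python
from itertools import combinations


def joint_objective(x_by_group, y_by_group, betas, lam, gamma):
    """Evaluate a joint penalized least-squares objective over subgroups.

    x_by_group[k] is a list of feature rows, y_by_group[k] the responses, and
    betas[k] the coefficient vector of subgroup k. The pairwise l1 difference
    penalty is one possible choice, assumed here for illustration.
    """
    loss = 0.0
    for rows, ys, beta in zip(x_by_group, y_by_group, betas):
        for row, y in zip(rows, ys):
            pred = sum(b * v for b, v in zip(beta, row))
            loss += (y - pred) ** 2
    # Global sparsity: l1 norm of every subgroup's coefficients.
    sparsity = sum(abs(b) for beta in betas for b in beta)
    # Information sharing: penalize coefficient differences between subgroups.
    fusion = sum(
        abs(b1 - b2)
        for beta1, beta2 in combinations(betas, 2)
        for b1, b2 in zip(beta1, beta2)
    )
    return loss + lam * sparsity + gamma * fusion
```

With `gamma = 0` the subgroups decouple into separate lasso-type problems; increasing `gamma` pulls the subgroup coefficients toward each other, which is the information-sharing effect the abstract describes.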
9.
Health Care Manag Sci ; 24(4): 716-741, 2021 Dec.
Article in English | MEDLINE | ID: mdl-34031792

ABSTRACT

Early identification of resource needs is instrumental in promoting efficient hospital resource management. Hospital information systems, and electronic health records (EHR) in particular, collect valuable demographic and clinical patient data from the moment patients are admitted, which can help predict expected resource needs in early stages of patient episodes. To this end, this article proposes a data mining methodology to systematically obtain predictions for relevant managerial variables by leveraging structured EHR data. Specifically, these managerial variables are: i) diagnosis categories, ii) procedure codes, iii) diagnosis-related groups (DRGs), iv) outlier episodes and v) length of stay (LOS). The proposed methodology approaches the problem in four stages: feature set construction, feature selection, prediction model development, and model performance evaluation. We tested this approach with an EHR dataset of 5,089 inpatient episodes and compared different classification and regression models (for categorical and continuous variables, respectively), performed temporal analysis of model performance, analyzed the impact of training set homogeneity on performance and assessed the contribution of different EHR data elements for model predictive power. Overall, our results indicate that inpatient EHR data can effectively be leveraged to inform resource management from multiple perspectives. Logistic regression (combined with minimal redundancy maximum relevance feature selection) and bagged decision trees yielded the best results for predicting categorical and numerical managerial variables, respectively. Furthermore, our temporal analysis indicated that, while DRG classes are more difficult to predict, several diagnosis categories, procedure codes and LOS amongst shorter-stay patients can be predicted with higher confidence in early stages of patient stay. Lastly, value of information analysis indicated that diagnoses, medication and structured assessment forms were the most valuable EHR data elements in predicting managerial variables of interest through a data mining approach.


Subjects
Electronic Health Records, Machine Learning, Data Mining, Hospitals, Humans, Logistic Models
10.
J Med Internet Res ; 23(5): e25714, 2021 05 06.
Article in English | MEDLINE | ID: mdl-33835932

ABSTRACT

BACKGROUND: The scale and quality of the global scientific response to the COVID-19 pandemic have unquestionably saved lives. However, the COVID-19 pandemic has also triggered an unprecedented "infodemic"; the velocity and volume of data production have overwhelmed many key stakeholders such as clinicians and policy makers, as they have been unable to process structured and unstructured data for evidence-based decision making. Solutions that aim to alleviate this data synthesis-related challenge are unable to capture heterogeneous web data in real time for the production of concomitant answers and are not based on the high-quality information in responses to a free-text query. OBJECTIVE: The main objective of this project is to build a generic, real-time, continuously updating curation platform that can support the data synthesis and analysis of a scientific literature framework. Our secondary objective is to validate this platform and the curation methodology for COVID-19-related medical literature by expanding the COVID-19 Open Research Dataset via the addition of new, unstructured data. METHODS: To create an infrastructure that addresses our objectives, the PanSurg Collaborative at Imperial College London has developed a unique data pipeline based on a web crawler extraction methodology. This data pipeline uses a novel curation methodology that adopts a human-in-the-loop approach for the characterization of quality, relevance, and key evidence across a range of scientific literature sources. RESULTS: REDASA (Realtime Data Synthesis and Analysis) is now one of the world's largest and most up-to-date sources of COVID-19-related evidence; it consists of 104,000 documents. By capturing curators' critical appraisal methodologies through the discrete labeling and rating of information, REDASA rapidly developed a foundational, pooled, data science data set of over 1400 articles in under 2 weeks. These articles provide COVID-19-related information and represent around 10% of all papers about COVID-19. CONCLUSIONS: This data set can act as ground truth for the future implementation of a live, automated systematic review. The three benefits of REDASA's design are as follows: (1) it adopts a user-friendly, human-in-the-loop methodology by embedding an efficient, user-friendly curation platform into a natural language processing search engine; (2) it provides a curated data set in the JavaScript Object Notation format for experienced academic reviewers' critical appraisal choices and decision-making methodologies; and (3) due to the wide scope and depth of its web crawling method, REDASA has already captured one of the world's largest COVID-19-related data corpora for searches and curation.


Subjects
COVID-19/epidemiology, Natural Language Processing, Search Engine/methods, Statistical Data Interpretation, Datasets as Topic, Humans, Internet, Longitudinal Studies, SARS-CoV-2/isolation & purification
11.
Radiologe ; 61(11): 1005-1013, 2021 Nov.
Article in German | MEDLINE | ID: mdl-34581842

ABSTRACT

CLINICAL ISSUE: Structured reporting has been one of the most discussed topics in radiology for years. Currently, there is a lack of user-friendly software solutions that are integrated into the IT infrastructure of hospitals and practices to allow efficient data entry. STANDARD RADIOLOGICAL METHODS: Radiological reports are mostly generated as free text documents, either dictated via speech recognition systems or typed. In addition, text components are used to create reports of normal findings that can be further edited and complemented by free text. METHODOLOGICAL INNOVATIONS: Software-based reporting systems can combine speech recognition systems with radiological reporting templates in the form of interactive decision trees. A technical integration into RIS ("radiological information system"), PACS ("picture archiving and communication system"), and AV ("advanced visualization") systems via application programming interfaces and interoperability standards can enable efficient processes and the generation of machine-readable report data. PERFORMANCE: Structured and semantically annotated clinical data collected via the reporting system are immediately available for epidemiological data analysis and continuous AI training. EVALUATION: The use of structured reporting in routine radiological diagnostics involves an initial transition phase. A successful implementation further requires close integration of the technical infrastructure of several systems. PRACTICAL RECOMMENDATIONS: By using a hybrid reporting solution, radiological reports with different levels of structure can be generated. Clinical questions or procedural information can be semi-automatically transferred, thereby eliminating avoidable errors and increasing productivity.


Subjects
Radiology Information Systems, Radiology, Humans, Software, Systems Integration, Workflow
12.
J Am Acad Dermatol ; 82(3): 773-775, 2020 03.
Article in English | MEDLINE | ID: mdl-31682858

ABSTRACT

The federal mandate for electronic health record (EHR) keeping for health care providers impacted the burden placed on dermatologists for medical documentation. The hope that EHRs would improve care quality and efficiency and reduce health disparities has yet to be fully realized. Despite the significant time and effort spent on documentation, the majority of EHR clinical data remain unstructured and, therefore, difficult to process and analyze. Structured data can provide a way for dermatologists and data scientists to make more effective use of clinical data, not only to improve the dermatologist's experience with EHRs but also to manage technology-related administrative burden, accelerate understanding of disease, and enhance care delivery for patients. Understanding the importance of structured data will allow dermatologists to actively engage in how clinical data will be collected and used to advance patient care.


Subjects
Dermatology/standards, Electronic Health Records, Patient Care/standards, Quality of Health Care, Skin Diseases/therapy, Documentation/standards, Humans
13.
BMC Med Inform Decis Mak ; 19(1): 242, 2019 11 27.
Article in English | MEDLINE | ID: mdl-31775737

ABSTRACT

BACKGROUND: This study used natural language processing (NLP) and machine learning (ML) techniques to identify reliable patterns within research narrative documents that distinguish studies that complete successfully from those that terminate. Recent research findings have reported that at least 10% of all studies that are funded by major research funding agencies terminate without yielding useful results. Since it is well known that scientific studies that receive funding from major funding agencies are carefully planned and rigorously vetted through the peer-review process, it was somewhat daunting to us that study terminations are this prevalent. Moreover, our review of the literature about study terminations suggested that the reasons for study terminations are not well understood. We therefore aimed to address that knowledge gap by seeking to identify the factors that contribute to study failures. METHOD: We used data from the ClinicalTrials.gov repository, from which we extracted both structured data (study characteristics) and unstructured data (the narrative description of the studies). We applied natural language processing techniques to the unstructured data to quantify the risk of termination by identifying distinctive topics that are more frequently associated with trials that are terminated or with trials that are completed. We used the Latent Dirichlet Allocation (LDA) technique to derive 25 "topics" with corresponding sets of probabilities, which we then used to predict study termination by utilizing random forest modeling. We fit two distinct models: one using only structured data as predictors and another with both structured data and the 25 text topics derived from the unstructured data. RESULTS: In this paper, we demonstrate the interpretive and predictive value of LDA as it relates to predicting clinical trial failure. The results also demonstrate that the combined modeling approach yields robust predictive probabilities in terms of both sensitivity and specificity, relative to a model that utilizes the structured data alone. CONCLUSIONS: Our study demonstrated that the use of topic modeling using LDA significantly raises the utility of unstructured data in better predicting the completion vs. termination of studies. This study sets the direction for future research to evaluate the viability of the designs of health studies.


Subjects
Early Termination of Clinical Trials, Machine Learning, Theoretical Models, Natural Language Processing, Clinical Trials as Topic, Factual Databases, Humans, Narration
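Illustrative sketch (not from the article): the abstract compares models by sensitivity and specificity. Assuming a binary coding in which 1 means "terminated" and 0 means "completed" (a labeling convention chosen for this example), the two measures can be computed from true and predicted labels as follows.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Assumed coding: 1 = terminated (positive class), 0 = completed.
    Requires at least one positive and one negative example in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # share of terminated studies that were flagged
    specificity = tn / (tn + fp)  # share of completed studies that were cleared
    return sensitivity, specificity
```

Reporting both numbers matters when terminated studies are a minority (the abstract cites a rate of at least 10%), because a model that always predicts "completed" would score high accuracy while having zero sensitivity.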
14.
J Digit Imaging ; 32(6): 1044-1051, 2019 12.
Article in English | MEDLINE | ID: mdl-31289979

ABSTRACT

Cancer Care Ontario (CCO) is the clinical advisor to the Ontario Ministry of Health and Long-Term Care for the funding and delivery of cancer services. Data contained in radiology reports are inaccessible for analysis without significant manual cost and effort. Synoptic reporting includes highly structured reporting and discrete data capture, which could unlock these data for clinical and evaluative purposes. To assess the feasibility of implementing synoptic radiology reporting, a trial implementation was conducted at one hospital within CCO's Lung Cancer Screening Pilot for People at High Risk. This project determined that it is feasible to capture synoptic data with some barriers. Radiologists require increased awareness when reporting cases with a large number of nodules due to lack of automation within the system. These challenges may be mitigated by implementation of some report automation. Domains such as pathology and public health reporting have addressed some of these challenges with standardized reports based on interoperable standards, and radiology could borrow techniques from these domains to assist in implementing synoptic reporting. Data extraction from the reports could also be significantly automated to improve the process and reduce the workload in collecting the data. RadLex codes aided the difficult data extraction process, by helping label potential ambiguity with common terms and machine-readable identifiers.


Subjects
Lung Neoplasms/diagnostic imaging, Research Design/statistics & numerical data, Research Report, X-Ray Computed Tomography/methods, Humans, Lung/diagnostic imaging, Ontario, Radiation Dosage, Radiology
15.
Genet Epidemiol ; 41(1): 51-60, 2017 Jan.
Article in English | MEDLINE | ID: mdl-27873357

ABSTRACT

The use of data analytics across the entire healthcare value chain, from drug discovery and development through epidemiology to informed clinical decisions for patients or policy making for public health, has seen an explosion in recent years. The increase in the quantity and variety of data available, together with the improvement of storage capabilities and analytical tools, offers numerous possibilities to all stakeholders (manufacturers, regulators, payers, healthcare providers, decision makers, researchers) but, most importantly, it has the potential to improve general health outcomes if we learn how to exploit it in the right way. This article looks at the different sources of data and the importance of unstructured data. It goes on to summarize current and potential future uses in drug discovery, development, and monitoring as well as in public and personal healthcare, including examples of good practice and recent developments. Finally, we discuss the main practical and ethical challenges to unlocking the full potential of big data in healthcare and conclude that all stakeholders need to work together towards the common goal of making sense of the available data for the common good.


Subjects
Datasets as Topic/statistics & numerical data, Decision Making, Delivery of Health Care, Drug Discovery, Precision Medicine, Public Health, Genomics, Humans
16.
BMC Med Inform Decis Mak ; 18(1): 109, 2018 11 26.
Article in English | MEDLINE | ID: mdl-30477491

ABSTRACT

BACKGROUND: With advancements in information technology, computerized physician order entry (CPOE) and electronic medical records (eMR) have become widely utilized in medical settings. The predominant mode of CPOE in Taiwan is free text entry (FTE). Dynamic structured data entry (DSDE) was introduced more recently and has increasingly drawn attention from hospitals across Taiwan. This study assesses how DSDE compares to FTE for CPOE. METHODS: A quasi-experimental study was employed to investigate the time-savings, productivity, and efficiency effects of DSDE in an outpatient setting in the gynecological department of a major hospital in Taiwan. Trained female actor patients were employed in trials of both entry methods. Data were submitted to Shapiro-Wilk and Shapiro-Francia tests to assess normality, and then to paired t-tests to assess differences between DSDE and FTE. RESULTS: Relative to FTE, the use of DSDE resulted in an average of 97% time saved and 55% more abundant and detailed content in medical records. In addition, for each clause entry in a medical record, the time saved is 133% for DSDE compared to FTE. CONCLUSION: The results suggest that DSDE is a much more efficient and productive entry method for clinicians in hospital outpatient settings. Upgrading eMR systems to the DSDE format would benefit both patients and clinicians.


Subjects
Electronic Health Records, Hospital Departments, Medical Order Entry Systems, Hospital Outpatient Clinics, Adult, Electronic Health Records/organization & administration, Electronic Health Records/standards, Electronic Health Records/statistics & numerical data, Female, Gynecology, Hospital Departments/organization & administration, Hospital Departments/standards, Hospital Departments/statistics & numerical data, Humans, Medical Order Entry Systems/organization & administration, Medical Order Entry Systems/standards, Medical Order Entry Systems/statistics & numerical data, Hospital Outpatient Clinics/organization & administration, Hospital Outpatient Clinics/standards, Hospital Outpatient Clinics/statistics & numerical data, Taiwan
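Illustrative sketch (not from the article): the paired t-test named in the methods compares two measurements taken on the same cases, here hypothetically the entry time for the same scenario under FTE and under DSDE. The function name and the sample timings in the usage note are made up; the statistic is the standard one, the mean of the paired differences divided by its standard error.

```python
import math
import statistics


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for paired samples of equal length.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are the pairwise differences.
    Requires at least two pairs and non-constant differences.
    """
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1
```

For instance, `paired_t_statistic([10.0, 12.0, 9.0, 11.0], [6.0, 7.0, 6.0, 5.0])` evaluates hypothetical per-case timings in minutes. The test assumes the differences are roughly normal, which is why the study checks normality first.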
17.
BMC Genomics ; 18(1): 242, 2017 03 21.
Article in English | MEDLINE | ID: mdl-28327106

ABSTRACT

BACKGROUND: Genomic datasets accompanying scientific publications show a surprisingly high rate of gene name corruption. This error is generated when files and tables are imported into Microsoft Excel and certain gene symbols are automatically converted into dates. RESULTS: We have developed Truke, a flexible Web tool to detect, tag and fix, if possible, such misconversions. In addition, Truke is language and regional locale-aware, providing file format customization (decimal symbol, field separator, etc.) according to the user's preferences. CONCLUSIONS: Truke is a data format conversion tool with a unique corrupted gene symbol detection utility. Truke is freely available without registration at http://maplab.cat/truke.


Subjects
Computational Biology/methods, Genomics/methods, Software, Web Browser
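Illustrative sketch (not Truke's implementation, which the abstract does not describe): date-like cells such as "2-Sep" or "1-Mar" are the typical trace of a gene symbol such as SEPT2 or MARCH1 after a spreadsheet import. The pattern, the month-to-prefix table, and the function name below are assumptions for this example, and they cover only English month abbreviations, whereas the abstract notes that Truke is locale-aware.

```python
import re

# Assumed mapping from English month abbreviations to gene-symbol prefixes.
# It is deliberately small and is not an exhaustive list of affected symbols.
MONTH_TO_PREFIX = {"Mar": "MARCH", "Sep": "SEPT", "Dec": "DEC"}

DATE_LIKE = re.compile(r"^(\d{1,2})-(Mar|Sep|Dec)$", re.IGNORECASE)


def suggest_symbol(cell):
    """Return a candidate gene symbol for a date-like cell, or None.

    The suggestion is only a candidate: several symbols can collapse to the
    same date string, so a flagged cell still needs manual confirmation.
    """
    match = DATE_LIKE.match(cell.strip())
    if match is None:
        return None
    day = int(match.group(1))
    month = match.group(2).capitalize()
    return MONTH_TO_PREFIX[month] + str(day)
```

Here `suggest_symbol("2-Sep")` proposes `SEPT2`, while an ordinary symbol such as `TP53` returns `None`. The conversion is not always reversible, which matches the abstract's wording "fix, if possible".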
18.
Biometrics ; 73(2): 529-539, 2017 06.
Article in English | MEDLINE | ID: mdl-27649087

ABSTRACT

In many scientific and engineering fields, advanced experimental and computing technologies are producing data that are not just high dimensional, but also internally structured. For instance, statistical units may have heterogeneous origins from distinct studies or subpopulations, and features may be naturally partitioned based on experimental platforms generating them, or on information available about their roles in a given phenomenon. In a regression analysis, exploiting this known structure in the predictor dimension reduction stage that precedes modeling can be an effective way to integrate diverse data. To pursue this, we propose a novel Sufficient Dimension Reduction (SDR) approach that we call structured Ordinary Least Squares (sOLS). This combines ideas from existing SDR literature to merge reductions performed within groups of samples and/or predictors. In particular, it leads to a version of OLS for grouped predictors that requires far less computation than recently proposed groupwise SDR procedures, and provides an informal yet effective variable selection tool in these settings. We demonstrate the performance of sOLS by simulation and present a first application to genomic data. The R package "sSDR," publicly available on CRAN, includes all procedures necessary to implement the sOLS approach.


Subjects
Least-Squares Analysis, Biometry, Genomics
19.
IEEE Trans Knowl Data Eng ; 29(3): 698-711, 2017 Mar 01.
Article in English | MEDLINE | ID: mdl-28943741

ABSTRACT

Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose personally identifiable information. Much of this data exhibits weak structure (e.g., text), such that machine learning approaches have been developed to detect and remove identifiers from it. While learning is never perfect, and relying on such approaches to sanitize data can leak sensitive information, a small risk is often acceptable. Our goal is to balance the value of published data and the risk of an adversary discovering leaked identifiers. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted as non-sensitive and 2) an attacker who combines machine learning and manual inspection to uncover leaked identifying information. We introduce a fast iterative greedy algorithm for the publisher that ensures a low utility for a resource-limited adversary. Moreover, using five text data sets we illustrate that our algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data, and completes after at most 5 iterations.

20.
J Med Syst ; 41(2): 29, 2017 Feb.
Article in English | MEDLINE | ID: mdl-28028764

ABSTRACT

The Finnish Patient Data Repository is a nationwide electronic health record (EHR) system collecting patient data from all healthcare providers. The usefulness of the large amount of data stored in the system depends on the underlying data structures, and thus a solid understanding of these structures is in focus in further development of the data repository. This study seeks to improve that understanding by a systematic literature review. The review takes the physician's perspective on the use and usefulness of the data structures. The articles included in this review study data structures intended to be used in the actual care process. Secondary use and nursing aspects have been covered in separate reviews. After applying the predefined inclusion and exclusion criteria, only 40 articles were included in the review. The research on widespread systems in everyday use was especially scarce; most studies concentrated on narrow fields. The majority of the applications studied were developed primarily for specialist use in secondary care units. Most structures or applications studied were at an early stage of development. In many applications the use of structured data was found to improve the completeness of the documented data and facilitate its automated use. However, there seem to be some applications where narrative text cannot be easily replaced by structured data. Usability results regarding structured representation were conflicting. The scattered nature and paucity of research hinder the generalizability of the findings, and from the system design or implementation point of view the practical value of the scientific literature reviewed is limited.


Subjects
Attitude of Health Personnel, Computerized Medical Records Systems/organization & administration, Physicians, Electronic Health Records/organization & administration, Finland, Humans