1.
Front Big Data; 7: 1446071, 2024.
Article in English | MEDLINE | ID: mdl-39314986

ABSTRACT

Data volume has been one of the fastest-growing assets of most real-world applications. This growth increases the rate of human errors such as duplicated records, misspellings, and erroneous transpositions, among other data quality issues. Entity Resolution (ER) is an ETL process that aims to resolve data inconsistencies by ensuring that entity references refer to the same real-world objects. One of the main challenges of most traditional Entity Resolution systems is ensuring their scalability to meet rising data needs. This research refactors a working proof-of-concept entity resolution system, the Data Washing Machine, to be highly scalable using the Apache Spark distributed data processing framework. We solve the single-threaded design problem of the legacy Data Washing Machine by using PySpark's Resilient Distributed Dataset and improve the Data Washing Machine design to use intrinsic metadata information from references. We demonstrate that our system achieves the same results as the legacy Data Washing Machine on 18 synthetically generated datasets. We also test the scalability of our system using a variety of real-world benchmark ER datasets ranging from a few thousand to millions of records. Our experimental results show that our proposed system performs better than a MapReduce-based Data Washing Machine. We also compared our system with Famer and concluded that our system can find more clusters when given optimal starting parameters for clustering.
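A minimal sketch of the kind of PySpark RDD blocking step such a pipeline might use, not the Data Washing Machine's actual code: the record identifiers, field layout, and token-based blocking scheme below are illustrative assumptions.

```python
# Hypothetical sketch: distribute reference records as a PySpark RDD and
# group candidate matches by shared token (a simple blocking scheme).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("er-blocking-sketch").getOrCreate()
sc = spark.sparkContext

# Each record: (record_id, raw reference string) -- illustrative data.
records = sc.parallelize([
    ("r1", "JOHN SMITH 123 MAIN ST SPRINGFIELD"),
    ("r2", "JON SMITH 123 MAIN STREET SPRINGFIELD"),
    ("r3", "MARY JONES 9 OAK AVE DAYTON"),
])

def blocking_keys(pair):
    """Emit (token, record) pairs so records sharing a token land in one block."""
    rec_id, text = pair
    for token in set(text.split()):
        yield (token, (rec_id, text))

# Group candidate matches by shared token; each block can then be compared locally.
blocks = records.flatMap(blocking_keys).groupByKey()

for key, members in blocks.take(5):
    ids = [rid for rid, _ in members]
    if len(ids) > 1:
        print(key, ids)

spark.stop()
```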

2.
Front Big Data; 7: 1296552, 2024.
Article in English | MEDLINE | ID: mdl-38495849

ABSTRACT

Traditional data curation processes typically depend on human intervention. As data volume and variety grow exponentially, organizations are striving to increase the efficiency of their data processes by automating manual steps and making them as unsupervised as possible. An additional challenge is to make these unsupervised processes scalable enough to meet the demands of increased data volume. This paper describes the parallelization of an unsupervised entity resolution (ER) process. ER is a component of many data curation processes because it clusters records from multiple data sources that refer to the same real-world entity, such as the same customer, patient, or product. The ability to scale ER processes is particularly important because the computational effort of ER increases quadratically with data volume. The Data Washing Machine (DWM) is a previously proposed unsupervised ER system that clusters references from diverse data sources. This work addresses the single-threaded nature of the DWM by adopting the parallel processing model of Hadoop MapReduce. The proposed parallelization method can be applied both to supervised systems, where matching rules are created by experts, and to unsupervised systems, where expert intervention is not required. The DWM uses an entropy measure to self-evaluate the quality of record clustering. The current single-threaded implementations of the DWM in Python and Java are not scalable beyond a few thousand records and rely on large shared memory. The objective of this research is to solve the two major shortcomings of the current DWM design, its reliance on shared memory and its lack of scalability, by leveraging Hadoop MapReduce. We propose the Hadoop Data Washing Machine (HDWM), a MapReduce implementation of the legacy DWM. The scalability of the proposed system is demonstrated using publicly available ER datasets. Based on our experimental results, we conclude that the HDWM can cluster from thousands to millions of equivalent references using multiple computational nodes with independent RAM and CPU cores.
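The entropy-based self-evaluation mentioned above could, for illustration, be computed as Shannon entropy over token frequencies within a cluster. The abstract does not give the DWM's exact metric, so the following is an assumed sketch, not the authors' formula.

```python
# Hypothetical cluster-quality score: Shannon entropy of token frequencies
# across a cluster's references. Lower entropy means the references mostly
# agree on their tokens, taken here as a proxy for a coherent cluster.
import math
from collections import Counter

def cluster_entropy(cluster):
    tokens = [tok for ref in cluster for tok in ref.lower().split()]
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

tight = ["john smith springfield", "john smith springfield il"]
loose = ["john smith springfield", "mary jones dayton oh"]
print(cluster_entropy(tight))   # lower: references agree on most tokens
print(cluster_entropy(loose))   # higher: references disagree
```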

4.
Stud Health Technol Inform; 294: 327-331, 2022 May 25.
Article in English | MEDLINE | ID: mdl-35612086

ABSTRACT

Multimorbidity, having a diagnosis of two or more chronic conditions, increases as people age. It is a predictor used in clinical decision-making, but underdiagnosis in underserved populations introduces bias into the data that support algorithms used in healthcare processes. Artificial intelligence (AI) systems could produce inaccurate predictions if patients have multiple unknown conditions. Rural patients are more likely to be underserved and also more likely to have multiple chronic conditions. In this study, in data collected during the course of care at a centrally located academic hospital, multimorbidity decreased with rurality. This decrease suggests a bias against rural patients for algorithms that rely on diagnosis information to calculate risk. To test whether preprocessing can address bias in healthcare data, we measured the amount of discrimination in favor of metropolitan patients in the classification of multimorbidity. We built a model using the biased data to establish optimum classification performance. A new unbiased training data set and model were then created and tested against unaltered validation data. The new model's classification performance on unaltered data did not diverge significantly from that of the initial model trained on the biased data, suggesting that this bias can be removed with preprocessing.
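One common preprocessing approach to the kind of group bias described above is reweighing training examples so that the label no longer correlates with the protected attribute. The study does not specify its preprocessing method, so the sketch below, with a hypothetical rural flag and simulated data, is only illustrative.

```python
# Hypothetical sketch of bias-removal by reweighing (Kamiran-Calders style):
# weight each (group, label) cell so group membership and label appear
# independent in the training data, then train and evaluate on unaltered data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
rural = rng.integers(0, 2, n)                      # 1 = rural patient (assumed flag)
X = rng.normal(size=(n, 5))
# Simulated label with under-recording of multimorbidity for rural patients.
y = ((X[:, 0] + rng.normal(scale=0.5, size=n)) > 0.2 + 0.4 * rural).astype(int)

X_tr, X_va, y_tr, y_va, r_tr, r_va = train_test_split(
    X, y, rural, test_size=0.3, random_state=0)

def reweigh(groups, labels):
    """Weight = P(group) * P(label) / P(group, label) for each training example."""
    w = np.empty(len(labels), dtype=float)
    for g in np.unique(groups):
        for l in np.unique(labels):
            mask = (groups == g) & (labels == l)
            expected = (groups == g).mean() * (labels == l).mean()
            observed = mask.mean()
            w[mask] = expected / observed if observed > 0 else 0.0
    return w

weights = reweigh(r_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr, sample_weight=weights)
print("validation accuracy:", model.score(X_va, y_va))
```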


Subject(s)
Algorithms, Artificial Intelligence, Bias, Delivery of Health Care, Health Facilities, Humans
5.
Appl Clin Inform; 11(4): 622-634, 2020 Aug.
Article in English | MEDLINE | ID: mdl-32968999

ABSTRACT

OBJECTIVE: Rule-based data quality assessment in health care facilities was explored through the compilation, implementation, and evaluation of 63,397 data quality rules in a single-center case study, to assess the ability of rule-based assessment to identify data errors of importance to physicians and system owners.

METHODS: We applied a design science framework to design, demonstrate, test, and evaluate a scalable framework with which data quality rules can be managed and used in health care facilities for data quality assessment and monitoring.

RESULTS: We identified 63,397 rules partitioned into 28 logic templates. A total of 819,683 discrepancies were identified by 4.5% of the rules. Nine of the 11 participating clinical and operational leaders indicated that the rules identified data quality problems and articulated next steps they wanted to take based on the reported information.

DISCUSSION: The combined rule-template and knowledge-table approach makes governance and maintenance of otherwise large rule sets manageable. Identified challenges to rule-based data quality monitoring included the lack of curated and maintained knowledge sources relevant to data error detection and the lack of organizational resources to support clinical and operational leaders in investigating and characterizing data errors and pursuing corrective and preventive actions. Limitations of our study included implementation within a single center and the dependence of the results on the implemented rule set.

CONCLUSION: This study demonstrates a scalable framework (up to 63,397 rules) with which data quality rules can be implemented and managed in health care facilities to identify data errors. The data quality problems identified at the implementation site were important enough to prompt action requests from clinical and operational leaders.
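A minimal sketch of the rule-template plus knowledge-table idea, assumed rather than taken from the study's implementation: a single logic template ("field value must appear in a curated reference table") is instantiated into many rules, each of which reports discrepancies. The field names and table contents below are hypothetical.

```python
# Hypothetical sketch: one rule template instantiated against curated
# knowledge tables, producing a discrepancy report per record.
from dataclasses import dataclass

@dataclass
class MembershipRule:
    field: str
    knowledge_table: frozenset   # curated set of allowed values

    def check(self, record):
        value = record.get(self.field)
        if value not in self.knowledge_table:
            return f"{self.field}={value!r} not in knowledge table"
        return None

# The same template yields many rules by pairing fields with knowledge tables.
rules = [
    MembershipRule("sex", frozenset({"M", "F", "U"})),
    MembershipRule("unit", frozenset({"ICU", "ED", "MED-SURG"})),
]

records = [
    {"sex": "M", "unit": "ICU"},
    {"sex": "X", "unit": "Ward 9"},   # two discrepancies
]

for i, rec in enumerate(records):
    for rule in rules:
        problem = rule.check(rec)
        if problem:
            print(f"record {i}: {problem}")
```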


Subject(s)
Electronic Health Records, Medical Informatics/methods, Humans, Quality Control