Results 1 - 20 of 95
1.
Proc Natl Acad Sci U S A ; 120(20): e2220789120, 2023 May 16.
Article in English | MEDLINE | ID: mdl-37155896

ABSTRACT

Machine learning (ML) is causing profound changes to chemical research through its powerful statistical and mathematical methodological capabilities. However, the nature of chemistry experiments often sets very high hurdles to collecting high-quality, deficiency-free data, which conflicts with ML's need to learn from big data. Even worse, the black-box nature of most ML methods requires more abundant data to ensure good transferability. Herein, we combine physics-based spectral descriptors with a symbolic regression method to establish interpretable spectra-property relationships. Using the machine-learned mathematical formulas, we have predicted the adsorption energy and charge transfer of the CO-adsorbed Cu-based MOF systems from their infrared and Raman spectra. The explicit prediction models are robust, allowing them to be transferable to small, low-quality datasets containing partial errors. Surprisingly, they can also be used to identify and clean erroneous data, a common scenario in real experiments. Such a robust learning protocol will significantly enhance the applicability of machine-learned spectroscopy for chemical science.
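As an illustration of the kind of interpretable model this abstract describes, the sketch below fits a closed-form expression to synthetic spectral descriptors with the gplearn library. gplearn is only a stand-in for the authors' symbolic regression method, and the descriptor names and synthetic target are invented placeholders, not their data.

```python
# Minimal symbolic-regression sketch: learn an explicit formula mapping spectral
# descriptors to a target property. All feature names and data are hypothetical.
import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.uniform(2000, 2150, n),   # hypothetical C-O stretch frequency (cm^-1)
    rng.uniform(0.1, 1.0, n),     # hypothetical IR peak intensity
    rng.uniform(300, 500, n),     # hypothetical Raman shift (cm^-1)
])
# Synthetic "adsorption energy" with a known functional form plus noise.
y = -0.002 * (X[:, 0] - 2100) + 0.5 * X[:, 1] + rng.normal(0, 0.02, n)

est = SymbolicRegressor(
    population_size=1000, generations=20,
    function_set=("add", "sub", "mul", "div"),
    parsimony_coefficient=0.001, random_state=0,
)
est.fit(X, y)
print(est._program)   # the machine-learned closed-form expression
```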

2.
Behav Res Methods ; 56(4): 3242-3258, 2024 04.
Article in English | MEDLINE | ID: mdl-38129734

ABSTRACT

It is common for some participants in self-report surveys to be careless, inattentive, or lacking in effort. Data quality can be severely compromised by responses that are not based on item content (non-content-based [nCB] responses), leading to strong biases in the results of data analysis and misinterpretation of individual scores. In this study, we propose a specification of factor mixture analysis (FMA) to detect nCB responses. We investigated the usefulness and effectiveness of the FMA model in detecting nCB responses using both simulated data (Study 1) and real data (Study 2). In the first study, FMA showed reasonably robust sensitivity (.60 to .86) and excellent specificity (.96 to .99) on mixed-worded scales, suggesting that FMA had superior properties as a screening tool under different sample conditions. However, FMA performance was poor on scales composed of only positive items because of the difficulty in distinguishing acquiescent patterns from valid responses representing high levels of the trait. In Study 2 (real data), FMA detected a minority of cases (6.5%) with highly anomalous response patterns. Removing these cases resulted in a large increase in the fit of the unidimensional model and a substantial reduction in spurious multidimensionality.
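A much simpler content-based screen than the paper's factor mixture model is to flag respondents who agree strongly with both positively and negatively worded items of the same mixed-worded scale, a pattern more typical of acquiescent or careless responding than of a trait. The sketch below shows that proxy only; the item names are hypothetical.

```python
# Proxy screen for non-content-based responding (not FMA): flag respondents with
# high mean agreement on both item polarities. Column names are hypothetical.
import pandas as pd

positive_items = ["ext1", "ext3", "ext5"]          # positively worded items
negative_items = ["ext2_r", "ext4_r", "ext6_r"]    # negatively worded items (unreversed scores)

def flag_inconsistent(df: pd.DataFrame, threshold: float = 4.0) -> pd.Series:
    """True for respondents whose mean agreement is high on both polarities."""
    pos_mean = df[positive_items].mean(axis=1)
    neg_mean = df[negative_items].mean(axis=1)
    return (pos_mean >= threshold) & (neg_mean >= threshold)

# Usage: responses = pd.read_csv("survey.csv"); clean = responses[~flag_inconsistent(responses)]
```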


Subjects
Self Report, Humans, Factor Analysis, Surveys and Questionnaires, Statistical Data Interpretation, Statistical Models
3.
Aust Crit Care ; 37(5): 827-833, 2024 Sep.
Article in English | MEDLINE | ID: mdl-38600009

ABSTRACT

BACKGROUND: Data cleaning is the series of procedures performed before a formal statistical analysis, with the aim of reducing the number of erroneous values in a dataset and improving the overall quality of subsequent analyses. Several study-reporting guidelines recommend the inclusion of data-cleaning procedures; however, little practical guidance exists for how to conduct these procedures. OBJECTIVES: This paper aimed to provide practical guidance for how to perform and report rigorous data-cleaning procedures. METHODS: A previously proposed data-quality framework was identified and used to facilitate the description and explanation of data-cleaning procedures. The broader data-cleaning process was broken down into discrete tasks to create a data-cleaning checklist. Examples of how the various tasks were undertaken for a previous study using data from the Australia and New Zealand Intensive Care Society Adult Patient Database were also provided. RESULTS: Data-cleaning tasks were described and grouped according to the four data-quality domains described in the framework: data integrity, consistency, completeness, and accuracy. Tasks described include creation of a data dictionary, checking consistency of values across multiple variables, quantifying and managing missing data, and the identification and management of outlier values. The data-cleaning task checklist provides a practical summary of the various aspects of the data-cleaning process and will assist clinician researchers in performing this process in the future. CONCLUSIONS: Data cleaning is an integral part of any statistical analysis and helps ensure that study results are valid and reproducible. Use of the data-cleaning task checklist will facilitate the conduct of rigorous data-cleaning processes, with the aim of improving the quality of future research.
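A few of the checklist tasks named here (quantifying missing data, checking cross-variable consistency, flagging implausible values) reduce to a handful of dataframe operations. The column names, file name, and plausible ranges below are hypothetical, not drawn from the ANZICS Adult Patient Database.

```python
# Sketch of three checklist tasks: completeness, consistency, accuracy.
import pandas as pd

df = pd.read_csv("icu_cohort.csv")   # hypothetical file

# Completeness: quantify missingness per variable before deciding how to handle it.
missing_report = df.isna().mean().sort_values(ascending=False)

# Consistency: values must agree across related variables.
inconsistent = df[df["icu_discharge_date"] < df["icu_admission_date"]]

# Accuracy: flag values outside plausible ranges as candidate outliers.
plausible = {"age": (16, 110), "weight_kg": (20, 300), "heart_rate": (0, 300)}
outlier_flags = {
    col: ~df[col].between(lo, hi) & df[col].notna()
    for col, (lo, hi) in plausible.items()
}
print(missing_report.head(), len(inconsistent),
      {k: int(v.sum()) for k, v in outlier_flags.items()})
```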


Subjects
Checklist, Data Accuracy, Humans, Research Design, Statistical Data Interpretation, Australia, New Zealand
4.
Cytometry A ; 103(1): 71-81, 2023 01.
Article in English | MEDLINE | ID: mdl-35796000

ABSTRACT

Technical artifacts such as clogging that occur during the acquisition of flow cytometry data can cause spurious events and fluorescence intensity shifts that degrade the quality of the data and its analysis results. These events should be identified and potentially removed before being passed to the next stage of analysis. flowCut, an R package, automatically detects anomalous events in flow cytometry experiments and flags files for potential review. Its results are on par with manual analysis, and it outperforms existing automated approaches.
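flowCut itself is an R/Bioconductor package; the fragment below only illustrates the general idea of time-segment screening in Python — split events by acquisition time and flag segments whose median signal drifts far from the file-wide median. It is not flowCut's algorithm, and the robust-z cutoff is an arbitrary choice.

```python
# Generic illustration: flag events in acquisition-time segments whose median
# fluorescence deviates strongly from the file-wide median.
import numpy as np

def flag_bad_segments(time, signal, n_segments=100, z_cut=3.5):
    """Return a boolean mask over events belonging to anomalous time segments."""
    order = np.argsort(time)
    seg_ids = np.array_split(order, n_segments)
    seg_medians = np.array([np.median(signal[idx]) for idx in seg_ids])
    mad = np.median(np.abs(seg_medians - np.median(seg_medians))) + 1e-9
    robust_z = 0.6745 * (seg_medians - np.median(seg_medians)) / mad
    bad = np.zeros_like(time, dtype=bool)
    for idx, z in zip(seg_ids, robust_z):
        if abs(z) > z_cut:
            bad[idx] = True
    return bad
```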


Subjects
Flow Cytometry, Flow Cytometry/methods, Computational Biology
5.
BMC Med Res Methodol ; 23(1): 15, 2023 01 16.
Article in English | MEDLINE | ID: mdl-36647014

ABSTRACT

INTRODUCTION: Surveys are common research tools, and questionnaire revisions are a common occurrence in longitudinal studies. Revisions can, at times, introduce systematic shifts in measures of interest. We model questionnaire revisions as a stochastic process with transition matrices; revision shifts can therefore be reduced by first estimating these transition matrices and then using them in the estimation of the measures of interest. MATERIALS AND METHOD: An ideal survey response model is defined by mapping the true value of a participant's response to an interval on the grouped-data scale. A population completed surveys multiple times, modeled with multiple stochastic processes, including processes related to true values and to intervals. While multiple factors contribute to changes in survey responses, here we explored a method that can mitigate the effects of questionnaire revision. We propose the Version Alignment Method (VAM), a data preprocessing tool that separates the transitions attributable to revisions from all transitions by solving an optimization problem and then uses the revision-related transitions to remove the revision effect. To verify VAM, we used simulated data to study the estimation error, and a real-life MJ dataset containing a large volume of long-term questionnaire responses spanning several questionnaire revisions to study its feasibility. RESULT: We compared the difference of the annual average between consecutive years. Without adjustment, the difference was 0.593 when the revision occurred, while VAM brought it down to 0.115; the difference between years without a revision was in the 0.005-0.125 range. Furthermore, our method rendered the responses on the same set of intervals, making it possible to compare the relative frequency of items before and after revisions. The average estimation error in the L-infinity norm was 0.0044, which fell within the 95% CI constructed by bootstrap analysis. CONCLUSION: Questionnaire revisions can induce different response biases and information loss, causing inconsistencies in the estimated measures. Conventional methods can only partly remedy this issue. Our proposal, VAM, can estimate the aggregate difference of all revision-related systematic errors and reduce it, thereby reducing inconsistencies in the final estimates of longitudinal studies.
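The sketch below is a rough, simplified rendering of the transition-matrix idea, not the authors' exact optimization: it estimates a non-negative, row-stochastic matrix mapping old-version interval frequencies to new-version frequencies from several strata observed under both versions, then maps new-version distributions back onto the old intervals. The function names and least-squares formulation are assumptions.

```python
# Rough sketch of estimating a revision transition matrix and aligning versions.
import numpy as np
from scipy.optimize import nnls

def estimate_transition(P_old, P_new):
    """P_old: (strata x m_old) interval frequencies; P_new: (strata x m_new)."""
    m_old, m_new = P_old.shape[1], P_new.shape[1]
    T = np.zeros((m_old, m_new))
    for j in range(m_new):                     # non-negative least squares per new interval
        T[:, j], _ = nnls(P_old, P_new[:, j])
    T /= T.sum(axis=1, keepdims=True) + 1e-12  # crude projection onto row-stochastic matrices
    return T

def align_to_old_scale(p_new, T):
    # Express a new-version frequency vector on the old intervals (least-squares inverse).
    sol, *_ = np.linalg.lstsq(T.T, p_new, rcond=None)
    sol = np.clip(sol, 0, None)
    return sol / sol.sum()
```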


Subjects
Prosthesis Failure, Humans, Time, Surveys and Questionnaires, Reoperation
6.
Sensors (Basel) ; 23(4)2023 Feb 05.
Article in English | MEDLINE | ID: mdl-36850388

ABSTRACT

The Internet of Things (IoT) combines different sources of collected data which are processed and analyzed to support smart city applications. Machine learning and deep learning algorithms play a vital role in edge intelligence by minimizing the amount of irrelevant data collected from multiple sources to facilitate these smart city applications. However, the data collected by IoT sensors can often be noisy, redundant, and even empty, which can negatively impact the performance of these algorithms. To address this issue, it is essential to develop effective methods for detecting and eliminating irrelevant data to improve the performance of intelligent IoT applications. One approach to achieving this goal is using data cleaning techniques, which can help identify and remove noisy, redundant, or empty data from the collected sensor data. This paper proposes a deep reinforcement learning (deep RL) framework for IoT sensor data cleaning. The proposed system utilizes a deep Q-network (DQN) agent to classify sensor data into three categories: empty, garbage, and normal. The DQN agent receives input from three received signal strength (RSS) values, indicating the current and two previous sensor data points, and receives reward feedback based on its predicted actions. Our experiments demonstrate that the proposed system outperforms a common time-series-based fully connected neural network (FCDQN) solution, with an accuracy of around 96% after the exploration mode. The use of deep RL for IoT sensor data cleaning is significant because it has the potential to improve the performance of intelligent IoT applications by eliminating irrelevant and harmful data.
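To make the RL formulation concrete, here is a deliberately simplified tabular Q-learning stand-in for the paper's deep Q-network: the state is a coarsely discretized triple of the current and two previous RSS readings, the actions are the three labels, and the +1/-1 reward against a ground-truth label is an assumption about the reward scheme.

```python
# Simplified tabular Q-learning stand-in for the described DQN sensor-data classifier.
import random
from collections import defaultdict

ACTIONS = ["empty", "garbage", "normal"]

def discretize(rss, bins=(-100, -85, -70, -55)):
    return sum(rss > b for b in bins)       # coarse RSS bucket, 0..len(bins)

def train(stream, labels, episodes=10, alpha=0.1, gamma=0.9, eps=0.2):
    """stream: list of RSS floats; labels: ground-truth label per time step."""
    Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
    for _ in range(episodes):
        for t in range(2, len(stream)):
            state = tuple(discretize(stream[i]) for i in (t - 2, t - 1, t))
            action = (random.choice(ACTIONS) if random.random() < eps
                      else max(Q[state], key=Q[state].get))
            reward = 1.0 if action == labels[t] else -1.0   # hypothetical reward scheme
            nxt = tuple(discretize(stream[i]) for i in (t - 1, t, min(t + 1, len(stream) - 1)))
            Q[state][action] += alpha * (reward + gamma * max(Q[nxt].values()) - Q[state][action])
    return Q
```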

7.
BMC Bioinformatics ; 23(1): 488, 2022 Nov 16.
Article in English | MEDLINE | ID: mdl-36384457

ABSTRACT

BACKGROUND: RNA-seq has become a standard technology to quantify mRNA. The measured values usually vary by several orders of magnitude, and while the detection of differences at high values is statistically well grounded, the significance of the differences for rare mRNAs can be weakened by the presence of biological and technical noise. RESULTS: We have developed a method for cleaning RNA-seq data, which improves the detection of differentially expressed genes and specifically genes with low to moderate transcription. Using a data modeling approach, the parameters of randomly distributed mRNA counts are identified, and reads most probably originating from technical noise are removed. We demonstrate that the removal of this random component leads to a significant increase in the number of detected differentially expressed genes, more significant p-values, and no bias towards low-count genes. CONCLUSION: Application of RNAdeNoise to our RNA-seq data on polysome profiling and several published RNA-seq datasets reveals its suitability for different organisms and sequencing technologies such as Illumina and BGI, shows improved detection of differentially expressed genes, and removes the subjective setting of thresholds for minimal RNA counts. The program, RNA-seq data, resulting gene lists, and examples of use are available in the supplementary data and at https://github.com/Deyneko/RNAdeNoise.
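This is not RNAdeNoise's actual model, just a rough illustration of the underlying idea: estimate a Poisson-like background rate from low-expression genes and discard counts that are statistically indistinguishable from that background. The quantile and cutoff parameters are arbitrary assumptions.

```python
# Rough sketch: remove counts consistent with an estimated Poisson-like background.
import numpy as np
from scipy.stats import poisson

def denoise_counts(counts, noise_quantile=0.25, tail_prob=0.01):
    """counts: genes x samples integer matrix."""
    counts = np.asarray(counts)
    totals = counts.sum(axis=1)
    low_expr = counts[totals <= np.quantile(totals, noise_quantile)]
    lam = low_expr.mean() if low_expr.size else 0.0           # background rate per cell
    threshold = poisson.ppf(1 - tail_prob, mu=max(lam, 1e-9)) # counts above this count as signal
    cleaned = np.where(counts > threshold, counts, 0)
    return cleaned, threshold
```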


Subjects
High-Throughput Nucleotide Sequencing, RNA, RNA-Seq, RNA Sequence Analysis/methods, Messenger RNA
8.
J Anim Ecol ; 91(2): 287-307, 2022 02.
Article in English | MEDLINE | ID: mdl-34657296

ABSTRACT

Modern, high-throughput animal tracking increasingly yields 'big data' at very fine temporal scales. At these scales, location error can exceed the animal's step size, leading to mis-estimation of behaviours inferred from movement. 'Cleaning' the data to reduce location errors is one of the main ways to deal with position uncertainty. Although data cleaning is widely recommended, inclusive, uniform guidance on this crucial step, and on how to organise the cleaning of massive datasets, is relatively scarce. A pipeline for cleaning massive high-throughput datasets must balance ease of use and computational efficiency, rejecting location errors while preserving valid animal movements. Another useful feature of a pre-processing pipeline is efficient segmentation and clustering of location data for statistical methods, while remaining scalable to large datasets and robust to imperfect sampling. Because manual methods are prohibitively time-consuming, and to boost reproducibility, pre-processing pipelines must be automated. We provide guidance on building pipelines for pre-processing high-throughput animal tracking data to prepare it for subsequent analyses. We apply our proposed pipeline to simulated movement data with location errors, and also show how large volumes of cleaned data can be transformed into biologically meaningful 'residence patches' for exploratory inference on animal space use. We use tracking data from the Wadden Sea ATLAS system (WATLAS) to show how pre-processing improves its quality, and to verify the usefulness of the residence patch method. Finally, with tracks from Egyptian fruit bats Rousettus aegyptiacus, we demonstrate the pre-processing pipeline and residence patch method in a fully worked-out example. To help with fast implementation of standardised methods, we developed the R package atlastools, which we also introduce here. Our pre-processing pipeline and atlastools can be used with any high-throughput animal movement data in which the high data volume, combined with knowledge of the tracked individuals' movement capacity, can be used to reduce location errors. atlastools is easy to use for beginners while providing a template for further development. The common use of simple yet robust pre-processing steps promotes standardised methods in the field of movement ecology and leads to better inferences from data.
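atlastools is an R package; the fragment below only sketches, in Python, one of the core filters the abstract alludes to: dropping fixes that imply speeds beyond the animal's known movement capacity. The column names and the 15 m/s threshold are assumptions for illustration.

```python
# Speed filter: remove fixes implying implausible between-fix speeds.
import numpy as np
import pandas as pd

def filter_by_speed(df: pd.DataFrame, max_speed_ms: float = 15.0) -> pd.DataFrame:
    """df needs columns x, y (metres) and time (seconds)."""
    d = df.sort_values("time").reset_index(drop=True)
    dist = np.hypot(d["x"].diff(), d["y"].diff())
    speed = dist / d["time"].diff()
    keep = speed.isna() | (speed <= max_speed_ms)   # keep first fix and plausible moves
    return d[keep]
```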


Subjects
Movement, Animals, Reproducibility of Results
9.
Am J Bot ; 109(9): 1488-1507, 2022 09.
Article in English | MEDLINE | ID: mdl-36039662

ABSTRACT

PREMISE: There has been a great increase in using climatic data in phylogenetic studies over the past decades. However, compiling the high-quality spatial data needed to perform accurate climatic reconstructions is time-consuming and can result in poor geographical coverage. Therefore, researchers often resort to qualitative approximations. Our aim was to evaluate the climatic characterization of the genera of the Asian Palmate Group (AsPG) of Araliaceae as an exemplar lineage of plants showing tropical-temperate transitions. METHODS: We compiled a curated worldwide spatial database of the AsPG genera and created five raster layers representing bioclimatic regionalizations of the world. Then, we crossed the database with the layers to climatically characterize the AsPG genera. RESULTS: We found large disagreement in the climatic characterization of genera among regionalizations and little support for the climatic nature of the tropical-temperate distribution of the AsPG. Both results are attributed to the complexity of delimiting tropical, subtropical, and temperate climates in the world and to the distribution of the study group in regions with transitional climatic conditions. CONCLUSIONS: The complexity in the climatic classification of this example of the tropical-temperate transitions calls for a general climatic revision of other tropical-temperate lineages. In fact, we argue that, to properly evaluate tropical-temperate transitions across the tree of life, we cannot ignore the complexity of distribution ranges.


Subjects
Araliaceae, Biodiversity, Climate, Geography, Phylogeny, Plants
10.
Autom Softw Eng ; 29(2): 52, 2022.
Article in English | MEDLINE | ID: mdl-36065351

ABSTRACT

Textual documents produced in the software engineering process are a popular target for natural language processing (NLP) and information retrieval (IR) approaches. However, issue tickets often contain artifacts such as code snippets, log outputs and stack traces. These artifacts not only inflate issue ticket sizes, but this noise can also constitute a real problem for some NLP approaches and therefore has to be removed during pre-processing. In this paper, we present a machine learning based approach to classify textual content into natural language and non-natural language artifacts at line level. We show how data from GitHub issue trackers can be used for automated training set generation, and present a custom preprocessing approach for the task of artifact removal. The training sets are automatically created from Markdown-annotated issue tickets and project documentation files. We use these generated training sets to train a Markdown-agnostic model that is able to classify un-annotated content. We evaluate our approach on issue tickets from projects written in C++, Java, JavaScript, PHP, and Python. Our approach achieves ROC-AUC scores between 0.92 and 0.96 for language-specific models. A multi-language model trained on the issue tickets of all languages achieves ROC-AUC scores between 0.92 and 0.95. The provided models are intended to be used as noise-reduction pre-processing steps for NLP and IR approaches working on issue tickets.
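A minimal sketch of line-level classification in this spirit, assuming character n-gram TF-IDF features and logistic regression as a stand-in for the paper's model; the four training lines and their labels are invented toy examples.

```python
# Classify issue-ticket lines as natural language vs. artifact (code/log/stack trace).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

lines = [
    "The app crashes when I click the export button.",                       # natural language
    "Steps to reproduce: open a project and press Ctrl+E.",                  # natural language
    "java.lang.NullPointerException: at com.example.Foo.bar(Foo.java:42)",   # artifact
    "    for (int i = 0; i < n; i++) { sum += a[i]; }",                      # artifact
]
labels = ["nl", "nl", "artifact", "artifact"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(lines, labels)
print(clf.predict(["Exception in thread main java.lang.ArrayIndexOutOfBoundsException"]))
```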

11.
BMC Med Inform Decis Mak ; 21(1): 267, 2021 09 17.
Article in English | MEDLINE | ID: mdl-34535146

ABSTRACT

BACKGROUND: The use of Electronic Health Records (EHR) data in clinical research is increasing rapidly, but the abundance of data resources raises the challenge of data cleaning, and considerable time can be saved if cleaning is done automatically. In addition, automated data cleaning tools built for other domains often process all variables uniformly, so they do not serve clinical data well, where variable-specific information needs to be considered. This paper proposes an automated data cleaning method for EHR data that takes clinical knowledge into consideration. METHODS: We used EHR data collected from primary care in Flanders, Belgium during 1994-2015. We constructed a Clinical Knowledge Database to store all the variable-specific information necessary for data cleaning. We applied fuzzy search to automatically detect and replace wrongly spelled units, and performed unit conversion following the variable-specific conversion formula. Numeric values were then corrected and outliers detected in light of the clinical knowledge. In total, 52 clinical variables were cleaned, and the percentage of missing values (completeness) and the percentage of values within the normal range (correctness) before and after the cleaning process were compared. RESULTS: All variables were 100% complete before data cleaning. Forty-two variables had a drop of less than 1% in completeness and nine variables declined by 1-10%. Only one variable experienced a large decline in completeness (13.36%). All variables had more than 50% of values within the normal range after cleaning, of which 43 variables had a percentage higher than 70%. CONCLUSIONS: We propose a general method for clinical variables that achieves a high degree of automation and is capable of dealing with large-scale data. This method greatly improved the efficiency of data cleaning and removed technical barriers for non-technical users.
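The sketch below illustrates the described steps — fuzzy-matching a misspelled unit against a knowledge table, converting to a canonical unit, then range-checking — using Python's difflib. The variable name, unit spellings, conversion factors, and normal range are hypothetical stand-ins for the paper's Clinical Knowledge Database.

```python
# Fuzzy unit correction, unit conversion, and range check against variable-specific knowledge.
import difflib

KNOWLEDGE = {
    "glucose": {
        "canonical_unit": "mg/dL",
        "conversions": {"mg/dl": 1.0, "mg/dL": 1.0, "mmol/l": 18.0, "mmol/L": 18.0},
        "normal_range": (40, 400),   # hypothetical plausible range in mg/dL
    }
}

def clean_value(variable, value, unit):
    info = KNOWLEDGE[variable]
    known = list(info["conversions"])
    match = difflib.get_close_matches(unit, known, n=1, cutoff=0.6)  # fixes misspelled units
    if not match:
        return None, "unknown_unit"
    converted = value * info["conversions"][match[0]]
    lo, hi = info["normal_range"]
    status = "ok" if lo <= converted <= hi else "outlier"
    return converted, status

print(clean_value("glucose", 6.1, "mmoI/L"))   # misspelled unit, converted to ~109.8 mg/dL
```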


Subjects
Electronic Health Records, Primary Health Care, Automation, Belgium, Factual Databases, Humans
12.
Sensors (Basel) ; 21(21)2021 Oct 25.
Article in English | MEDLINE | ID: mdl-34770383

ABSTRACT

To improve the recognition rate of chip appearance defects, an algorithm based on a convolutional neural network is proposed to identify chip appearance defects of various shapes and features. Furthermore, to address the problems of long training time and low accuracy caused by redundant input samples, an automatic data sample cleaning algorithm based on prior knowledge is proposed to reduce training and classification time, as well as improve the recognition rate. First, defect positions are determined by performing image processing and region-of-interest extraction. Subsequently, interference samples between chip defects are analyzed for data cleaning. Finally, a chip appearance defect classification model based on a convolutional neural network is constructed. The experimental results show that the algorithm's missed-detection rate is zero and its accuracy exceeds 99.5%, thereby fulfilling industry requirements.


Subjects
Algorithms, Neural Networks (Computer), Computer-Assisted Image Processing, Recognition (Psychology), Research Design
13.
Sensors (Basel) ; 21(9)2021 Apr 22.
Article in English | MEDLINE | ID: mdl-33922298

ABSTRACT

The aim of this paper is to provide an extended analysis of outlier detection, using probabilistic and AI techniques, applied in a demand-response demo pilot project in blocks of buildings, based on real experiments and energy data collection with detected anomalies. A numerical algorithm was created to differentiate between natural energy peaks and outliers, so that data cleaning could be applied first. Then, a calculation of the impact on the energy baseline for the demand-response computation was implemented, with improved precision compared with other referenced methods and with the original data processing. For the demo pilot project implemented in the Technical University of Cluj-Napoca block of buildings, without cleaning the energy baseline data, in some cases it was impossible to compute the established key performance indicators (peak power reduction, energy savings, cost savings, CO2 emissions reduction), or the resulting values were far higher (>50%) and unrealistic. Therefore, in real-case business models, it is crucial to use outlier removal. In the past years, both companies and academic communities have joined efforts to generate new abstractions, interfaces, approaches for scalability, and crowdsourcing techniques. Quantitative and qualitative methods have been created with the aim of error reduction and have been covered in multiple surveys and overviews of outlier detection.
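One simple way to avoid flagging recurring daily peaks as outliers — not necessarily the project's exact numerical algorithm — is to compare each reading with the distribution of readings for the same hour of day, as sketched below; the z-score cutoff is an arbitrary assumption.

```python
# Flag energy-baseline outliers while sparing natural (recurring, same-hour) peaks.
import pandas as pd

def flag_energy_outliers(series: pd.Series, z_cut: float = 4.0) -> pd.Series:
    """series: energy readings indexed by a DatetimeIndex."""
    hour = series.index.hour
    mu = series.groupby(hour).transform("mean")
    sigma = series.groupby(hour).transform("std").replace(0, float("nan"))
    z = (series - mu) / sigma
    return z.abs() > z_cut   # True = candidate outlier to remove before baselining
```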

14.
BMC Med Inform Decis Mak ; 20(1): 293, 2020 11 13.
Article in English | MEDLINE | ID: mdl-33187520

ABSTRACT

BACKGROUND: The District Health Information Software-2 (DHIS2) is widely used by countries for national-level aggregate reporting of health data. To best leverage DHIS2 data for decision-making, countries need to ensure that data within their systems are of the highest quality. Comprehensive, systematic, and transparent data cleaning approaches form a core component of preparing DHIS2 data for analyses. Unfortunately, there is a paucity of exhaustive and systematic descriptions of data cleaning processes employed on DHIS2-based data. The aim of this study was to report on the methods and results of a systematic and replicable data cleaning approach applied to HIV data gathered within DHIS2 from 2011 to 2018 in Kenya, for secondary analyses. METHODS: Six programmatic area reports containing HIV indicators were extracted from DHIS2 for all care facilities in all counties in Kenya from 2011 to 2018. Data variables extracted included reporting rate, reporting timeliness, and HIV-indicator data elements per facility per year. In total, 93,179 facility-records from 11,446 health facilities were extracted for the years 2011 to 2018. Van den Broeck et al.'s framework, involving repeated cycles of a three-phase process (data screening, data diagnosis and data treatment), was employed semi-automatically within a generic five-step data-cleaning sequence, which was developed and applied in cleaning the extracted data. Various quality issues were identified, and a Friedman analysis of variance was conducted to examine differences in the distribution of records with selected issues across the eight years. RESULTS: Facility-records with no data accounted for 50.23% and were removed. Of the remaining records, 0.03% had reporting rates above 100%. Of facility-records with reporting data, only 0.66% and 0.46% were retained for the voluntary medical male circumcision and blood safety programmatic area reports respectively, given that few facilities submitted data or offered these services. The distribution of facility-records with selected quality issues varied significantly by programmatic area (p < 0.001). The final clean dataset was suitable for subsequent secondary analyses. CONCLUSIONS: Comprehensive, systematic, and transparent reporting of the cleaning process is important for the validity of research studies as well as for data utilization. The semi-automatic procedures used resulted in improved data quality for use in secondary analyses, which could not have been secured by automated procedures alone.
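The screening and treatment steps described here translate naturally into a few dataframe operations. The sketch below assumes a flat per-facility-per-year extract with hypothetical column names (not the actual DHIS2 export schema) and uses counties as blocks for the Friedman test.

```python
# Screening sketch: drop empty facility-records, flag impossible reporting rates,
# and test whether flagged-record proportions differ across years.
import pandas as pd
from scipy.stats import friedmanchisquare

df = pd.read_csv("dhis2_hiv_extract.csv")            # hypothetical extract, 2011-2018

indicator_cols = [c for c in df.columns if c.startswith("hiv_")]
empty = df[indicator_cols].isna().all(axis=1)        # facility-records with no data
df = df[~empty]

df["rate_gt_100"] = df["reporting_rate"] > 100       # reporting rates above 100%

# Friedman test across years on the per-county proportion of flagged records.
pivot = df.groupby(["county", "year"])["rate_gt_100"].mean().unstack("year").dropna()
stat, p = friedmanchisquare(*[pivot[y] for y in pivot.columns])
print(f"records kept: {len(df)}, Friedman chi2={stat:.2f}, p={p:.4f}")
```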


Subjects
Data Accuracy, HIV Infections, Health Information Systems, HIV Infections/epidemiology, HIV Infections/prevention & control, Humans, Kenya, Software
15.
Int J Biometeorol ; 64(11): 1825-1833, 2020 Nov.
Article in English | MEDLINE | ID: mdl-32671668

ABSTRACT

Citizen science involves public participation in research, usually through volunteer observation and reporting. Data collected by citizen scientists are a valuable resource in many fields of research that require long-term observations at large geographic scales. However, such data may be perceived as less accurate than those collected by trained professionals. Here, we analyze the quality of data from a plant phenology network, which tracks biological response to climate change. We apply five algorithms designed to detect outlier observations or inconsistent observers. These methods rely on different quantitative approaches, including residuals of linear models, correlations among observers, deviations from multivariate clusters, and percentile-based outlier removal. We evaluated these methods by comparing the resulting cleaned datasets in terms of time series means, spatial data coverage, and spatial autocorrelations after outlier removal. Spatial autocorrelations were used to determine the efficacy of outlier removal, as they are expected to increase if outliers and inconsistent observations are successfully removed. All data cleaning methods resulted in better Moran's I autocorrelation statistics, with percentile-based outlier removal and the clustering method showing the greatest improvement. Methods based on residual analysis of linear models had the strongest impact on the final bloom time mean estimates, but were among the weakest based on autocorrelation analysis. Removing entire sets of observations from potentially unreliable observers proved least effective. In conclusion, percentile-based outlier removal emerges as a simple and effective method to improve reliability of citizen science phenology observations.
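Percentile-based outlier removal, the simple method that performed well here, amounts to trimming each group of observations to a central percentile band, as in the sketch below; the column names and the 5th-95th percentile band are assumptions.

```python
# Percentile-based outlier removal within species-site groups.
import pandas as pd

def trim_percentiles(df: pd.DataFrame, value_col="bloom_doy",
                     group_cols=("species", "site"), lo=0.05, hi=0.95) -> pd.DataFrame:
    def _trim(g):
        lower, upper = g[value_col].quantile([lo, hi])
        return g[g[value_col].between(lower, upper)]
    return df.groupby(list(group_cols), group_keys=False).apply(_trim)
```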


Subjects
Citizen Science, Climate Change, Community Participation, Humans, Reproducibility of Results, Volunteers
16.
Behav Res Methods ; 52(6): 2489-2505, 2020 12.
Article in English | MEDLINE | ID: mdl-32462604

ABSTRACT

In self-report surveys, it is common that some individuals do not pay enough attention and effort to give valid responses. Our aim was to investigate the extent to which careless and insufficient effort responding contributes to the biasing of data. We performed analyses of dimensionality, internal structure, and data reliability of four personality scales (extroversion, conscientiousness, stability, and dispositional optimism) in two independent samples. In order to identify careless/insufficient effort (C/IE) respondents, we used a factor mixture model (FMM) designed to detect inconsistencies of response to items with different semantic polarity. The FMM identified between 4.4% and 10% of C/IE cases, depending on the scale and the sample examined. In the complete samples, all the theoretical models obtained an unacceptable fit, forcing the rejection of the starting hypothesis and making additional wording factors necessary. In the clean samples, all the theoretical models fitted satisfactorily, and the wording factors practically disappeared. Trait estimates in the clean samples were between 4.5% and 11.8% more accurate than in the complete samples. These results show that a limited amount of C/IE data can lead to a drastic deterioration in the fit of the theoretical model, produce large amounts of spurious variance, raise serious doubts about the dimensionality and internal structure of the data, and reduce the reliability with which the trait scores of all surveyed are estimated. Identifying and filtering C/IE responses is necessary to ensure the validity of research results.


Subjects
Personality, Bias, Humans, Reproducibility of Results, Self Report, Surveys and Questionnaires
17.
Comput Electr Eng ; 87: 106765, 2020 Oct.
Article in English | MEDLINE | ID: mdl-32834174

ABSTRACT

Deep learning applications combined with robotics pose massive challenges that are not addressed by conventional machine learning. The world is currently suffering from the COVID-19 pandemic, and millions of lives are being affected every day, with extremely high death counts. Early detection of the disease would provide an opportunity for proactive treatment to save lives, which is the primary research objective of this study. The proposed prediction model addresses this objective through a stepwise approach of cleaning, feature extraction, and classification. The cleaning process handles missing values and is followed by outlier detection using spline interpolation and entropy-correlation. The cleaned data are then subjected to feature extraction using Principal Component Analysis. A Fitness-Oriented Dragonfly algorithm is introduced to select optimal features, and the resulting feature vector is fed into a Deep Belief Network. The overall accuracy of the proposed scheme was experimentally evaluated against traditional state-of-the-art models. The results highlight the superiority of the proposed model, which was observed to be 6.96% better than Firefly, 6.7% better than Particle Swarm Optimization, 6.96% better than Gray Wolf Optimization, and 7.22% better than the Dragonfly Algorithm.

18.
J Med Internet Res ; 21(1): e10013, 2019 01 08.
Article in English | MEDLINE | ID: mdl-30622098

ABSTRACT

BACKGROUND: Since medical research based on big data has become more common, the community's interest in and effort to analyze large amounts of semistructured or unstructured text data, such as examination reports, have rapidly increased. However, these large-scale text data are often not readily applicable to analysis owing to typographical errors, inconsistencies, or data entry problems. Therefore, an efficient data cleaning process is required to ensure the veracity of such data. OBJECTIVE: In this paper, we proposed an efficient data cleaning process for large-scale medical text data, which employs text clustering methods and a value-converting technique, and evaluated its performance with medical examination text data. METHODS: The proposed data cleaning process consists of text clustering and value-merging. In the text clustering step, we suggested the use of key collision and nearest neighbor methods in a complementary manner. Words (called values) in the same cluster are expected to comprise a correct value and its erroneous representations. In the value-converting step, wrong values in each identified cluster are converted into their correct value. We applied this data cleaning process to 574,266 stool examination reports produced for parasite analysis at Samsung Medical Center from 1995 to 2015. The performance of the proposed process was examined and compared with data cleaning processes based on a single clustering method. We used OpenRefine 2.7, an open-source application that provides various text clustering methods and an efficient user interface for value-converting with common-value suggestion. RESULTS: A total of 1,167,104 words in stool examination reports were surveyed. In the data cleaning process, we discovered 30 correct words and 45 patterns of typographical errors and duplicates. We observed high correction rates for words with typographical errors (98.61%) and typographical error patterns (97.78%). The resulting data accuracy was nearly 100% based on the number of total words. CONCLUSIONS: Our data cleaning process based on the combined use of key collision and nearest neighbor methods provides efficient cleaning of large-scale text data and hence improves data accuracy.
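Key collision is the simpler of the two clustering methods mentioned; the sketch below shows a minimal fingerprint-style implementation in which values sharing a normalized key are merged onto the cluster's most frequent spelling. The example report strings are invented, not taken from the stool-examination data.

```python
# Minimal key-collision ("fingerprint") clustering and value-merging.
import re
from collections import Counter, defaultdict

def fingerprint(value: str) -> str:
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))

def key_collision_merge(values):
    clusters = defaultdict(list)
    for v in values:
        clusters[fingerprint(v)].append(v)
    mapping = {}
    for members in clusters.values():
        canonical = Counter(members).most_common(1)[0][0]   # most frequent spelling wins
        for v in members:
            mapping[v] = canonical
    return mapping

print(key_collision_merge(["No parasite seen", "no parasite  seen.", "No Parasite Seen"]))
```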


Subjects
Biomedical Research/methods, Cluster Analysis, Data Accuracy, Feces/chemistry, Humans
19.
Sensors (Basel) ; 19(9)2019 Apr 27.
Article in English | MEDLINE | ID: mdl-31035612

ABSTRACT

It is an undeniable fact that Internet of Things (IoT) technologies have become a milestone advancement in the digital healthcare domain, since the number of IoT medical devices has grown exponentially and it is anticipated that by 2020 there will be over 161 million of them connected worldwide. In this era of continuous growth, IoT healthcare faces various challenges, such as the collection, the quality estimation, as well as the interpretation and the harmonization of the data that derive from the existing huge amounts of heterogeneous IoT medical devices. Even though various approaches have been developed so far for solving each one of these challenges, none of them proposes a holistic approach for successfully achieving data interoperability between high-quality data that derive from heterogeneous devices. For that reason, a mechanism is proposed in this manuscript for effectively addressing the intersection of these challenges. Through this mechanism, the datasets of the different devices are first collected and then cleaned. The cleaning results are then used to capture the overall data quality of each dataset, in combination with measurements of the availability and reliability of the device that produced it. Consequently, only the high-quality data are kept and translated into a common format, ready for further use. The proposed mechanism is evaluated in a specific scenario, producing reliable results and achieving data interoperability with 100% accuracy and data quality with more than 90% accuracy.


Subjects
Data Accuracy, Delivery of Health Care/methods, Humans, Internet, Physiologic Monitoring/methods
20.
Sensors (Basel) ; 18(3)2018 Mar 09.
Article in English | MEDLINE | ID: mdl-29522456

ABSTRACT

Given the popularization of GPS technologies, the massive amounts of spatiotemporal GPS traces collected by vehicles are becoming a new kind of big data source for urban geographic information extraction. The growing volume of these datasets, however, creates processing and management difficulties, while their low quality generates uncertainties when investigating human activities. Based on the error distribution law and position accuracy of GPS data, we propose in this paper a data cleaning method for this kind of spatial big data using movement consistency. First, a trajectory is partitioned into a set of sub-trajectories using movement characteristic points: GPS points indicating that the motion status of the vehicle has changed from one state to another are regarded as movement characteristic points. Then, GPS data are cleaned based on the similarities of GPS points and the movement consistency model of the sub-trajectory. The movement consistency model is built using the random sample consensus algorithm, exploiting the high spatial consistency of high-quality GPS data. The proposed method is evaluated through extensive experiments, using GPS trajectories generated by a sample of vehicles over a 7-day period in Wuhan city, China. The results show the effectiveness and efficiency of the proposed method.
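Since the movement-consistency model is built with the random sample consensus (RANSAC) algorithm, a minimal illustration is to fit a robust regressor to each sub-trajectory and keep only the inliers, as below. Treating the sub-trajectory as a linear x-to-y relationship and the 20 m residual threshold are simplifying assumptions, not the paper's actual model.

```python
# Keep only GPS points consistent with a RANSAC-fitted model of one sub-trajectory.
import numpy as np
from sklearn.linear_model import RANSACRegressor

def clean_subtrajectory(points: np.ndarray, residual_threshold: float = 20.0) -> np.ndarray:
    """points: (n, 2) array of projected GPS coordinates (metres) for one sub-trajectory."""
    x, y = points[:, [0]], points[:, 1]
    ransac = RANSACRegressor(residual_threshold=residual_threshold, random_state=0)
    ransac.fit(x, y)
    return points[ransac.inlier_mask_]   # points consistent with the movement model
```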
