ABSTRACT
Standardizing clinical laboratory test results is critical for clinical data science research and analysis. However, standardized data processing tools and guidelines are inadequate. This paper proposes a novel approach for standardizing categorical test results based on supervised machine learning and the Jaro-Winkler similarity algorithm. A supervised machine learning model categorizes the test results into predefined groups or clusters in a scalable way, while Jaro-Winkler similarity maps text terms to standard clinical terms within the corresponding groups. The proposed method is applied to 75,062 test results from two private hospitals in Bangladesh. The Support Vector Classification algorithm with a linear kernel achieves a classification accuracy of 98% when categorizing test results, which is better than the Random Forest algorithm. The experimental results show that, with manual validation, Jaro-Winkler similarity achieves a 99.93% success rate in test result standardization for the majority of groups. The proposed method outperforms previous studies that concentrated on standardizing test results using rule-based classifiers on a smaller number of groups and similarity measures such as Cosine similarity or Levenshtein distance. Furthermore, our approach also performs well when applied to the publicly available MIMIC-III dataset. These findings show that the proposed standardization technique can benefit clinical big data research, particularly national clinical research data hubs in low- and middle-income countries.
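The term-mapping step described above can be illustrated with a minimal sketch. It assumes a test result has already been assigned to a group by the classifier, and it only shows how Jaro-Winkler similarity could pick the closest standard term inside that group. The function names, the 0.85 threshold, and the example terms are illustrative assumptions, not details taken from the paper.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    m1 = [c for c, u in zip(s1, used1) if u]
    m2 = [c for c, u in zip(s2, used2) if u]
    transpositions = sum(a != b for a, b in zip(m1, m2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1.0 - base)


def standardize(term, standard_terms, threshold=0.85):
    """Return the closest standard term, or None if nothing is close enough.

    `threshold` is an assumed cut-off chosen for illustration only.
    """
    cleaned = term.strip().lower()
    best = max(standard_terms, key=lambda s: jaro_winkler(cleaned, s.lower()))
    if jaro_winkler(cleaned, best.lower()) >= threshold:
        return best
    return None


# Hypothetical standard terms for one group of qualitative results.
group_terms = ["positive", "negative"]
print(standardize("Postive", group_terms))
print(standardize("negetive ", group_terms))
```

Restricting the candidate list to a single group keeps the comparison set small, which is one plausible reason for classifying results before matching them.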
ABSTRACT
BACKGROUND AND OBJECTIVE: Diabetes is a disease of impaired blood glucose regulation caused by absent or insufficient secretion of the hormone insulin, or by insulin resistance. In the literature, the impact of exercise is considered in only a few models, which are based on a minimal representation of glucose dynamics and assume that the body produces no endogenous insulin. Hence, these models cannot describe diabetic behavior that is independent of exogenous insulin. This type of diabetes, known as type 2, affects almost 90% of the total diabetes population. This article aims to build a constraint-based, comprehensive physiological model of blood glucose dynamics to fill this gap in the literature. METHODS: For physiological comprehensiveness, the model consists of several compartments, each separately connected to a common compartment named 'plasma'. Plasma is the only accessible compartment and contains the state variables. Plasma variables are the integrated result of the net change in the rates of metabolic processes, and basal rates are influenced between two saturation constraints over the operating range of each plasma variable. The influence of a plasma variable on a metabolic rate is represented using a form of the hyperbolic tangent function. Validation is done by fitting the model to clinical experiments and to continuous glucose monitoring data from a free-living environment. RESULTS: The proposed model yields an average correlation coefficient of 0.85 ± 0.13 between all simulated responses and their targets in the fitting experiments. In addition, the model can produce a spectrum of metabolic effects of plasma variables, giving more insight into glucose metabolism. CONCLUSIONS: A constraint-based comprehensive glucose regulation model with exercise dynamics for modeling diabetes is pursued. The model does not consider the age, gender, or physical and mental condition of the human body, but it can be applied in operations research through mathematical programming.
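The abstract does not give the exact functional form, so the following is only a sketch of how a hyperbolic tangent could keep a metabolic rate between two saturation constraints while returning the basal rate at the basal value of the plasma variable. The parameterization, the function name, and all numbers are assumptions made for illustration.

```python
import math


def metabolic_rate(x, x_basal, basal_rate, lower, upper, gain):
    """Saturating rate: equals basal_rate at x_basal, bounded by (lower, upper).

    Assumed form: rate = c + a * tanh(gain * (x - x_basal) + b), where c and a
    are the midpoint and half-width of the allowed range, and b shifts the
    curve so that the basal point is reproduced exactly.
    """
    if not lower < basal_rate < upper:
        raise ValueError("basal_rate must lie strictly between lower and upper")
    c = (upper + lower) / 2.0
    a = (upper - lower) / 2.0
    b = math.atanh((basal_rate - c) / a)
    return c + a * math.tanh(gain * (x - x_basal) + b)


# Hypothetical numbers: a rate with basal value 1.0 that saturates at 0.2 and 3.0
# as a plasma variable moves away from its basal level of 90.
for level in (40, 90, 140, 300):
    print(level, round(metabolic_rate(level, 90, 1.0, 0.2, 3.0, 0.05), 3))
```

With a positive `gain` the rate rises with the plasma variable; a negative `gain` would model an inhibitory influence with the same bounds.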
Subjects
Diabetes Mellitus, Type 2; Diabetes Mellitus; Blood Glucose; Blood Glucose Self-Monitoring; Exercise; Humans; Insulin
ABSTRACT
In data analytics, missing data is a factor that degrades performance. Incorrect imputation of missing values can lead to wrong predictions. In this era of big data, when a massive volume of data is generated every second and the utilization of these data is a major concern for stakeholders, handling missing values efficiently becomes more important. In this paper, we propose a new technique for missing data imputation, which is a hybrid of single and multiple imputation techniques. We propose an extension of the popular Multivariate Imputation by Chained Equations (MICE) algorithm in two variations to impute categorical and numeric data. We have also implemented twelve existing algorithms to impute binary, ordinal, and numeric missing values. We collected sixty-five thousand real health records from different hospitals and diagnostic centers in Bangladesh while maintaining the privacy of the data. We also collected three public datasets from the UCI Machine Learning Repository, ETH Zurich, and Kaggle. We compared the performance of our proposed algorithms with existing algorithms using these datasets. Experimental results show that our proposed algorithm achieves a 20% higher F-measure for binary data imputation and 11% less error for numeric data imputation than its competitors, with similar execution time.
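As a rough illustration of the chained-equations idea that MICE builds on, the sketch below imputes two numeric columns by repeatedly regressing each column on the other. It is a simplified, deterministic, single-chain version written for this example; it is not the hybrid extension proposed in the paper, and standard MICE additionally draws random imputations and produces several completed datasets. All function names and the toy data are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return mean_y, 0.0  # constant predictor: fall back to the mean
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def chained_impute(rows, n_iter=50):
    """Impute None entries in two-column numeric rows by chained regressions."""
    data = [list(r) for r in rows]
    missing = [[v is None for v in r] for r in rows]
    # Step 1: start from a simple single imputation (observed column means).
    for col in (0, 1):
        observed = [r[col] for r in rows if r[col] is not None]
        mean = sum(observed) / len(observed)
        for r in data:
            if r[col] is None:
                r[col] = mean
    # Step 2: cycle through the columns, re-predicting each missing value
    # from the current values of the other column.
    for _ in range(n_iter):
        for target in (0, 1):
            pred = 1 - target
            train = [(r[pred], r[target])
                     for r, m in zip(data, missing) if not m[target]]
            intercept, slope = fit_line([p for p, _ in train],
                                        [t for _, t in train])
            for r, m in zip(data, missing):
                if m[target]:
                    r[target] = intercept + slope * r[pred]
    return data


# Toy data following y = 2x + 1, with one value missing in each column.
rows = [(1, 3), (2, 5), (3, 7), (4, None), (None, 11)]
for row in chained_impute(rows):
    print([round(v, 2) for v in row])
```

Observed values are never overwritten; only the entries that were originally missing are updated on each pass.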