Your browser doesn't support javascript.
loading
Show: 20 | 50 | 100
Results 1 - 20 de 200
Filter
Add more filters

Publication year range
1.
Cereb Cortex ; 34(13): 72-83, 2024 May 02.
Article in English | MEDLINE | ID: mdl-38696605

ABSTRACT

Autism spectrum disorder has been emerging as a growing public health threat. Early diagnosis of autism spectrum disorder is crucial for timely, effective intervention and treatment. However, conventional diagnosis methods based on communications and behavioral patterns are unreliable for children younger than 2 years of age. Given evidences of neurodevelopmental abnormalities in autism spectrum disorder infants, we resort to a novel deep learning-based method to extract key features from the inherently scarce, class-imbalanced, and heterogeneous structural MR images for early autism diagnosis. Specifically, we propose a Siamese verification framework to extend the scarce data, and an unsupervised compressor to alleviate data imbalance by extracting key features. We also proposed weight constraints to cope with sample heterogeneity by giving different samples different voting weights during validation, and used Path Signature to unravel meaningful developmental features from the two-time point data longitudinally. We further extracted machine learning focused brain regions for autism diagnosis. Extensive experiments have shown that our method performed well under practical scenarios, transcending existing machine learning methods and providing anatomical insights for autism early diagnosis.


Subject(s)
Autism Spectrum Disorder , Brain , Deep Learning , Early Diagnosis , Humans , Autism Spectrum Disorder/diagnostic imaging , Autism Spectrum Disorder/diagnosis , Infant , Brain/diagnostic imaging , Brain/pathology , Magnetic Resonance Imaging/methods , Child, Preschool , Male , Female , Autistic Disorder/diagnosis , Autistic Disorder/diagnostic imaging , Autistic Disorder/pathology , Unsupervised Machine Learning
2.
BMC Med Res Methodol ; 24(1): 145, 2024 Jul 05.
Article in English | MEDLINE | ID: mdl-38970036

ABSTRACT

BACKGROUND: In binary classification for clinical studies, an imbalanced distribution of cases to classes and an extreme association level between the binary dependent variable and a subset of independent variables can create significant classification problems. These crucial issues, namely class imbalance and complete separation, lead to classification inaccuracy and biased results in clinical studies. METHOD: To deal with class imbalance and complete separation problems, we propose using a fuzzy logistic regression framework for binary classification. Fuzzy logistic regression incorporates combinations of triangular fuzzy numbers for the coefficients, inputs, and outputs and produces crisp classification results. The fuzzy logistic regression framework shows strong classification performance due to fuzzy logic's better handling of imbalance and separation issues. Hence, classification accuracy is improved, mitigating the risk of misclassified conditions and biased insights for clinical study patients. RESULTS: The performance of the fuzzy logistic regression model is assessed on twelve binary classification problems with clinical datasets. The model has consistently high sensitivity, specificity, F1, precision, and Mathew's correlation coefficient scores across all clinical datasets. There is no evidence of impact from the imbalance or separation that exists in the datasets. Furthermore, we compare the fuzzy logistic regression classification performance against two versions of classical logistic regression and six different benchmark sources in the literature. These six sources provide a total of ten different proposed methodologies, and the comparison occurs by calculating the same set of classification performance scores for each method. Either imbalance or separation impacts seven out of ten methodologies. The remaining three produce better classification performance in their respective clinical studies. However, these are all outperformed by the fuzzy logistic regression framework. CONCLUSION: Fuzzy logistic regression showcases strong performance against imbalance and separation, providing accurate predictions and, hence, informative insights for classifying patients in clinical studies.


Subject(s)
Fuzzy Logic , Humans , Logistic Models , Algorithms
3.
Article in English | MEDLINE | ID: mdl-38744680

ABSTRACT

BACKGROUND AND AIM: Risk assessment is of paramount importance for the detection and treatment of colorectal cancer. We developed and validated a feature interpretability screening framework to identify high-risk populations and recommend colonoscopy for them. METHODS: We utilized a training cohort consisting of 1 252 605 participants who underwent colonoscopies in Shanghai from 2013 to 2015 to develop the screening framework. We incorporated Shapley additive explanation values into feature selection to provide interpretability for the framework. Two sampling methods were separately employed to mitigate potential model bias caused by class imbalance. Furthermore, we employed various machine learning algorithms to construct risk assessment models and compared their performance. We tested the screening models on an external validation cohort of 359 462 samples and conducted comprehensive evaluation and statistical analysis of the validation results. RESULTS: The external validation results demonstrated that the models in the proposed framework achieved sensitivity over 0.734, specificity over 0.790, and area under the receiver operating characteristic curve ranging from 0.808 to 0.859. In the predictions of the best-performing model, the prevalence rates of colorectal cancer were 0.059% and 1.056% in the low- and high-risk groups, respectively. If colonoscopies were performed only on the high-risk group predicted by the model, only 14.36% of total colonoscopies would be needed to detect 74.86% of colorectal cancer cases. CONCLUSIONS: We developed and validated a novel framework to identify populations at high risk for colorectal cancer. Those classified as high risk should undergo colonoscopy for further diagnosis.

4.
J Biomed Inform ; 155: 104666, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38848886

ABSTRACT

OBJECTIVE: Class imbalance is sometimes considered a problem when developing clinical prediction models and assessing their performance. To address it, correction strategies involving manipulations of the training dataset, such as random undersampling or oversampling, are frequently used. The aim of this article is to illustrate the consequences of these class imbalance correction strategies on clinical prediction models' internal validity in terms of calibration and discrimination performances. METHODS: We used both heuristic intuition and formal mathematical reasoning to characterize the relations between conditional probabilities of interest and probabilities targeted when using random undersampling or oversampling. We propose a plug-in estimator that represents a natural correction for predictions obtained from models that have been trained on artificially balanced datasets ("naïve" models). We conducted a Monte Carlo simulation with two different data generation processes and present a real-world example using data from the International Stroke Trial database to empirically demonstrate the consequences of applying random resampling techniques for class imbalance correction on calibration and discrimination (in terms of Area Under the ROC, AUC) for logistic regression and tree-based prediction models. RESULTS: Across our simulations and in the real-world example, calibration of the naïve models was very poor. The models using the plug-in estimator generally outperformed the models relying on class imbalance correction in terms of calibration while achieving the same discrimination performance. CONCLUSION: Random resampling techniques for class imbalance correction do not generally improve discrimination performance (i.e., AUC), and their use is hard to justify when aiming at providing calibrated predictions. Improper use of such class imbalance correction techniques can lead to suboptimal data usage and less valid risk prediction models.


Subject(s)
Monte Carlo Method , Humans , Calibration , ROC Curve , Models, Statistical , Area Under Curve , Computer Simulation , Logistic Models , Algorithms , Risk Assessment/methods
5.
J Biomed Inform ; 149: 104532, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38070817

ABSTRACT

INTRODUCTION: Risk prediction, including early disease detection, prevention, and intervention, is essential to precision medicine. However, systematic bias in risk estimation caused by heterogeneity across different demographic groups can lead to inappropriate or misinformed treatment decisions. In addition, low incidence (class-imbalance) outcomes negatively impact the classification performance of many standard learning algorithms which further exacerbates the racial disparity issues. Therefore, it is crucial to improve the performance of statistical and machine learning models in underrepresented populations in the presence of heavy class imbalance. METHOD: To address demographic disparity in the presence of class imbalance, we develop a novel framework, Trans-Balance, by leveraging recent advances in imbalance learning, transfer learning, and federated learning. We consider a practical setting where data from multiple sites are stored locally under privacy constraints. RESULTS: We show that the proposed Trans-Balance framework improves upon existing approaches by explicitly accounting for heterogeneity across demographic subgroups and cohorts. We demonstrate the feasibility and validity of our methods through numerical experiments and a real application to a multi-cohort study with data from participants of four large, NIH-funded cohorts for stroke risk prediction. CONCLUSION: Our findings indicate that the Trans-Balance approach significantly improves predictive performance, especially in scenarios marked by severe class imbalance and demographic disparity. Given its versatility and effectiveness, Trans-Balance offers a valuable contribution to enhancing risk prediction in biomedical research and related fields.


Subject(s)
Algorithms , Biomedical Research , Humans , Cohort Studies , Machine Learning , Demography
6.
Proc Natl Acad Sci U S A ; 118(43)2021 10 26.
Article in English | MEDLINE | ID: mdl-34675075

ABSTRACT

In this paper, we introduce the Layer-Peeled Model, a nonconvex, yet analytically tractable, optimization program, in a quest to better understand deep neural networks that are trained for a sufficiently long time. As the name suggests, this model is derived by isolating the topmost layer from the remainder of the neural network, followed by imposing certain constraints separately on the two parts of the network. We demonstrate that the Layer-Peeled Model, albeit simple, inherits many characteristics of well-trained neural networks, thereby offering an effective tool for explaining and predicting common empirical patterns of deep-learning training. First, when working on class-balanced datasets, we prove that any solution to this model forms a simplex equiangular tight frame, which, in part, explains the recently discovered phenomenon of neural collapse [V. Papyan, X. Y. Han, D. L. Donoho, Proc. Natl. Acad. Sci. U.S.A. 117, 24652-24663 (2020)]. More importantly, when moving to the imbalanced case, our analysis of the Layer-Peeled Model reveals a hitherto-unknown phenomenon that we term Minority Collapse, which fundamentally limits the performance of deep-learning models on the minority classes. In addition, we use the Layer-Peeled Model to gain insights into how to mitigate Minority Collapse. Interestingly, this phenomenon is first predicted by the Layer-Peeled Model before being confirmed by our computational experiments.


Subject(s)
Deep Learning , Neural Networks, Computer , Computer Heuristics , Databases, Factual/statistics & numerical data , Humans , Nonlinear Dynamics , Stochastic Processes
7.
BMC Med Inform Decis Mak ; 24(1): 90, 2024 Mar 28.
Article in English | MEDLINE | ID: mdl-38549123

ABSTRACT

Class imbalance remains a large problem in high-throughput omics analyses, causing bias towards the over-represented class when training machine learning-based classifiers. Oversampling is a common method used to balance classes, allowing for better generalization of the training data. More naive approaches can introduce other biases into the data, being especially sensitive to inaccuracies in the training data, a problem considering the characteristically noisy data obtained in healthcare. This is especially a problem with high-dimensional data. A generative adversarial network-based method is proposed for creating synthetic samples from small, high-dimensional data, to improve upon other more naive generative approaches. The method was compared with 'synthetic minority over-sampling technique' (SMOTE) and 'random oversampling' (RO). Generative methods were validated by training classifiers on the balanced data.


Subject(s)
Machine Learning , Bias
8.
Sensors (Basel) ; 24(4)2024 Feb 12.
Article in English | MEDLINE | ID: mdl-38400357

ABSTRACT

Parkinson's disease (PD) is the second most prevalent dementia in the world. Wearable technology has been useful in the computer-aided diagnosis and long-term monitoring of PD in recent years. The fundamental issue remains how to assess the severity of PD using wearable devices in an efficient and accurate manner. However, in the real-world free-living environment, there are two difficult issues, poor annotation and class imbalance, both of which could potentially impede the automatic assessment of PD. To address these challenges, we propose a novel framework for assessing the severity of PD patient's in a free-living environment. Specifically, we use clustering methods to learn latent categories from the same activities, while latent Dirichlet allocation (LDA) topic models are utilized to capture latent features from multiple activities. Then, to mitigate the impact of data imbalance, we augment bag-level data while retaining key instance prototypes. To comprehensively demonstrate the efficacy of our proposed framework, we collected a dataset containing wearable-sensor signals from 83 individuals in real-life free-living conditions. The experimental results show that our framework achieves an astounding 73.48% accuracy in the fine-grained (normal, mild, moderate, severe) classification of PD severity based on hand movements. Overall, this study contributes to more accurate PD self-diagnosis in the wild, allowing doctors to provide remote drug intervention guidance.


Subject(s)
Parkinson Disease , Wearable Electronic Devices , Humans , Parkinson Disease/diagnosis , Movement , Severity of Illness Index , Upper Extremity
9.
Sensors (Basel) ; 24(12)2024 Jun 16.
Article in English | MEDLINE | ID: mdl-38931682

ABSTRACT

Monitoring activities of daily living (ADLs) plays an important role in measuring and responding to a person's ability to manage their basic physical needs. Effective recognition systems for monitoring ADLs must successfully recognize naturalistic activities that also realistically occur at infrequent intervals. However, existing systems primarily focus on either recognizing more separable, controlled activity types or are trained on balanced datasets where activities occur more frequently. In our work, we investigate the challenges associated with applying machine learning to an imbalanced dataset collected from a fully in-the-wild environment. This analysis shows that the combination of preprocessing techniques to increase recall and postprocessing techniques to increase precision can result in more desirable models for tasks such as ADL monitoring. In a user-independent evaluation using in-the-wild data, these techniques resulted in a model that achieved an event-based F1-score of over 0.9 for brushing teeth, combing hair, walking, and washing hands. This work tackles fundamental challenges in machine learning that will need to be addressed in order for these systems to be deployed and reliably work in the real world.


Subject(s)
Activities of Daily Living , Human Activities , Machine Learning , Humans , Algorithms , Walking/physiology , Pattern Recognition, Automated/methods
10.
Int J Mol Sci ; 25(4)2024 Feb 09.
Article in English | MEDLINE | ID: mdl-38396779

ABSTRACT

Cancer is a leading cause of death globally. The majority of cancer cases are only diagnosed in the late stages of cancer due to the use of conventional methods. This reduces the chance of survival for cancer patients. Therefore, early detection consequently followed by early diagnoses are important tasks in cancer research. Gene expression microarray technology has been applied to detect and diagnose most types of cancers in their early stages and has gained encouraging results. In this paper, we address the problem of classifying cancer based on gene expression for handling the class imbalance problem and the curse of dimensionality. The oversampling technique is utilized to overcome this problem by adding synthetic samples. Another common issue related to the gene expression dataset addressed in this paper is the curse of dimensionality. This problem is addressed by applying chi-square and information gain feature selection techniques. After applying these techniques individually, we proposed a method to select the most significant genes by combining those two techniques (CHiS and IG). We investigated the effect of these techniques individually and in combination. Four benchmarking biomedical datasets (Leukemia-subtypes, Leukemia-ALLAML, Colon, and CuMiDa) were used. The experimental results reveal that the oversampling techniques improve the results in most cases. Additionally, the performance of the proposed feature selection technique outperforms individual techniques in nearly all cases. In addition, this study provides an empirical study for evaluating several oversampling techniques along with ensemble-based learning. The experimental results also reveal that SVM-SMOTE, along with the random forests classifier, achieved the highest results, with a reporting accuracy of 100%. The obtained results surpass the findings in the existing literature as well.


Subject(s)
Leukemia , Neoplasms , Humans , Neoplasms/genetics , Leukemia/genetics , Gene Expression
11.
BMC Oral Health ; 24(1): 161, 2024 Feb 01.
Article in English | MEDLINE | ID: mdl-38302981

ABSTRACT

BACKGROUND: Oral potentially malignant disorders (OPMDs) are associated with an increased risk of cancer of the oral cavity including the tongue. The early detection of oral cavity cancers and OPMDs is critical for reducing cancer-specific morbidity and mortality. Recently, there have been studies to apply the rapidly advancing technology of deep learning for diagnosing oral cavity cancer and OPMDs. However, several challenging issues such as class imbalance must be resolved to effectively train a deep learning model for medical imaging classification tasks. The aim of this study is to evaluate a new technique of artificial intelligence to improve the classification performance in an imbalanced tongue lesion dataset. METHODS: A total of 1,810 tongue images were used for the classification. The class-imbalanced dataset consisted of 372 instances of cancer, 141 instances of OPMDs, and 1,297 instances of noncancerous lesions. The EfficientNet model was used as the feature extraction model for classification. Mosaic data augmentation, soft labeling, and curriculum learning (CL) were employed to improve the classification performance of the convolutional neural network. RESULTS: Utilizing a mosaic-augmented dataset in conjunction with CL, the final model achieved an accuracy rate of 0.9444, surpassing conventional oversampling and weight balancing methods. The relative precision improvement rate for the minority class OPMD was 21.2%, while the relative [Formula: see text] score improvement rate of OPMD was 4.9%. CONCLUSIONS: The present study demonstrates that the integration of mosaic-based soft labeling and curriculum learning improves the classification performance of tongue lesions compared to previous methods, establishing a foundation for future research on effectively learning from imbalanced data.


Subject(s)
Deep Learning , Mouth Neoplasms , Humans , Artificial Intelligence , Curriculum , Tongue
12.
BMC Bioinformatics ; 24(1): 38, 2023 Feb 03.
Article in English | MEDLINE | ID: mdl-36737694

ABSTRACT

BACKGROUND: The experimental verification of a drug discovery process is expensive and time-consuming. Therefore, efficiently and effectively identifying drug-target interactions (DTIs) has been the focus of research. At present, many machine learning algorithms are used for predicting DTIs. The key idea is to train the classifier using an existing DTI to predict a new or unknown DTI. However, there are various challenges, such as class imbalance and the parameter optimization of many classifiers, that need to be solved before an optimal DTI model is developed. METHODS: In this study, we propose a framework called SSELM-neg for DTI prediction, in which we use a screening approach to choose high-quality negative samples and a spherical search approach to optimize the parameters of the extreme learning machine. RESULTS: The results demonstrated that the proposed technique outperformed other state-of-the-art methods in 10-fold cross-validation experiments in terms of the area under the receiver operating characteristic curve (0.986, 0.993, 0.988, and 0.969) and AUPR (0.982, 0.991, 0.982, and 0.946) for the enzyme dataset, G-protein coupled receptor dataset, ion channel dataset, and nuclear receptor dataset, respectively. CONCLUSION: The screening approach produced high-quality negative samples with the same number of positive samples, which solved the class imbalance problem. We optimized an extreme learning machine using a spherical search approach to identify DTIs. Therefore, our models performed better than other state-of-the-art methods.


Subject(s)
Drug Development , Drug Discovery , Drug Discovery/methods , Machine Learning , Algorithms , Drug Interactions
13.
Neuroimage ; 277: 120253, 2023 08 15.
Article in English | MEDLINE | ID: mdl-37385392

ABSTRACT

Machine learning (ML) is increasingly used in cognitive, computational and clinical neuroscience. The reliable and efficient application of ML requires a sound understanding of its subtleties and limitations. Training ML models on datasets with imbalanced classes is a particularly common problem, and it can have severe consequences if not adequately addressed. With the neuroscience ML user in mind, this paper provides a didactic assessment of the class imbalance problem and illustrates its impact through systematic manipulation of data imbalance ratios in (i) simulated data and (ii) brain data recorded with electroencephalography (EEG), magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI). Our results illustrate how the widely-used Accuracy (Acc) metric, which measures the overall proportion of successful predictions, yields misleadingly high performances, as class imbalance increases. Because Acc weights the per-class ratios of correct predictions proportionally to class size, it largely disregards the performance on the minority class. A binary classification model that learns to systematically vote for the majority class will yield an artificially high decoding accuracy that directly reflects the imbalance between the two classes, rather than any genuine generalizable ability to discriminate between them. We show that other evaluation metrics such as the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC), and the less common Balanced Accuracy (BAcc) metric - defined as the arithmetic mean between sensitivity and specificity, provide more reliable performance evaluations for imbalanced data. Our findings also highlight the robustness of Random Forest (RF), and the benefits of using stratified cross-validation and hyperprameter optimization to tackle data imbalance. Critically, for neuroscience ML applications that seek to minimize overall classification error, we recommend the routine use of BAcc, which in the specific case of balanced data is equivalent to using standard Acc, and readily extends to multi-class settings. Importantly, we present a list of recommendations for dealing with imbalanced data, as well as open-source code to allow the neuroscience community to replicate and extend our observations and explore alternative approaches to coping with imbalanced data.


Subject(s)
Benchmarking , Brain , Humans , Magnetoencephalography , Machine Learning , Electroencephalography , Algorithms
14.
Environ Sci Technol ; 57(49): 20636-20646, 2023 Dec 12.
Article in English | MEDLINE | ID: mdl-38011382

ABSTRACT

Cyanobacterial harmful algal blooms (CyanoHABs) pose serious risks to inland water resources. Despite advancements in our understanding of associated environmental factors and modeling efforts, predicting CyanoHABs remains challenging. Leveraging an integrated water quality data collection effort in Iowa lakes, this study aimed to identify factors associated with hazardous microcystin levels and develop one-week-ahead predictive classification models. Using water samples from 38 Iowa lakes collected between 2018 and 2021, feature selection was conducted considering both linear and nonlinear properties. Subsequently, we developed three model types (Neural Network, XGBoost, and Logistic Regression) with different sampling strategies using the nine selected variables (mcyA_M, TKN, % hay/pasture, pH, mcyA_M:16S, % developed, DOC, dewpoint temperature, and ortho-P). Evaluation metrics demonstrated the strong performance of the Neural Network with oversampling (ROC-AUC 0.940, accuracy 0.861, sensitivity 0.857, specificity 0.857, LR+ 5.993, and 1/LR- 5.993), as well as the XGBoost with downsampling (ROC-AUC 0.944, accuracy 0.831, sensitivity 0.928, specificity 0.833, LR+ 5.557, and 1/LR- 11.569). This study exhibited the intricacies of modeling with limited data and class imbalances, underscoring the importance of continuous monitoring and data collection to improve predictive accuracy. Also, the methodologies employed can serve as meaningful references for researchers tackling similar challenges in diverse environments.


Subject(s)
Cyanobacteria , Harmful Algal Bloom , Lakes/microbiology , Iowa
15.
Lang Resour Eval ; 57(4): 1515-1546, 2023.
Article in English | MEDLINE | ID: mdl-38021031

ABSTRACT

The goal of hate speech detection is to filter negative online content aiming at certain groups of people. Due to the easy accessibility and multilinguality of social media platforms, it is crucial to protect everyone which requires building hate speech detection systems for a wide range of languages. However, the available labeled hate speech datasets are limited, making it difficult to build systems for many languages. In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages, while highlighting label issues across application scenarios, such as inconsistent label sets of corpora or differing hate speech definitions, which hinder the application of such methods. We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language, which lacks labeled examples, and show that good performance can be achieved. We then incorporate unlabeled target language data for further model improvements by bootstrapping labels using an ensemble of different model architectures. Furthermore, we investigate the issue of label imbalance in hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance. We test simple data undersampling and oversampling techniques and show their effectiveness.

16.
BMC Public Health ; 23(1): 2164, 2023 11 06.
Article in English | MEDLINE | ID: mdl-37932692

ABSTRACT

BACKGROUND: Since the inconspicuous nature of early signs associated with Chronic Obstructive Pulmonary Disease (COPD), individuals often remain unidentified, leading to suboptimal opportunities for timely prevention and treatment. The purpose of this study was to create an explainable artificial intelligence framework combining data preprocessing methods, machine learning methods, and model interpretability methods to identify people at high risk of COPD in the smoking population and to provide a reasonable interpretation of model predictions. METHODS: The data comprised questionnaire information, physical examination data and results of pulmonary function tests before and after bronchodilatation. First, the factorial analysis for mixed data (FAMD), Boruta and NRSBoundary-SMOTE resampling methods were used to solve the missing data, high dimensionality and category imbalance problems. Then, seven classification models (CatBoost, NGBoost, XGBoost, LightGBM, random forest, SVM and logistic regression) were applied to model the risk level, and the best machine learning (ML) model's decisions were explained using the Shapley additive explanations (SHAP) method and partial dependence plot (PDP). RESULTS: In the smoking population, age and 14 other variables were significant factors for predicting COPD. The CatBoost, random forest, and logistic regression models performed reasonably well in unbalanced datasets. CatBoost with NRSBoundary-SMOTE had the best classification performance in balanced datasets when composite indicators (the AUC, F1-score, and G-mean) were used as model comparison criteria. Age, COPD Assessment Test (CAT) score, gross annual income, body mass index (BMI), systolic blood pressure (SBP), diastolic blood pressure (DBP), anhelation, respiratory disease, central obesity, use of polluting fuel for household heating, region, use of polluting fuel for household cooking, and wheezing were important factors for predicting COPD in the smoking population. CONCLUSION: This study combined feature screening methods, unbalanced data processing methods, and advanced machine learning methods to enable early identification of COPD risk groups in the smoking population. COPD risk factors in the smoking population were identified using SHAP and PDP, with the goal of providing theoretical support for targeted screening strategies and smoking population self-management strategies.


Subject(s)
Pulmonary Disease, Chronic Obstructive , Smokers , Humans , Adolescent , Artificial Intelligence , Tobacco Smoking , Smoking
17.
Sensors (Basel) ; 23(3)2023 Jan 19.
Article in English | MEDLINE | ID: mdl-36772192

ABSTRACT

Due to the distributed data collection and learning in federated learnings, many clients conduct local training with non-independent and identically distributed (non-IID) datasets. Accordingly, the training from these datasets results in severe performance degradation. We propose an efficient algorithm for enhancing the performance of federated learning by overcoming the negative effects of non-IID datasets. First, the intra-client class imbalance is reduced by rendering the class distribution of clients close to Uniform distribution. Second, the clients to participate in federated learning are selected to make their integrated class distribution close to Uniform distribution for the purpose of mitigating the inter-client class imbalance, which represents the class distribution difference among clients. In addition, the amount of local training data for the selected clients is finely adjusted. Finally, in order to increase the efficiency of federated learning, the batch size and the learning rate of local training for the selected clients are dynamically controlled reflecting the effective size of the local dataset for each client. In the performance evaluation on CIFAR-10 and MNIST datasets, the proposed algorithm achieves 20% higher accuracy than existing federated learning algorithms. Moreover, in achieving this huge accuracy improvement, the proposed algorithm uses less computation and communication resources compared to existing algorithms in terms of the amount of data used and the number of clients joined in the training.

18.
Sensors (Basel) ; 23(4)2023 Feb 07.
Article in English | MEDLINE | ID: mdl-36850460

ABSTRACT

Surface defect identification based on computer vision algorithms often leads to inadequate generalization ability due to large intraclass variation. Diversity in lighting conditions, noise components, defect size, shape, and position make the problem challenging. To solve the problem, this paper develops a pixel-level image augmentation method that is based on image-to-image translation with generative adversarial neural networks (GANs) conditioned on fine-grained labels. The GAN model proposed in this work, referred to as Magna-Defect-GAN, is capable of taking control of the image generation process and producing image samples that are highly realistic in terms of variations. Firstly, the surface defect dataset based on the magnetic particle inspection (MPI) method is acquired in a controlled environment. Then, the Magna-Defect-GAN model is trained, and new synthetic image samples with large intraclass variations are generated. These synthetic image samples artificially inflate the training dataset size in terms of intraclass diversity. Finally, the enlarged dataset is used to train a defect identification model. Experimental results demonstrate that the Magna-Defect-GAN model can generate realistic and high-resolution surface defect images up to the resolution of 512 × 512 in a controlled manner. We also show that this augmentation method can boost accuracy and be easily adapted to any other surface defect identification models.

19.
Sensors (Basel) ; 23(1)2023 Jan 03.
Article in English | MEDLINE | ID: mdl-36617148

ABSTRACT

In modern networks, a Network Intrusion Detection System (NIDS) is a critical security device for detecting unauthorized activity. The categorization effectiveness for minority classes is limited by the imbalanced class issues connected with the dataset. We propose an Imbalanced Generative Adversarial Network (IGAN) to address the problem of class imbalance by increasing the detection rate of minority classes while maintaining efficiency. To limit the effect of the minimum or maximum value on the overall features, the original data was normalized and one-hot encoded using data preprocessing. To address the issue of the low detection rate of minority attacks caused by the imbalance in the training data, we enrich the minority samples with IGAN. The ensemble of Lenet 5 and Long Short Term Memory (LSTM) is used to classify occurrences that are considered abnormal into various attack categories. The investigational findings demonstrate that the proposed approach outperforms the other deep learning approaches, achieving the best accuracy, precision, recall, TPR, FPR, and F1-score. The findings indicate that IGAN oversampling can enhance the detection rate of minority samples, hence improving overall accuracy. According to the data, the recommended technique valued performance measures far more than alternative approaches. The proposed method is found to achieve above 98% accuracy and classifies various attacks significantly well as compared to other classifiers.


Subject(s)
Memory, Long-Term
20.
Molecules ; 28(4)2023 Feb 09.
Article in English | MEDLINE | ID: mdl-36838652

ABSTRACT

The prediction of drug-target interactions (DTIs) is a vital step in drug discovery. The success of machine learning and deep learning methods in accurately predicting DTIs plays a huge role in drug discovery. However, when dealing with learning algorithms, the datasets used are usually highly dimensional and extremely imbalanced. To solve this issue, the dataset must be resampled accordingly. In this paper, we have compared several data resampling techniques to overcome class imbalance in machine learning methods as well as to study the effectiveness of deep learning methods in overcoming class imbalance in DTI prediction in terms of binary classification using ten (10) cancer-related activity classes from BindingDB. It is found that the use of Random Undersampling (RUS) in predicting DTIs severely affects the performance of a model, especially when the dataset is highly imbalanced, thus, rendering RUS unreliable. It is also found that SVM-SMOTE can be used as a go-to resampling method when paired with the Random Forest and Gaussian Naïve Bayes classifiers, whereby a high F1 score is recorded for all activity classes that are severely and moderately imbalanced. Additionally, the deep learning method called Multilayer Perceptron recorded high F1 scores for all activity classes even when no resampling method was applied.


Subject(s)
Deep Learning , Bayes Theorem , Machine Learning , Neural Networks, Computer , Algorithms
SELECTION OF CITATIONS
SEARCH DETAIL