Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 174
Filtrar
1.
Brief Bioinform ; 24(4)2023 07 20.
Artigo em Inglês | MEDLINE | ID: mdl-37225428

RESUMO

The prediction of drug-drug interactions (DDIs) is essential for the development and repositioning of new drugs. Meanwhile, they play a vital role in the fields of biopharmaceuticals, disease diagnosis and pharmacological treatment. This article proposes a new method called DBGRU-SE for predicting DDIs. Firstly, FP3 fingerprints, MACCS fingerprints, Pubchem fingerprints and 1D and 2D molecular descriptors are used to extract the feature information of the drugs. Secondly, Group Lasso is used to remove redundant features. Then, SMOTE-ENN is applied to balance the data to obtain the best feature vectors. Finally, the best feature vectors are fed into the classifier combining BiGRU and squeeze-and-excitation (SE) attention mechanisms to predict DDIs. After applying five-fold cross-validation, The ACC values of DBGRU-SE model on the two datasets are 97.51 and 94.98%, and the AUC are 99.60 and 98.85%, respectively. The results showed that DBGRU-SE had good predictive performance for drug-drug interactions.


Assuntos
Biologia Computacional , Interações Medicamentosas , Biologia Computacional/métodos
2.
BMC Bioinformatics ; 25(1): 18, 2024 Jan 11.
Artigo em Inglês | MEDLINE | ID: mdl-38212697

RESUMO

BACKGROUND: Metabolic syndrome (MetS) is a cluster of metabolic abnormalities (including obesity, insulin resistance, hypertension, and dyslipidemia), which can be used to identify at-risk populations for diabetes and cardiovascular diseases, the main causes of morbidity and mortality worldwide. The achievement of a simple approach for diagnosing MetS without needing biochemical tests is so valuable. The present study aimed to predict MetS using non-invasive features based on a successful random forest learning algorithm. Also, to deal with the problem of data imbalance that naturally exists in this type of data, the effect of two different data balancing approaches, including the Synthetic Minority Over-sampling Technique (SMOTE) and Random Splitting data balancing (SplitBal), on model performance is investigated. RESULTS: The most important determinant for MetS prediction was waist circumference. Applying a random forest learning algorithm to imbalanced data, the trained models reach 86.9% and 79.4% accuracies and 37.1% and 38.2% sensitivities in men and women, respectively. However, by applying the SplitBal data balancing technique, the best results were obtained, and despite that the accuracy of the trained models decreased by 7.8% and 11.3%, but their sensitivity improved significantly to 82.3% and 73.7% in men and women, respectively. CONCLUSIONS: The random forest learning method, along with data balancing techniques, especially SplitBal, could create MetS prediction models with promising results that can be applied as a useful prognostic tool in health screening programs.


Assuntos
Resistência à Insulina , Síndrome Metabólica , Masculino , Humanos , Feminino , Síndrome Metabólica/diagnóstico , Algoritmo Florestas Aleatórias , Fatores de Risco , Obesidade
3.
BMC Med Res Methodol ; 24(1): 123, 2024 Jun 03.
Artigo em Inglês | MEDLINE | ID: mdl-38831346

RESUMO

In contemporary society, depression has emerged as a prominent mental disorder that exhibits exponential growth and exerts a substantial influence on premature mortality. Although numerous research applied machine learning methods to forecast signs of depression. Nevertheless, only a limited number of research have taken into account the severity level as a multiclass variable. Besides, maintaining the equality of data distribution among all the classes rarely happens in practical communities. So, the inevitable class imbalance for multiple variables is considered a substantial challenge in this domain. Furthermore, this research emphasizes the significance of addressing class imbalance issues in the context of multiple classes. We introduced a new approach Feature group partitioning (FGP) in the data preprocessing phase which effectively reduces the dimensionality of features to a minimum. This study utilized synthetic oversampling techniques, specifically Synthetic Minority Over-sampling Technique (SMOTE) and Adaptive Synthetic (ADASYN), for class balancing. The dataset used in this research was collected from university students by administering the Burn Depression Checklist (BDC). For methodological modifications, we implemented heterogeneous ensemble learning stacking, homogeneous ensemble bagging, and five distinct supervised machine learning algorithms. The issue of overfitting was mitigated by evaluating the accuracy of the training, validation, and testing datasets. To justify the effectiveness of the prediction models, balanced accuracy, sensitivity, specificity, precision, and f1-score indices are used. Overall, comprehensive analysis demonstrates the discrimination between the Conventional Depression Screening (CDS) and FGP approach. In summary, the results show that the stacking classifier for FGP with SMOTE approach yields the highest balanced accuracy, with a rate of 92.81%. The empirical evidence has demonstrated that the FGP approach, when combined with the SMOTE, able to produce better performance in predicting the severity of depression. Most importantly the optimization of the training time of the FGP approach for all of the classifiers is a significant achievement of this research.


Assuntos
Algoritmos , Depressão , Aprendizado de Máquina , Humanos , Depressão/diagnóstico , Índice de Gravidade de Doença , Sensibilidade e Especificidade , Feminino
4.
Network ; : 1-22, 2024 Jun 11.
Artigo em Inglês | MEDLINE | ID: mdl-38860469

RESUMO

Railway Point Machine (RPM) is a fundamental component of railway infrastructure and plays a crucial role in ensuring the safe operation of trains. Its primary function is to divert trains from one track to another, enabling connections between different lines and facilitating route selection. By judiciously deploying turnouts, railway systems can provide efficient transportation services while ensuring the safety of passengers and cargo. As signal processing technologies develop rapidly, taking the easy acquisition advantages of audio signals, a fault diagnosis method for RPMs is proposed by considering noise and multi-channel signals. The proposed method consists of several stages. Initially, the signal is subjected to pre-processing steps, including cropping and channel separation. Subsequently, the signal undergoes noise addition using the Random Length and Dynamic Position Noises Superposition (RDS) module, followed by conversion to a greyscale image. To enhance the data, Synthetic Minority Oversampling Technique (SMOTE) module is applied. Finally, the training data is fed into a Dual-input Attention Convolutional Neural Network (DIACNN). By employing various experimental techniques and designing diverse datasets, our proposed method demonstrates excellent robustness and achieves an outstanding classification accuracy of 99.73%.

5.
BMC Med Imaging ; 24(1): 176, 2024 Jul 19.
Artigo em Inglês | MEDLINE | ID: mdl-39030496

RESUMO

Medical imaging stands as a critical component in diagnosing various diseases, where traditional methods often rely on manual interpretation and conventional machine learning techniques. These approaches, while effective, come with inherent limitations such as subjectivity in interpretation and constraints in handling complex image features. This research paper proposes an integrated deep learning approach utilizing pre-trained models-VGG16, ResNet50, and InceptionV3-combined within a unified framework to improve diagnostic accuracy in medical imaging. The method focuses on lung cancer detection using images resized and converted to a uniform format to optimize performance and ensure consistency across datasets. Our proposed model leverages the strengths of each pre-trained network, achieving a high degree of feature extraction and robustness by freezing the early convolutional layers and fine-tuning the deeper layers. Additionally, techniques like SMOTE and Gaussian Blur are applied to address class imbalance, enhancing model training on underrepresented classes. The model's performance was validated on the IQ-OTH/NCCD lung cancer dataset, which was collected from the Iraq-Oncology Teaching Hospital/National Center for Cancer Diseases over a period of three months in fall 2019. The proposed model achieved an accuracy of 98.18%, with precision and recall rates notably high across all classes. This improvement highlights the potential of integrated deep learning systems in medical diagnostics, providing a more accurate, reliable, and efficient means of disease detection.


Assuntos
Aprendizado Profundo , Neoplasias Pulmonares , Humanos , Neoplasias Pulmonares/diagnóstico por imagem , Tomografia Computadorizada por Raios X/métodos , Redes Neurais de Computação
6.
BMC Med Inform Decis Mak ; 24(1): 172, 2024 Jun 19.
Artigo em Inglês | MEDLINE | ID: mdl-38898499

RESUMO

Hematoma expansion (HE) is a high risky symptom with high rate of occurrence for patients who have undergone spontaneous intracerebral hemorrhage (ICH) after a major accident or illness. Correct prediction of the occurrence of HE in advance is critical to help the doctors to determine the next step medical treatment. Most existing studies focus only on the occurrence of HE within 6 h after the occurrence of ICH, while in reality a considerable number of patients have HE after the first 6 h but within 24 h. In this study, based on the medical doctors recommendation, we focus on prediction of the occurrence of HE within 24 h, as well as the occurrence of HE every 6 h within 24 h. Based on the demographics and computer tomography (CT) image extraction information, we used the XGBoost method to predict the occurrence of HE within 24 h. In this study, to solve the issue of highly imbalanced data set, which is a frequent case in medical data analysis, we used the SMOTE algorithm for data augmentation. To evaluate our method, we used a data set consisting of 582 patients records, and compared the results of proposed method as well as few machine learning methods. Our experiments show that XGBoost achieved the best prediction performance on the balanced dataset processed by the SMOTE algorithm with an accuracy of 0.82 and F1-score of 0.82. Moreover, our proposed method predicts the occurrence of HE within 6, 12, 18 and 24 h at the accuracy of 0.89, 0.82, 0.87 and 0.94, indicating that the HE occurrence within 24 h can be predicted accurately by the proposed method.


Assuntos
Algoritmos , Hemorragia Cerebral , Hematoma , Humanos , Hemorragia Cerebral/diagnóstico por imagem , Hematoma/diagnóstico por imagem , Tomografia Computadorizada por Raios X , Masculino , Aprendizado de Máquina , Idoso , Pessoa de Meia-Idade , Feminino
7.
BMC Med Inform Decis Mak ; 24(1): 41, 2024 Feb 08.
Artigo em Inglês | MEDLINE | ID: mdl-38331788

RESUMO

In recent years, corneal refractive surgery has been widely used in clinics as an effective means to restore vision and improve the quality of life. When choosing myopia-refractive surgery, it is necessary to comprehensively consider the differences in equipment and technology as well as the specificity of individual patients, which heavily depend on the experience of ophthalmologists. In our study, we took advantage of machine learning to learn about the experience of ophthalmologists in decision-making and assist them in the choice of corneal refractive surgery in a new case. Our study was based on the clinical data of 7,081 patients who underwent corneal refractive surgery between 2000 and 2017 at the Department of Ophthalmology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences. Due to the long data period, there were data losses and errors in this dataset. First, we cleaned the data and deleted the samples of key data loss. Then, patients were divided into three groups according to the type of surgery, after which we used SMOTE technology to eliminate imbalance between groups. Six statistical machine learning models, including NBM, RF, AdaBoost, XGBoost, BP neural network, and DBN were selected, and a ten-fold cross-validation and grid search were used to determine the optimal hyperparameters for better performance. When tested on the dataset, the multi-class RF model showed the best performance, with agreement with ophthalmologist decisions as high as 0.8775 and Macro F1 as high as 0.8019. Furthermore, the results of the feature importance analysis based on the SHAP technique were consistent with an ophthalmologist's practical experience. Our research will assist ophthalmologists in choosing appropriate types of refractive surgery and will have beneficial clinical effects.


Assuntos
Miopia , Procedimentos Cirúrgicos Refrativos , Humanos , Acuidade Visual , Qualidade de Vida , Miopia/cirurgia , Aprendizado de Máquina
8.
BMC Med Inform Decis Mak ; 24(1): 142, 2024 May 27.
Artigo em Inglês | MEDLINE | ID: mdl-38802836

RESUMO

Lung cancer remains a leading cause of cancer-related mortality globally, with prognosis significantly dependent on early-stage detection. Traditional diagnostic methods, though effective, often face challenges regarding accuracy, early detection, and scalability, being invasive, time-consuming, and prone to ambiguous interpretations. This study proposes an advanced machine learning model designed to enhance lung cancer stage classification using CT scan images, aiming to overcome these limitations by offering a faster, non-invasive, and reliable diagnostic tool. Utilizing the IQ-OTHNCCD lung cancer dataset, comprising CT scans from various stages of lung cancer and healthy individuals, we performed extensive preprocessing including resizing, normalization, and Gaussian blurring. A Convolutional Neural Network (CNN) was then trained on this preprocessed data, and class imbalance was addressed using Synthetic Minority Over-sampling Technique (SMOTE). The model's performance was evaluated through metrics such as accuracy, precision, recall, F1-score, and ROC curve analysis. The results demonstrated a classification accuracy of 99.64%, with precision, recall, and F1-score values exceeding 98% across all categories. SMOTE significantly enhanced the model's ability to classify underrepresented classes, contributing to the robustness of the diagnostic tool. These findings underscore the potential of machine learning in transforming lung cancer diagnostics, providing high accuracy in stage classification, which could facilitate early detection and tailored treatment strategies, ultimately improving patient outcomes.


Assuntos
Neoplasias Pulmonares , Redes Neurais de Computação , Tomografia Computadorizada por Raios X , Humanos , Neoplasias Pulmonares/diagnóstico por imagem , Neoplasias Pulmonares/classificação , Aprendizado de Máquina , Processamento de Imagem Assistida por Computador/métodos , Aprendizado Profundo
9.
Sensors (Basel) ; 24(3)2024 Jan 23.
Artigo em Inglês | MEDLINE | ID: mdl-38339452

RESUMO

Advancements in sensing technology have expanded the capabilities of both wearable devices and smartphones, which are now commonly equipped with inertial sensors such as accelerometers and gyroscopes. Initially, these sensors were used for device feature advancement, but now, they can be used for a variety of applications. Human activity recognition (HAR) is an interesting research area that can be used for many applications like health monitoring, sports, fitness, medical purposes, etc. In this research, we designed an advanced system that recognizes different human locomotion and localization activities. The data were collected from raw sensors that contain noise. In the first step, we detail our noise removal process, which employs a Chebyshev type 1 filter to clean the raw sensor data, and then the signal is segmented by utilizing Hamming windows. After that, features were extracted for different sensors. To select the best feature for the system, the recursive feature elimination method was used. We then used SMOTE data augmentation techniques to solve the imbalanced nature of the Extrasensory dataset. Finally, the augmented and balanced data were sent to a long short-term memory (LSTM) deep learning classifier for classification. The datasets used in this research were Real-World Har, Real-Life Har, and Extrasensory. The presented system achieved 89% for Real-Life Har, 85% for Real-World Har, and 95% for the Extrasensory dataset. The proposed system outperforms the available state-of-the-art methods.


Assuntos
Exercício Físico , Dispositivos Eletrônicos Vestíveis , Humanos , Locomoção , Atividades Humanas , Reconhecimento Psicológico
10.
BMC Med Res Methodol ; 23(1): 101, 2023 04 22.
Artigo em Inglês | MEDLINE | ID: mdl-37087425

RESUMO

BACKGROUND: Trauma is one of the most critical public health issues worldwide, leading to death and disability and influencing all age groups. Therefore, there is great interest in models for predicting mortality in trauma patients admitted to the ICU. The main objective of the present study is to develop and evaluate SMOTE-based machine-learning tools for predicting hospital mortality in trauma patients with imbalanced data. METHODS: This retrospective cohort study was conducted on 126 trauma patients admitted to an intensive care unit at Besat hospital in Hamadan Province, western Iran, from March 2020 to March 2021. Data were extracted from the medical information records of patients. According to the imbalanced property of the data, SMOTE techniques, namely SMOTE, Borderline-SMOTE1, Borderline-SMOTE2, SMOTE-NC, and SVM-SMOTE, were used for primary preprocessing. Then, the Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), Artificial Neural Network (ANN), Support Vector Machine (SVM), and Extreme Gradient Boosting (XGBoost) methods were used to predict patients' hospital mortality with traumatic injuries. The performance of the methods used was evaluated by sensitivity, specificity, Positive Predictive Value (PPV), Negative Predictive Value (NPV), accuracy, Area Under the Curve (AUC), Geometric Mean (G-means), F1 score, and P-value of McNemar's test. RESULTS: Of the 126 patients admitted to an ICU, 117 (92.9%) survived and 9 (7.1%) died. The mean follow-up time from the date of trauma to the date of outcome was 3.98 ± 4.65 days. The performance of ML algorithms is not good with imbalanced data, whereas the performance of SMOTE-based ML algorithms is significantly improved. The mean area under the ROC curve (AUC) of all SMOTE-based models was more than 91%. F1-score and G-means before balancing the dataset were below 70% for all ML models except ANN. In contrast, F1-score and G-means for the balanced datasets reached more than 90% for all SMOTE-based models. Among all SMOTE-based ML methods, RF and ANN based on SMOTE and XGBoost based on SMOTE-NC achieved the highest value for all evaluation criteria. CONCLUSIONS: This study has shown that SMOTE-based ML algorithms better predict outcomes in traumatic injuries than ML algorithms. They have the potential to assist ICU physicians in making clinical decisions.


Assuntos
Algoritmos , Aprendizado de Máquina , Humanos , Mortalidade Hospitalar , Estudos Retrospectivos , Teorema de Bayes
11.
J Biomed Inform ; 147: 104526, 2023 11.
Artigo em Inglês | MEDLINE | ID: mdl-37852346

RESUMO

PURPOSE: Accurate prediction of the Length of Stay (LoS) and mortality in the Intensive Care Unit (ICU) is crucial for effective hospital management, and it can assist clinicians for real-time demand capacity (RTDC) administration, thereby improving healthcare quality and service levels. METHODS: This paper proposes a novel one-dimensional (1D) multi-scale convolutional neural network architecture, namely 1D-MSNet, to predict inpatients' LoS and mortality in ICU. First, a 1D multi-scale convolution framework is proposed to enlarge the convolutional receptive fields and enhance the richness of the convolutional features. Following the convolutional layers, an atrous causal spatial pyramid pooling (SPP) module is incorporated into the networks to extract high-level features. The optimized Focal Loss (FL) function is combined with the synthetic minority over-sampling technique (SMOTE) to mitigate the imbalanced-class issue. RESULTS: On the MIMIC-IV v1.0 benchmark dataset, the proposed approach achieves the optimum R-Square and RMSE values of 0.57 and 3.61 for the LoS prediction, and the highest test accuracy of 97.73% for the mortality prediction. CONCLUSION: The proposed approach presents a superior performance in comparison with other state-of-the-art, and it can effectively perform the LoS and mortality prediction tasks.


Assuntos
Aprendizado Profundo , Humanos , Tempo de Internação , Pacientes Internados , Redes Neurais de Computação , Unidades de Terapia Intensiva
12.
Sensors (Basel) ; 23(4)2023 Feb 06.
Artigo em Inglês | MEDLINE | ID: mdl-36850409

RESUMO

Selecting the best planting area for blueberries is an essential issue in agriculture. To better improve the effectiveness of blueberry cultivation, a machine learning-based classification model for blueberry ecological suitability was proposed for the first time and its validation was conducted by using multi-source environmental features data in this paper. The sparrow search algorithm (SSA) was adopted to optimize the CatBoost model and classify the ecological suitability of blueberries based on the selection of data features. Firstly, the Borderline-SMOTE algorithm was used to balance the number of positive and negative samples. The Variance Inflation Factor and information gain methods were applied to filter out the factors affecting the growth of blueberries. Subsequently, the processed data were fed into the CatBoost for training, and the parameters of the CatBoost were optimized to obtain the optimal model using SSA. Finally, the SSA-CatBoost model was adopted to classify the ecological suitability of blueberries and output the suitability types. Taking a study on a blueberry plantation in Majiang County, Guizhou Province, China as an example, the findings demonstrate that the AUC value of the SSA-CatBoost-based blueberry ecological suitability model is 0.921, which is 2.68% higher than that of the CatBoost (AUC = 0.897) and is significantly higher than Logistic Regression (AUC = 0.855), Support Vector Machine (AUC = 0.864), and Random Forest (AUC = 0.875). Furthermore, the ecological suitability of blueberries in Majiang County is mapped according to the classification results of different models. When comparing the actual blueberry cultivation situation in Majiang County, the classification results of the SSA-CatBoost model proposed in this paper matches best with the real blueberry cultivation situation in Majiang County, which is of a high reference value for the selection of blueberry cultivation sites.


Assuntos
Mirtilos Azuis (Planta) , Utensílios Domésticos , Agricultura , Algoritmos , China
13.
Sensors (Basel) ; 23(4)2023 Feb 08.
Artigo em Inglês | MEDLINE | ID: mdl-36850526

RESUMO

Due to the tremendous expectations placed on batteries to produce a reliable and secure product, fault detection has become a critical part of the manufacturing process. Manually, it takes much labor and effort to test each battery individually for manufacturing faults including burning, welding that is too high, missing welds, shifting, welding holes, and so forth. Additionally, manual battery fault detection takes too much time and is extremely expensive. We solved this issue by using image processing and machine learning techniques to automatically detect faults in the battery manufacturing process. Our approach will reduce the need for human intervention, save time, and be easy to implement. A CMOS camera was used to collect a large number of images belonging to eight common battery manufacturing faults. The welding area of the batteries' positive and negative terminals was captured from different distances, between 40 and 50 cm. Before deploying the learning models, first, we used the CNN for feature extraction from the image data. To over-sample the dataset, we used the Synthetic Minority Over-sampling Technique (SMOTE) since the dataset was highly imbalanced, resulting in over-fitting of the learning model. Several machine learning and deep learning models were deployed on the CNN-extracted features and over-sampled data. Random forest achieved a significant 84% accuracy with our proposed approach. Additionally, we applied K-fold cross-validation with the proposed approach to validate the significance of the approach, and the logistic regression achieved an 81.897% mean accuracy score and a +/- 0.0255 standard deviation.

14.
Sensors (Basel) ; 23(4)2023 Feb 11.
Artigo em Inglês | MEDLINE | ID: mdl-36850659

RESUMO

Adaptive machine learning has increasing importance due to its ability to classify a data stream and handle the changes in the data distribution. Various resources, such as wearable sensors and medical devices, can generate a data stream with an imbalanced distribution of classes. Many popular oversampling techniques have been designed for imbalanced batch data rather than a continuous stream. This work proposes a self-adjusting window to improve the adaptive classification of an imbalanced data stream based on minimizing cluster distortion. It includes two models; the first chooses only the previous data instances that preserve the coherence of the current chunk's samples. The second model relaxes the strict filter by excluding the examples of the last chunk. Both models include generating synthetic points for oversampling rather than the actual data points. The evaluation of the proposed models using the Siena EEG dataset showed their ability to improve the performance of several adaptive classifiers. The best results have been obtained using Adaptive Random Forest in which Sensitivity reached 96.83% and Precision reached 99.96%.

15.
Sensors (Basel) ; 24(1)2023 Dec 26.
Artigo em Inglês | MEDLINE | ID: mdl-38202990

RESUMO

In the context of 6G technology, the Internet of Everything aims to create a vast network that connects both humans and devices across multiple dimensions. The integration of smart healthcare, agriculture, transportation, and homes is incredibly appealing, as it allows people to effortlessly control their environment through touch or voice commands. Consequently, with the increase in Internet connectivity, the security risk also rises. However, the future is centered on a six-fold increase in connectivity, necessitating the development of stronger security measures to handle the rapidly expanding concept of IoT-enabled metaverse connections. Various types of attacks, often orchestrated using botnets, pose a threat to the performance of IoT-enabled networks. Detecting anomalies within these networks is crucial for safeguarding applications from potentially disastrous consequences. The voting classifier is a machine learning (ML) model known for its effectiveness as it capitalizes on the strengths of individual ML models and has the potential to improve overall predictive performance. In this research, we proposed a novel classification technique based on the DRX approach that combines the advantages of the Decision tree, Random forest, and XGBoost algorithms. This ensemble voting classifier significantly enhances the accuracy and precision of network intrusion detection systems. Our experiments were conducted using the NSL-KDD, UNSW-NB15, and CIC-IDS2017 datasets. The findings of our study show that the DRX-based technique works better than the others. It achieved a higher accuracy of 99.88% on the NSL-KDD dataset, 99.93% on the UNSW-NB15 dataset, and 99.98% on the CIC-IDS2017 dataset, outperforming the other methods. Additionally, there is a notable reduction in the false positive rates to 0.003, 0.001, and 0.00012 for the NSL-KDD, UNSW-NB15, and CIC-IDS2017 datasets.

16.
J Environ Manage ; 344: 118682, 2023 Oct 15.
Artigo em Inglês | MEDLINE | ID: mdl-37567005

RESUMO

Machine learning (ML)-based urban waterlogging susceptibility studies suffer from class imbalance, as fewer positive samples are generally available than potential negative samples. Few studies have considered optimizing the results by improving the quality of training samples. To address this issue, we explored effective approaches to reliably increase the numbers of positive samples for such studies. The Synthetic Minority Over-Sampling Technique (SMOTE) and Optimized Seed Spread Algorithm (OSSA), representative of oversampling (synthesizing new samples based on the feature space) and physical (simulating potential inundated area based on the mechanisms of water flow) approaches, respectively, were employed to increase the number of positive samples. Waterlogging in Shenzhen was selected as a case study using eight selected spatial variables. An elaborate experiment was conducted to compare the quality of added samples based on the classifiers' performance and accuracy of waterlogging susceptibility maps (WSMs). The results indicated that (1) the performance of classifiers generated with SMOTE was worse than the original samples, while the use of OSSA improved the trained classifiers, and (2) the accuracy of WSMs was not improved with SMOTE but increased markedly with OSSA. These results may be driven by the diversity of information and features of the added samples. This study indicates the use of SMOTE fails to synthesize reliable samples when applied to waterlogging analysis in Shenzhen, whereas an effective solution for generating reliable positive samples is to use OSSA that simulates the potential submerged regions based on the mechanisms of disaster occurrence and spread.


Assuntos
Algoritmos , Desastres , Aprendizado de Máquina
17.
Molecules ; 28(4)2023 Feb 09.
Artigo em Inglês | MEDLINE | ID: mdl-36838652

RESUMO

The prediction of drug-target interactions (DTIs) is a vital step in drug discovery. The success of machine learning and deep learning methods in accurately predicting DTIs plays a huge role in drug discovery. However, when dealing with learning algorithms, the datasets used are usually highly dimensional and extremely imbalanced. To solve this issue, the dataset must be resampled accordingly. In this paper, we have compared several data resampling techniques to overcome class imbalance in machine learning methods as well as to study the effectiveness of deep learning methods in overcoming class imbalance in DTI prediction in terms of binary classification using ten (10) cancer-related activity classes from BindingDB. It is found that the use of Random Undersampling (RUS) in predicting DTIs severely affects the performance of a model, especially when the dataset is highly imbalanced, thus, rendering RUS unreliable. It is also found that SVM-SMOTE can be used as a go-to resampling method when paired with the Random Forest and Gaussian Naïve Bayes classifiers, whereby a high F1 score is recorded for all activity classes that are severely and moderately imbalanced. Additionally, the deep learning method called Multilayer Perceptron recorded high F1 scores for all activity classes even when no resampling method was applied.


Assuntos
Aprendizado Profundo , Teorema de Bayes , Aprendizado de Máquina , Redes Neurais de Computação , Algoritmos
18.
Appl Intell (Dordr) ; : 1-43, 2023 Feb 09.
Artigo em Inglês | MEDLINE | ID: mdl-36785593

RESUMO

Software Fault Prediction (SFP) is an important process to detect the faulty components of the software to detect faulty classes or faulty modules early in the software development life cycle. In this paper, a machine learning framework is proposed for SFP. Initially, pre-processing and re-sampling techniques are applied to make the SFP datasets ready to be used by ML techniques. Thereafter seven classifiers are compared, namely K-Nearest Neighbors (KNN), Naive Bayes (NB), Linear Discriminant Analysis (LDA), Linear Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), and Random Forest (RF). The RF classifier outperforms all other classifiers in terms of eliminating irrelevant/redundant features. The performance of RF is improved further using a dimensionality reduction method called binary whale optimization algorithm (BWOA) to eliminate the irrelevant/redundant features. Finally, the performance of BWOA is enhanced by hybridizing the exploration strategies of the grey wolf optimizer (GWO) and harris hawks optimization (HHO) algorithms. The proposed method is called SBEWOA. The SFP datasets utilized are selected from the PROMISE repository using sixteen datasets for software projects with different sizes and complexity. The comparative evaluation against nine well-established feature selection methods proves that the proposed SBEWOA is able to significantly produce competitively superior results for several instances of the evaluated dataset. The algorithms' performance is compared in terms of accuracy, the number of features, and fitness function. This is also proved by the 2-tailed P-values of the Wilcoxon signed ranks statistical test used. In conclusion, the proposed method is an efficient alternative ML method for SFP that can be used for similar problems in the software engineering domain.

19.
Int J Inf Secur ; : 1-19, 2023 Apr 11.
Artigo em Inglês | MEDLINE | ID: mdl-37360930

RESUMO

Along with the advancement of online platforms and significant growth in Internet usage, various threats and cyber-attacks have been emerging and become more complicated and perilous in a day-by-day base. Anomaly-based intrusion detection systems (AIDSs) are lucrative techniques for dealing with cybercrimes. As a relief, AIDS can be equipped with artificial intelligence techniques to validate traffic contents and tackle diverse illicit activities. A variety of methods have been proposed in the literature in recent years. Nevertheless, several important challenges like high false alarm rates, antiquated datasets, imbalanced data, insufficient preprocessing, lack of optimal feature subset, and low detection accuracy in different types of attacks have still remained to be solved. In order to alleviate these shortcomings, in this research a novel intrusion detection system that efficiently detects various types of attacks is proposed. In preprocessing, Smote-Tomek link algorithm is utilized to create balanced classes and produce a standard CICIDS dataset. The proposed system is based on gray wolf and Hunger Games Search (HGS) meta-heuristic algorithms to select feature subsets and detect different attacks such as distributed denial of services, Brute force, Infiltration, Botnet, and Port Scan. Also, to improve exploration and exploitation and boost the convergence speed, genetic algorithm operators are combined with standard algorithms. Using the proposed feature selection technique, more than 80 percent of irrelevant features are removed from the dataset. The behavior of the network is modeled using nonlinear quadratic regression and optimized utilizing the proposed hybrid HGS algorithm. The results show the superior performance of the hybrid algorithm of HGS compared to the baseline algorithms and the well-known research. As shown in the analogy, the proposed model obtained an average test accuracy rate of 99.17%, which has better performance than the baseline algorithm with 94.61% average accuracy.

20.
BMC Bioinformatics ; 23(1): 496, 2022 Nov 18.
Artigo em Inglês | MEDLINE | ID: mdl-36401182

RESUMO

Classification of different cancer types is an essential step in designing a decision support model for early cancer predictions. Using various machine learning (ML) techniques with ensemble learning is one such method used for classifications. In the present study, various ML algorithms were explored on twenty exome datasets, belonging to 5 cancer types. Initially, a data clean-up was carried out on 4181 variants of cancer with 88 features, and a derivative dataset was obtained using natural language processing and probabilistic distribution. An exploratory dataset analysis using principal component analysis was then performed in 1 and 2D axes to reduce the high-dimensionality of the data. To significantly reduce the imbalance in the derivative dataset, oversampling was carried out using SMOTE. Further, classification algorithms such as K-nearest neighbour and support vector machine were used initially on the oversampled dataset. A 4-layer artificial neural network model with 1D batch normalization was also designed to improve the model accuracy. Ensemble ML techniques such as bagging along with using KNN, SVM and MLPs as base classifiers to improve the weighted average performance metrics of the model. However, due to small sample size, model improvement was challenging. Therefore, a novel method to augment the sample size using generative adversarial network (GAN) and triplet based variational auto encoder (TVAE) was employed that reconstructed the features and labels generating the data. The results showed that from initial scrutiny, KNN showed a weighted average of 0.74 and SVM 0.76. Oversampling ensured that the accuracy of the derivative dataset improved significantly and the ensemble classifier augmented the accuracy to 82.91%, when the data was divided into 70:15:15 ratio (training, test and holdout datasets). The overall evaluation metric value when GAN and TVAE increased the sample size was found to be 0.92 with an overall comparison model of 0.66. Therefore, the present study designed an effective model for classifying cancers which when implemented to real world samples, will play a major role in early cancer diagnosis.


Assuntos
Exoma , Neoplasias , Humanos , Exoma/genética , Detecção Precoce de Câncer , Aprendizado de Máquina , Algoritmos , Neoplasias/diagnóstico , Neoplasias/genética
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA