Results 1 - 20 of 36
1.
Physiol Meas ; 45(5), 2024 May 21.
Article in English | MEDLINE | ID: mdl-38697206

ABSTRACT

Objective. Myocarditis poses a significant health risk, is often precipitated by viral infections such as coronavirus disease, and can lead to fatal cardiac complications. The standard diagnostic practice, endomyocardial biopsy, is highly invasive and thus limited to severe cases; cardiac magnetic resonance (CMR) imaging offers a promising, less invasive alternative for detecting myocardial abnormalities. Approach. This study introduces a deep model called ELRL-MD that combines ensemble learning and reinforcement learning (RL) for effective myocarditis diagnosis from CMR images. The model begins with pre-training via the artificial bee colony (ABC) algorithm to improve the starting point for learning. An array of convolutional neural networks (CNNs) then works in concert to extract and integrate features from CMR images for accurate diagnosis. Using the Z-Alizadeh Sani myocarditis CMR dataset, the model employs RL to handle the dataset's imbalance by framing diagnosis as a decision-making process. Main results. ELRL-MD surpasses other deep learning, conventional machine learning, and transfer learning models, achieving an F-measure of 88.2% and a geometric mean of 90.6%. Extensive experimentation helped pinpoint the optimal reward function settings and the optimal number of CNNs. Significance. The study addresses two primary technical challenges: the inherent data imbalance in CMR imaging datasets and the risk of models converging on local optima due to suboptimal initial weight settings. Further analysis, leaving out the ABC and RL components, confirmed their contributions to the model's overall performance, underscoring the value of addressing these technical challenges.


Subject(s)
Deep Learning , Magnetic Resonance Imaging , Myocarditis , Myocarditis/diagnostic imaging , Humans , Image Processing, Computer-Assisted/methods , Neural Networks, Computer
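
The F-measure and geometric mean reported in this record are standard metrics for imbalanced classification. The following is a minimal sketch of how they are commonly computed from confusion-matrix counts; the function names and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def f_measure(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def geometric_mean(tp, fp, tn, fn):
    """G-mean: square root of sensitivity times specificity."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


def accuracy(tp, fp, tn, fn):
    """Plain accuracy, shown only for contrast with the two metrics above."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical counts: 50 positives among 1000 samples.
# A classifier that labels everything negative gets tp=0, fp=0, tn=950, fn=50:
# its accuracy is high, while F-measure and G-mean are both zero.
trivial = dict(tp=0, fp=0, tn=950, fn=50)
```

Both metrics drop to zero for a classifier that ignores the minority class, which is why they are preferred over accuracy on imbalanced data.
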
2.
Sci Rep ; 14(1): 9560, 2024 Apr 26.
Article in English | MEDLINE | ID: mdl-38671139

ABSTRACT

The number of patents is increasing quickly, and more and more low-quality patents are emerging. Identifying high-quality patents in massive data quickly and accurately is important for organizational R&D decision-making and patent layout. However, because high-quality patents make up only a small percentage of all patents, identifying them efficiently is challenging. To address this problem, we reconstruct the existing index system for identifying high-quality patents by adding four features that capture the technological strength of patentees. Furthermore, we propose an improved model that integrates a resampling technique with an ensemble learning algorithm. First, generative adversarial networks (GANs) are used to expand the minority samples. Second, the Extreme Gradient Boosting algorithm (XGBoost) with Bayesian optimization (BO) is used to identify high-quality patents. For clarity, this model is called the GAN-BO-XGBoost model. To test the effectiveness of the model, we use patent data from the field of lithography technology. Tenfold cross-validation is carried out to compare the performance of our proposed model with that of other models. The results show that the GAN-BO-XGBoost model performs better and is more stable than the other models.

3.
IEEE Trans Artif Intell ; 5(1): 80-91, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38500544

ABSTRACT

Deep learning models have recently performed remarkably well on many classification tasks. The superior performance of deep neural networks relies on a large amount of training data, which must also have an equal class distribution for training to be efficient. However, in most real-world applications, the labeled data may be limited, with high imbalance ratios among the classes; the learning process of most classification algorithms is thus adversely affected, resulting in unstable predictions and low performance. Three main categories of approaches address the problem of imbalanced learning: data-level methods, algorithm-level methods, and hybrid methods that combine the two. Data generative methods are typically based on generative adversarial networks, which require significant amounts of data, while model-level methods entail extensive domain expert knowledge to craft the learning objectives, making them less accessible to users without such knowledge. Moreover, the vast majority of these approaches are designed for and applied to imaging applications, fewer to time series, and extremely few to both. To address the above issues, we introduce GENDA, a generative neighborhood-based deep autoencoder, which is simple yet effective in its design and can be successfully applied to both image and time-series data. GENDA is based on learning latent representations that rely on the neighboring embedding space of the samples. Extensive experiments conducted on a variety of widely used real datasets demonstrate the efficacy of the proposed method.
Impact Statement: Imbalanced data classification is a current and important issue in many real-world learning applications, hampering most classification tasks. Fraud detection, biomedical imaging that separates healthy people from patients, and object detection are some indicative domains with economic, social, and technological impact that are greatly affected by inherently imbalanced data distributions. However, the majority of existing algorithms that address the imbalanced classification problem are designed with a particular application in mind, and thus can be used only with specific datasets and even specific hyperparameters. The generative model introduced in this paper overcomes this limitation and produces improved results for a large class of imaging and time-series data even under severe imbalance ratios, making it quite competitive.

4.
Math Biosci Eng ; 21(1): 1472-1488, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38303473

ABSTRACT

Non-classical secreted proteins (NCSPs) are proteins that are located in the extracellular environment despite the absence of signal peptides and motifs. They usually play different roles in intercellular communication. Therefore, the accurate prediction of NCSPs is a critical step toward understanding their associated secretion mechanisms in depth. Since the experimental recognition of NCSPs is often costly and time-consuming, computational methods are desired. In this study, we proposed an ensemble learning framework, termed NCSP-PLM, for the identification of NCSPs by extracting feature embeddings from pre-trained protein language models (PLMs) as input to several fine-tuned deep learning models. First, we compared the performance of nine PLM embeddings by training three neural networks, a multi-layer perceptron (MLP), an attention mechanism, and a bidirectional long short-term memory network (BiLSTM), and selected the best network model for each PLM embedding. Then, four models were excluded due to their below-average accuracies, and the remaining five models were integrated to predict NCSPs by weighted voting. Finally, 5-fold cross-validation and an independent test were conducted to evaluate the performance of NCSP-PLM on the benchmark datasets. On the same independent dataset, the sensitivity and specificity of NCSP-PLM were 91.18% and 97.06%, respectively. In particular, the overall accuracy of our model reached 94.12%, which was 7-16% higher than that of the existing state-of-the-art predictors. These results indicate that NCSP-PLM could serve as a useful tool for the annotation of NCSPs.


Subject(s)
Deep Learning , Neural Networks, Computer , Proteins , Language , Sensitivity and Specificity
5.
Comput Biol Med ; 170: 108063, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38301519

ABSTRACT

Cancer is a serious malignant disease that is difficult to cure. Chemotherapy, a primary treatment for cancer, causes significant harm to normal cells in the body and is often accompanied by serious side effects. Recently, anti-cancer peptides (ACPs), a type of protein for treating cancers, have dominated research into the development of new anti-tumor drugs because of their ability to specifically target and destroy cancer cells. Screening proteins with cancer-inhibiting properties from a large pool of proteins is key to the development of anti-tumor drugs. However, because of their complex structure, it is expensive and inefficient to accurately identify protein functions through biological experiments alone. Therefore, we propose a new prediction model, ACP-ML, to effectively predict ACPs. For feature extraction, DPC, PseAAC, CTDC, CTDT and CS-Pse-PSSM features were used, and the optimal feature set was selected by comparing combinations of these features. Then, a two-step feature selection process using the MRMD and RFE algorithms was performed to determine the most crucial features in the optimal feature set for identifying ACPs. Furthermore, we assessed the classification accuracy of single learning models and of ensemble models based on different strategies through ten-fold cross-validation. Ultimately, a voting-based ensemble learning method was developed to predict ACPs. To validate its effectiveness, two independent test sets were used, achieving accuracies of 90.891% and 92.578%, respectively. Compared with existing anticancer peptide prediction algorithms, the proposed feature processing method is more effective, and the proposed ensemble model ACP-ML exhibits stronger generalization capability and higher accuracy.


Subject(s)
Antineoplastic Agents , Neoplasms , Humans , Computational Biology/methods , Peptides/chemistry , Proteins , Algorithms , Neoplasms/drug therapy , Antineoplastic Agents/pharmacology , Antineoplastic Agents/therapeutic use
6.
J Sci Food Agric ; 104(4): 1984-1991, 2024 Mar 15.
Article in English | MEDLINE | ID: mdl-37899531

ABSTRACT

BACKGROUND: Paralytic shellfish poisoning caused by human consumption of shellfish fed on toxic algae is a public health hazard. It is essential to implement shellfish monitoring programs to minimize the possibility of shellfish contaminated by paralytic shellfish toxins (PST) reaching the marketplace. RESULTS: This paper proposes a rapid detection method for PST in mussels using near-infrared spectroscopy (NIRS) technology. Spectral data in the wavelength range of 950-1700 nm for PST-contaminated and non-contaminated mussel samples were used to build the detection model. Near-Bayesian support vector machines (NBSVM) with unequal misclassification costs (u-NBSVM) were applied to solve a classification problem arising from the fact that, in practice, the quantity of PST-contaminated mussels was far less than that of non-contaminated mussels. The u-NBSVM model performed adequately on imbalanced datasets by combining unequal misclassification costs and decision boundary shifts. Because the misclassification costs were adjusted, the detection performance of the u-NBSVM did not decline as the number of PST samples decreased. When the number of PST samples was 20, the G-mean and accuracy reached 0.9898 and 0.9944, respectively. CONCLUSION: Compared with the traditional support vector machines (SVMs) and the NBSVM, the u-NBSVM model achieved better detection performance. The results of this study indicate that NIRS technology combined with the u-NBSVM model can be used for rapid and non-destructive PST detection in mussels. © 2023 Society of Chemical Industry.


Subject(s)
Bivalvia , Support Vector Machine , Animals , Humans , Bayes Theorem , Spectroscopy, Near-Infrared , Bivalvia/chemistry , Shellfish/analysis
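
The idea of combining unequal misclassification costs with a shifted decision boundary can be illustrated with the textbook cost-sensitive decision rule. This is a generic sketch, not the u-NBSVM itself; the cost values and function names are assumptions made for illustration.

```python
def cost_sensitive_threshold(cost_fp, cost_fn):
    """Probability threshold that minimizes expected misclassification cost.

    Predicting positive is cheaper when p * cost_fn >= (1 - p) * cost_fp,
    which rearranges to p >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def classify(p_positive, cost_fp=1.0, cost_fn=1.0):
    """Return 1 (positive) or 0 (negative) under the given costs."""
    return int(p_positive >= cost_sensitive_threshold(cost_fp, cost_fn))


# With equal costs the boundary sits at 0.5. If missing a contaminated
# sample is assumed to be nine times as costly as a false alarm, the
# boundary shifts to 0.1, so weaker evidence is enough to flag a sample.
```

Raising the cost of a false negative moves the boundary toward the majority class, which is the same effect a boundary shift has in a margin-based classifier.
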
7.
Physiol Meas ; 44(12), 2023 Dec 29.
Article in English | MEDLINE | ID: mdl-38081126

ABSTRACT

Objective. Pre-participation medical screening of athletes is necessary to pinpoint individuals susceptible to cardiovascular events. Approach. The article presents a reinforcement learning (RL)-based multilayer perceptron, termed MLP-RL-CRD, designed to detect cardiovascular risk among athletes. The model was trained on a publicized dataset that included the anthropometric measurements (such as height and weight) and biomedical metrics (covering blood pressure and pulse rate) of 26 002 athletes. To address the data imbalance, a novel RL-based technique was adopted. The problem was framed as a series of sequential decisions in which an agent classified a received instance and received a reward at each step. To resolve the sensitivity of conventional gradient-based learning methods to initialization, a mutual learning-based artificial bee colony (ML-ABC) was proposed. Main results. The model outcomes were validated against positive (P) and negative (N) ECG findings that had been labeled by experts to signify individuals 'at risk' and 'not at risk,' respectively. The MLP-RL-CRD approach achieves superior outcomes (F-measure 87.4%; geometric mean 89.6%) compared with other deep models and traditional machine learning techniques. Optimal values for crucial parameters, including the reward function, were identified for the model based on experiments on the study dataset. Ablation studies, which omitted elements of the suggested model, affirmed the independent, positive, stepwise influence of these components on model performance. Significance. This study introduces a novel, effective method for early cardiovascular risk detection in athletes, merging reinforcement learning and multilayer perceptrons, advancing medical screening and predictive healthcare. The results could have far-reaching implications for athlete health management and the broader field of predictive healthcare analytics.


Subject(s)
Cardiovascular Diseases , Humans , Cardiovascular Diseases/diagnosis , Risk Factors , Neural Networks, Computer , Machine Learning , Athletes
8.
Digit Health ; 9: 20552076231211550, 2023.
Article in English | MEDLINE | ID: mdl-37936958

ABSTRACT

Objective: Sleep apnea is a common sleep disorder affecting a significant portion of the population, but many apnea patients remain undiagnosed because existing clinical tests are invasive and expensive. This study aimed to develop a method for easy sleep apnea screening. Methods: Three supervised machine learning algorithms (logistic regression, support vector machine, and light gradient boosting machine) were applied to develop apnea screening models at two apnea-hypopnea index cutoff thresholds: ≥ 5 and ≥ 30 events/hour. The SpO2 recordings of the Sleep Heart Health Study database (N = 5786) were used for model training, validation, and testing. Multiscale entropy analysis was performed to derive a set of multiscale attention entropy features from the SpO2 recordings. Demographic features including age, sex, body mass index, and blood pressure were also used. The dependency among the multiscale attention entropy features was handled with independent component analysis. Results: For the cutoff of ≥ 5 events/hour, the logistic regression model achieved the highest Matthews correlation coefficient (0.402) and area under the curve (0.747), and reasonably good sensitivity (75.38%), specificity (74.02%), and positive predictive value (92.94%). For the cutoff of ≥ 30 events/hour, the support vector machine model achieved the highest Matthews correlation coefficient (0.545) and area under the curve (0.823), and good sensitivity (82.00%), specificity (82.69%), and negative predictive value (95.53%). Conclusions: Our models achieved better performance than existing methods and have the potential to be integrated with home-use pulse oximeters.
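
The Matthews correlation coefficient used to rank the models above takes all four confusion-matrix cells into account, which makes it informative when classes are imbalanced. A minimal sketch of the standard formula follows; the function name and the convention of returning 0 for an undefined denominator are choices made here, not details from the study.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) to +1 (perfect prediction).
    Returns 0.0 when the denominator is zero, a common convention.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

A model that labels every recording as apnea gets an MCC of 0 under this convention, even if most recordings really are positive.
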

9.
Sensors (Basel) ; 23(17), 2023 Aug 29.
Article in English | MEDLINE | ID: mdl-37687952

ABSTRACT

With the rapid development of the Internet of Things (IoT), attackers increasingly use botnets to control IoT devices in order to perform distributed denial-of-service (DDoS) attacks and other cyber attacks on the internet. In actual attacks, the small percentage of attack packets in IoT traffic leads to low intrusion detection accuracy. To address this problem, the paper proposes an oversampling algorithm, KG-SMOTE, based on Gaussian distribution and K-means clustering. It inserts synthetic samples drawn from a Gaussian probability distribution, extends the clustering nodes in the minority class samples in the same proportion, and increases the density and amount of minority class sample data in order to provide data support for IoT-based DDoS attack detection. Experiments show that the balanced dataset generated by this method effectively improves the intrusion detection accuracy in each category and effectively solves the data imbalance problem.
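
The abstract gives no algorithmic detail for KG-SMOTE, so the following is only a simplified sketch of the general idea it describes: minority samples are assumed to be already grouped into clusters (for example by K-means), and each cluster receives Gaussian-distributed synthetic samples in proportion to its size. The function name, the per-feature independent Gaussian, and the seed parameter are all assumptions.

```python
import random
import statistics


def gaussian_oversample(clusters, n_new, seed=0):
    """Draw synthetic minority samples around each cluster.

    clusters: list of clusters, each a list of equal-length feature tuples.
    n_new:    approximate number of synthetic samples to generate; each
              cluster gets a share proportional to its size (rounded).
    """
    rng = random.Random(seed)
    total = sum(len(cluster) for cluster in clusters)
    synthetic = []
    for cluster in clusters:
        share = round(n_new * len(cluster) / total)
        columns = list(zip(*cluster))  # one tuple of values per feature
        means = [statistics.fmean(col) for col in columns]
        stds = [statistics.pstdev(col) for col in columns]
        for _ in range(share):
            # Sample each feature independently from N(mean, std).
            synthetic.append(tuple(rng.gauss(m, s) for m, s in zip(means, stds)))
    return synthetic


# Hypothetical two-feature minority data in two clusters of sizes 2 and 4.
clusters = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(10.0, 10.0), (11.0, 11.0), (12.0, 12.0), (13.0, 13.0)],
]
new_points = gaussian_oversample(clusters, n_new=6, seed=1)
```

Allocating samples in proportion to cluster size keeps the relative density of the minority clusters unchanged while raising the overall minority count.
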

10.
Cancers (Basel) ; 15(12), 2023 Jun 08.
Article in English | MEDLINE | ID: mdl-37370717

ABSTRACT

Valvular Heart Disease (VHD) is a known late complication of radiotherapy for childhood cancer (CC), and identifying high-risk survivors correctly remains a challenge. This paper focuses on the distribution of the radiation dose absorbed by heart tissues. We propose that a dosiomics signature could provide insight into the spatial characteristics of the heart dose associated with a VHD, beyond the already-established risk induced by high doses. We analyzed data from the 7670 survivors of the French Childhood Cancer Survivors' Study (FCCSS), 3902 of whom were treated with radiotherapy. In all, 63 (1.6%) survivors that had been treated with radiotherapy experienced a VHD, and 57 of them had heterogeneous heart doses. From the heart-dose distribution of each survivor, we extracted 93 first-order and spatial dosiomics features. We trained random forest algorithms adapted for imbalanced classification and evaluated their predictive performance compared to the performance of standard mean heart dose (MHD)-based models. Sensitivity analyses were also conducted for sub-populations of survivors with spatially heterogeneous heart doses. Our results suggest that MHD and dosiomics-based models performed equally well globally in our cohort and that, when considering the sub-population having received a spatially heterogeneous dose distribution, the predictive capability of the models is significantly improved by the use of the dosiomics features. If these findings are further validated, the dosiomics signature may be incorporated into machine learning algorithms for radiation-induced VHD risk assessment and, in turn, into the personalized refinement of follow-up guidelines.

11.
BMC Res Notes ; 16(1): 11, 2023 Feb 02.
Article in English | MEDLINE | ID: mdl-36732807

ABSTRACT

OBJECTIVES: Antibiotic resistance is a rising global threat to human health and is prompting researchers to seek effective alternatives to conventional antibiotics, which include antimicrobial peptides (AMPs). Recently, we have reported AMPlify, an attentive deep learning model for predicting AMPs in databases of peptide sequences. In our tests, AMPlify outperformed the state-of-the-art. We have illustrated its use on data describing the American bullfrog (Rana [Lithobates] catesbeiana) genome. Here we present the model files and training/test data sets we used in that study. The original model (the balanced model) was trained on a balanced set of AMP and non-AMP sequences curated from public databases. In this data note, we additionally provide a model trained on an imbalanced set, in which non-AMP sequences far outnumber AMP sequences. We note that the balanced and imbalanced models would serve different use cases, and both would serve the research community, facilitating the discovery and development of novel AMPs. DATA DESCRIPTION: This data note provides two sets of models, as well as two AMP and four non-AMP sequence sets for training and testing the balanced and imbalanced models. Each model set includes five single sub-models that form an ensemble model. The first model set corresponds to the original model trained on a balanced training set that has been described in the original AMPlify manuscript, while the second model set was trained on an imbalanced training set.


Subject(s)
Antimicrobial Peptides , Deep Learning , Animals , Amino Acid Sequence , Anti-Bacterial Agents , Rana catesbeiana/genetics
12.
J Biol Eng ; 17(1): 7, 2023 Jan 30.
Article in English | MEDLINE | ID: mdl-36717866

ABSTRACT

BACKGROUND: In the current genomic era, gene expression datasets have become one of the main tools utilized in cancer classification. Both the curse of dimensionality and class imbalance are inherent characteristics of these datasets, and they have a negative impact on the performance of most classifiers when used to classify cancer from genomic datasets. RESULTS: This paper introduces the Reduced Noise-Autoencoder (RN-Autoencoder) for pre-processing imbalanced genomic datasets for precise cancer classification. First, RN-Autoencoder addresses the curse of dimensionality by using an autoencoder for feature reduction, generating new extracted data with lower dimensionality. In the next stage, RN-Autoencoder passes the extracted data to the Reduced Noise-Synthetic Minority Over-sampling Technique (RN-SMOTE), which efficiently addresses the class imbalance in the extracted data. RN-Autoencoder has been evaluated using different classifiers and various imbalanced datasets with different imbalance ratios. The results showed that classifier performance improved with RN-Autoencoder, exceeding the performance obtained with the original data and with the extracted data by percentages that depend on the classifier, dataset and evaluation metric. Also, the performance of RN-Autoencoder has been compared with that of the current state of the art, yielding increases of up to 18.017, 19.183, 18.58 and 8.87% in test accuracy on the colon, leukemia, Diffuse Large B-Cell Lymphoma (DLBCL) and Wisconsin Diagnostic Breast Cancer (WDBC) datasets, respectively. CONCLUSION: RN-Autoencoder is a model for cancer classification using imbalanced gene expression datasets. It utilizes the autoencoder to reduce the high dimensionality of the gene expression datasets and then handles the class imbalance using RN-SMOTE. RN-Autoencoder has been evaluated using many different classifiers and many different imbalanced datasets. The performance of many classifiers has improved, and some have succeeded in classifying cancer with 100% performance on all metrics used. In addition, RN-Autoencoder outperformed many recent works using the same datasets.

13.
Comput Biol Med ; 152: 106372, 2023 Jan.
Article in English | MEDLINE | ID: mdl-36516574

ABSTRACT

Uncontrolled proliferation of B-lymphoblast cells is a common characteristic of Acute Lymphoblastic Leukemia (ALL). B-lymphoblasts are found in large numbers in peripheral blood in malignant cases. Early detection of these cells in bone marrow is essential, as the disease progresses rapidly if left untreated. However, automated classification of the cells is challenging, owing to their fine-grained differences from B-lymphoid precursor cells and to imbalanced data points. Deep learning algorithms demonstrate potential for such fine-grained classification but suffer from the class imbalance problem. In this paper, we explore different deep learning-based State-Of-The-Art (SOTA) approaches to tackle imbalanced classification problems. Our experiments include input-based, GAN (Generative Adversarial Network)-based, and loss-based methods to mitigate class imbalance on the challenging C-NMC and ALLIDB-2 datasets for leukemia detection. We show empirically that loss-based methods outperform GAN-based and input-based methods in imbalanced classification scenarios.


Subject(s)
Algorithms , Precursor Cell Lymphoblastic Leukemia-Lymphoma , Humans , Precursor Cell Lymphoblastic Leukemia-Lymphoma/diagnosis , Precursor Cell Lymphoblastic Leukemia-Lymphoma/pathology
14.
J Comput Soc Sci ; 6(1): 91-163, 2023.
Article in English | MEDLINE | ID: mdl-36568019

ABSTRACT

One of the first steps in many text-based social science studies is to retrieve documents that are relevant for an analysis from large corpora of otherwise irrelevant documents. The conventional approach in social science to address this retrieval task is to apply a set of keywords and to consider those documents to be relevant that contain at least one of the keywords. But the application of incomplete keyword lists has a high risk of drawing biased inferences. More complex and costly methods such as query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning could have the potential to more accurately separate relevant from irrelevant documents and thereby reduce the potential size of bias. Yet, whether applying these more expensive approaches increases retrieval performance compared to keyword lists at all, and if so, by how much, is unclear as a comparison of these approaches is lacking. This study closes this gap by comparing these methods across three retrieval tasks associated with a data set of German tweets (Linder in SSRN, 2017. 10.2139/ssrn.3026393), the Social Bias Inference Corpus (SBIC) (Sap et al. in Social bias frames: reasoning about social and power implications of language. In: Jurafsky et al. (eds) Proceedings of the 58th annual meeting of the association for computational linguistics. Association for Computational Linguistics, p 5477-5490, 2020. 10.18653/v1/2020.aclmain.486), and the Reuters-21578 corpus (Lewis in Reuters-21578 (Distribution 1.0). [Data set], 1997. http://www.daviddlewis.com/resources/testcollections/reuters21578/). Results show that query expansion techniques and topic model-based classification rules in most studied settings tend to decrease rather than increase retrieval performance. Active supervised learning, however, if applied on a not too small set of labeled training instances (e.g. 1000 documents), reaches a substantially higher retrieval performance than keyword lists.

15.
J Cheminform ; 14(1): 80, 2022 Nov 10.
Article in English | MEDLINE | ID: mdl-36357942

ABSTRACT

In recent years there has been a dramatic increase in the number of available bioassay datasets, but many of them suffer from an extremely imbalanced distribution between active and inactive compounds. Thus, there is an urgent need for novel approaches to tackle class imbalance in drug discovery. Inspired by recent advances in computer vision, we investigated a panel of alternative loss functions for imbalanced classification in the context of Gradient Boosting and benchmarked them on six datasets from public and proprietary sources, for a total of 42 tasks and 2 million compounds. Our findings show that with these modifications, we achieve statistically significant improvements over the conventional cross-entropy loss function on five out of six datasets. Furthermore, by employing these bespoke loss functions we are able to push Gradient Boosting to match or outperform a wide variety of previously reported classifiers and neural networks. We also investigate the impact of changing the loss function on training time and find that it speeds up convergence by up to 8 times. As such, these results show that tuning the loss function for Gradient Boosting is a straightforward and computationally efficient method to achieve state-of-the-art performance on imbalanced bioassay datasets without compromising on interpretability and scalability.
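
The abstract does not name the alternative loss functions it benchmarks. As one illustration of a loss that originated in computer vision and targets class imbalance, the sketch below compares binary cross-entropy with the focal loss; choosing focal loss here is an assumption, and the function names and the default gamma of 2 are illustrative.

```python
import math


def cross_entropy(y, p):
    """Binary cross-entropy for label y in {0, 1} and predicted P(y=1) = p."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def focal_loss(y, p, gamma=2.0):
    """Focal loss: cross-entropy scaled by (1 - p_t) ** gamma.

    p_t is the probability assigned to the true class, so well-classified
    (easy) examples are down-weighted and hard examples dominate training.
    """
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# For a positive example predicted at p=0.9 (easy) the loss is scaled by
# 0.1 ** 2; predicted at p=0.1 (hard) it is scaled by 0.9 ** 2.
easy_ratio = focal_loss(1, 0.9) / cross_entropy(1, 0.9)
hard_ratio = focal_loss(1, 0.1) / cross_entropy(1, 0.1)
```

To use such a loss inside a gradient boosting library, its first and second derivatives with respect to the raw score would also be needed; they are omitted from this sketch.
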

16.
Math Biosci Eng ; 19(10): 10006-10021, 2022 Jul 13.
Article in English | MEDLINE | ID: mdl-36031980

ABSTRACT

The Electronic Medical Record (EMR) is the data basis of intelligent diagnosis. The diagnosis results of an EMR cover multiple diseases, including normal diagnosis, pathological diagnosis and complications, so intelligent diagnosis can be treated as a multi-label classification problem. The distribution of diagnostic results in EMRs is imbalanced, and the diagnostic results in one EMR are highly coupled. Traditional rebalancing methods do not work effectively on highly coupled imbalanced datasets. This paper proposes an intelligent diagnosis model based on a Double Decoupled Network (DDN), which decouples representation learning and classifier learning. In the representation learning stage, a Convolutional Neural Network (CNN) is used to learn the original features of the data. In the classifier learning stage, a Decoupled and Rebalancing highly Imbalanced Labels (DRIL) algorithm is proposed to decouple the highly coupled diagnostic results and rebalance the datasets, and the balanced datasets are then used to train the classifier. This paper evaluates the proposed DDN on the Chinese Obstetric EMR (COEMR) datasets, and verifies the effectiveness and universality of the model on two benchmark multi-label text classification datasets: the Arxiv Academic Paper Dataset (AAPD) and Reuters Corpus Volume 1 (RCV1). The results demonstrate the effectiveness of the proposed methods on imbalanced obstetric EMRs. The accuracy of the DDN model on the COEMR, AAPD and RCV1 datasets is 84.17, 86.35 and 93.87%, respectively, which is higher than the current best experimental results.


Subject(s)
Algorithms , Neural Networks, Computer , Electronic Health Records
17.
Sensors (Basel) ; 22(14), 2022 Jul 07.
Article in English | MEDLINE | ID: mdl-35890775

ABSTRACT

Time-series representation is the most important task in time-series analysis. One of the most widely employed time-series representation methods is symbolic aggregate approximation (SAX), which converts the results of piecewise aggregate approximation into a symbol sequence. SAX is a simple and effective method; however, it only focuses on the mean value of each segment in the time-series. Here, we propose a novel time-series representation method, distance- and momentum-based symbolic aggregate approximation (DM-SAX), that can preserve time-series distributions by calculating the perpendicular distance from the time-axis to each data point and can account for the time-series trend by adding a momentum factor reflecting the direction of previous data points. Experimental results for 29 highly imbalanced classification problems on the UCR datasets revealed that DM-SAX affords the best area under the curve (AUC) among competing time-series representation methods (SAX, extreme-SAX, overlap-SAX, and distance-based SAX). We statistically verified that performance improvements resulted in significant differences in the rankings. In addition, DM-SAX yielded the best AUC on a real-world wire cutting and crimping process dataset. Meaningful data points such as outliers could be identified in a time-series outlier detection framework via the proposed method.


Subject(s)
Area Under Curve , Motion , Time Factors
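
The baseline SAX method described in this record (z-normalize, average each segment, map each average to a symbol) can be sketched as follows. This shows plain SAX only, not DM-SAX; the four-symbol alphabet, the rounded breakpoints, and the requirement that the series length be divisible by the number of segments are simplifying assumptions.

```python
import bisect
import statistics

# Breakpoints splitting the standard normal distribution into four
# equiprobable regions (quartiles, rounded to two decimals).
BREAKPOINTS = [-0.67, 0.0, 0.67]
ALPHABET = "abcd"


def sax(series, n_segments):
    """Convert a numeric series into a SAX word of length n_segments."""
    if len(series) % n_segments:
        raise ValueError("series length must be divisible by n_segments")
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    # z-normalize; a constant series maps to all zeros
    z = [(x - mean) / std if std else 0.0 for x in series]
    seg_len = len(z) // n_segments
    # piecewise aggregate approximation: mean of each equal-length segment
    paa = [
        statistics.fmean(z[i * seg_len:(i + 1) * seg_len])
        for i in range(n_segments)
    ]
    # map each segment mean to the symbol of the region it falls into
    return "".join(ALPHABET[bisect.bisect_right(BREAKPOINTS, v)] for v in paa)


rising = sax([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4)
```

Because only segment means are kept, the hypothetical series [0, 4, 10, 14] and [2, 2, 12, 12] produce the same two-symbol word even though their within-segment spread differs, which is the limitation the abstract attributes to SAX.
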
18.
EPJ Data Sci ; 11(1): 41, 2022.
Article in English | MEDLINE | ID: mdl-35873664

ABSTRACT

Sustainability in tourism is a topic of global relevance, finding multiple mentions in the United Nations Sustainable Development Goals. The complex task of balancing tourism's economic, environmental, and social effects requires detailed and up-to-date data. This paper investigates whether online platform data can be employed as an alternative data source in sustainable tourism statistics. Using a web-scraped dataset from a large online tourism platform, a sustainability label for accommodations can be predicted reasonably well with machine learning techniques. The algorithmic prediction of accommodations' sustainability using online data can provide a cost-effective and accurate measure that makes it possible to track developments in tourism sustainability across the globe with high spatial and temporal granularity. Supplementary Information: The online version contains supplementary material available at 10.1140/epjds/s13688-022-00354-6.

19.
Comput Biol Chem ; 98: 107646, 2022 Jun.
Article in English | MEDLINE | ID: mdl-35240419

ABSTRACT

Imbalanced data classification is a fundamental problem in data mining. Researchers have proposed many solutions, such as sampling and ensemble learning methods. However, random under-sampling tends to lose representative samples, and ensemble learning does not use the correlation information between pieces in the data set. Therefore, we propose a Hybrid Adaptive sampling with Bagging Classifier (HABC). Specifically, we calculated the adaptive sampling rate according to the characteristics of the data set. We then performed density-based under-sampling and over-sampling on the original data set according to the sampling rate. Further, the sampled data subset was sent to the Bagging classifier, and the classifier was employed to predict the unknown data set. In addition, the multi-objective particle swarm optimization algorithm was combined to optimize the prediction result. Extensive experiments based on UCI, KEEL, and three bioinformatics datasets show that our proposed method is better than state-of-the-art algorithms.


Subject(s)
Algorithms , Computational Biology , Biological Evolution , Computational Biology/methods , Data Mining/methods
20.
Int J Neural Syst ; 32(6): 2250017, 2022 Jun.
Article in English | MEDLINE | ID: mdl-35306966

ABSTRACT

Automatic epilepsy detection is of great significance for the diagnosis and treatment of patients. Most detection methods are based on patient-specific models and have achieved good results. However, in practice, new patients do not have their own previous EEG data and therefore cannot be initially diagnosed. If the EEG data of other patients can be used to achieve cross-patient detection, and cross-patient and patient-specific experiments can be combined at the same time, this method will be more widely used. In this work, an EEG classification model based on a self-organizing fuzzy logic (SOF) classifier is proposed for both cross-patient and patient-specific seizure detection. After preprocessing, the features of the original EEG signal are extracted and sent to the SOF classifier. This classification model is free from predefined parameters or a prior assumption regarding the EEG data generation model and only stores the key meta-parameters in memory. Therefore, it is very suitable for large-scale EEG signals in cross-patient detection. Selecting different granularity and classification distance in two different experiments after post-processing will achieve the best results. Experiments were conducted using a long-term continuous scalp EEG database and the [Formula: see text]-mean of cross-patient and patient-specific detection reached 83.35% and 92.04%, respectively. A comparison with other methods shows that there is greater performance and generalizability with this method.


Subject(s)
Fuzzy Logic , Signal Processing, Computer-Assisted , Algorithms , Electroencephalography/methods , Humans , Seizures/diagnosis