ABSTRACT
Peptide- and protein-based therapeutics are becoming a promising treatment regimen for myriad diseases. Toxicity of proteins is the primary hurdle for protein-based therapies; thus, there is an urgent need for accurate in silico methods for identifying toxic proteins to filter the pool of potential candidates. At the same time, it is imperative to precisely identify non-toxic proteins to expand the possibilities for protein-based biologics. To address this challenge, we propose an ensemble framework, called VISH-Pred, comprising models built by fine-tuning ESM2 transformer models on a large, experimentally validated, curated dataset of protein and peptide toxicities. The primary steps in the VISH-Pred framework are to efficiently estimate protein toxicity taking just the protein sequence as input, employing an undersampling technique to handle the severe class imbalance in the data, and learning representations from fine-tuned ESM2 protein language models, which are then fed to machine learning models such as LightGBM and XGBoost. The VISH-Pred framework correctly identifies both peptides/proteins with potential toxicity and non-toxic proteins, achieving Matthews correlation coefficients of 0.737, 0.716 and 0.322 and F1-scores of 0.759, 0.696 and 0.713 on three non-redundant blind tests, respectively, outperforming other methods by over 10% on these quality metrics. Moreover, VISH-Pred achieved the best accuracy and area under the receiver operating characteristic curve scores on these independent test sets, highlighting the robustness and generalization capability of the framework. By making VISH-Pred available as an easy-to-use web server, we expect it to serve as a valuable asset for future endeavors aimed at discerning the toxicity of peptides and enabling efficient protein-based therapeutics.
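The undersampling step described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the `undersample` helper, its arguments, and the fixed seed are assumptions for demonstration, and the real pipeline would then embed the balanced sequences with fine-tuned ESM2 before training LightGBM/XGBoost.

```python
import random

def undersample(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes are the
    same size -- a minimal sketch of the class-balancing step; the choice
    of a random seed and the equal-size target are illustrative."""
    rng = random.Random(seed)
    pos = [x for x, y in zip(samples, labels) if y == 1]
    neg = [x for x, y in zip(samples, labels) if y == 0]
    if len(neg) > len(pos):
        neg = rng.sample(neg, len(pos))
    elif len(pos) > len(neg):
        pos = rng.sample(pos, len(neg))
    data = [(x, 1) for x in pos] + [(x, 0) for x in neg]
    rng.shuffle(data)  # avoid a class-ordered dataset
    return data
```

For a 10-toxic/90-non-toxic pool this yields a balanced 20-sample set, which is then featurized downstream.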
Subjects
Proteins , Proteins/metabolism , Proteins/chemistry , Machine Learning , Protein Databases , Computational Biology/methods , Humans , Peptides/toxicity , Peptides/chemistry , Computer Simulation , Algorithms , Software
ABSTRACT
Peptide hormones serve as genome-encoded signal transduction molecules that play essential roles in multicellular organisms, and their dysregulation can lead to various health problems. In this study, we propose a method for predicting hormonal peptides with high accuracy. The dataset used for training, testing, and evaluating our models consisted of 1174 hormonal and 1174 non-hormonal peptide sequences. Initially, we developed similarity-based methods utilizing BLAST and MERCI software. Although these similarity-based methods provided a high probability of correct prediction, they had limitations, such as no hits or prediction of limited sequences. To overcome these limitations, we further developed machine and deep learning-based models. Our logistic regression-based model achieved a maximum AUROC of 0.93 with an accuracy of 86% on an independent/validation dataset. To harness the power of similarity-based and machine learning-based models, we developed an ensemble method that achieved an AUROC of 0.96 with an accuracy of 89.79% and a Matthews correlation coefficient (MCC) of 0.8 on the validation set. To facilitate researchers in predicting and designing hormone peptides, we developed a web-based server called HOPPred. This server offers a unique feature that allows the identification of hormone-associated motifs within hormone peptides. The server can be accessed at: https://webs.iiitd.edu.in/raghava/hoppred/.
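The similarity-plus-ML ensemble idea can be illustrated with a toy scoring rule. Everything below is hypothetical: the `hybrid_score` function, the fixed 0.5 bonus, and the clamping to [0, 1] are illustrative assumptions, not the published HOPPred weighting.

```python
def hybrid_score(ml_prob, blast_hit, motif_hit, bonus=0.5):
    """Combine an ML probability with similarity evidence: add a bonus
    for a BLAST hit against hormonal peptides ('+'), subtract it for a
    hit against non-hormonal ones ('-'), and add it when a motif is
    found. Bonus magnitude is an assumed value for illustration."""
    score = ml_prob
    if blast_hit == "+":
        score += bonus
    elif blast_hit == "-":
        score -= bonus
    if motif_hit:
        score += bonus
    return max(0.0, min(1.0, score))  # clamp to a valid probability range
```

A sequence with no BLAST hit and no motif simply keeps its ML probability, which mirrors how hybrid methods fall back to the model when similarity search returns nothing.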
Subjects
Machine Learning , Peptide Hormones , Software , Peptide Hormones/chemistry , Humans , Computational Biology/methods , Protein Databases
ABSTRACT
Prediction of antifreeze proteins (AFPs) holds significant importance due to their diverse applications in healthcare. An inherent limitation of current AFP prediction methods is their reliance on unreviewed proteins for evaluation. This study evaluates proposed and existing methods on an independent dataset containing 80 AFPs and 73 non-AFPs obtained from UniProt, all of which have already been reviewed by experts. Initially, we constructed machine learning models for AFP prediction using selected composition-based protein features and achieved a peak AUROC of 0.90 with an MCC of 0.69 on the independent dataset. Subsequently, we observed a notable enhancement in model performance, with the AUROC increasing from 0.90 to 0.93, upon incorporating evolutionary information instead of relying solely on the primary sequence of proteins. Furthermore, we explored hybrid models integrating our machine learning approaches with BLAST-based similarity and motif-based methods. However, the performance of these hybrid models either matched or was inferior to that of our best machine learning model. Our best model, based on evolutionary information, outperforms all existing methods on the independent/validation dataset. To facilitate users, a user-friendly web server with a standalone package named "AFPropred" was developed (https://webs.iiitd.edu.in/raghava/afpropred).
ABSTRACT
Breast cancer remains a major public health challenge worldwide. The identification of accurate biomarkers is critical for the early detection and effective treatment of breast cancer. This study utilizes an integrative machine learning approach to analyze breast cancer gene expression data for superior biomarker and drug target discovery. Gene expression datasets, obtained from the GEO database, were merged after preprocessing. Differential expression analysis between breast cancer and normal samples in the merged dataset revealed 164 differentially expressed genes (DEGs), while a separate gene expression dataset (GSE45827) revealed 350. Additionally, the BGWO_SA_Ens algorithm, which integrates binary grey wolf optimization and simulated annealing with an ensemble classifier, was applied to the gene expression datasets to identify predictive genes including TOP2A, AKR1C3, EZH2, MMP1, EDNRB, S100B, and SPP1. From over 10,000 genes, BGWO_SA_Ens identified 1404 in the merged dataset (F1 score: 0.981, PR-AUC: 0.998, ROC-AUC: 0.995) and 1710 in the GSE45827 dataset (F1 score: 0.965, PR-AUC: 0.986, ROC-AUC: 0.972). The intersection of the DEGs and the BGWO_SA_Ens-selected genes revealed 35 superior genes that were consistently significant across methods. Enrichment analyses uncovered the involvement of these superior genes in key pathways such as AMPK, adipocytokine, and PPAR signaling. Protein-protein interaction network analysis highlighted subnetworks and central nodes. Finally, a drug-gene interaction investigation revealed connections between the superior genes and anticancer drugs. Collectively, the machine learning workflow identified a robust gene signature for breast cancer, illuminated its biological roles, interactions, and therapeutic associations, and underscored the potential of computational approaches in biomarker discovery and precision oncology.
Subjects
Tumor Biomarkers , Breast Neoplasms , Humans , Female , Tumor Biomarkers/genetics , Precision Medicine , Algorithms , Drug Delivery Systems , Breast Neoplasms/drug therapy , Breast Neoplasms/genetics
ABSTRACT
Accurate exposure assessment is important for conducting PM10-2.5-related epidemiological studies, which have been limited thus far. In this study, we aimed to develop an ensemble machine learning method to estimate PM10-2.5 concentrations in mainland China during 2013-2020. The study was conducted in two stages. In the first stage, we developed two methods: the indirect method refers to developing models for PM2.5 and PM10 separately and subsequently calculating PM10-2.5 as the difference between them; and the direct method refers to establishing a model between PM10-2.5 measurements and relevant predictors directly. In the second stage, we employed an ensemble method by integrating predictions from both indirect and direct methods. Internal and external cross-validation (CV) were performed to validate the extrapolation capacity of models. The ensemble method demonstrated enhanced extrapolation accuracy in both internal and external CV compared to indirect and direct methods. The predictions produced by the ensemble method captured the spatiotemporal pattern of PM10-2.5, even in the sand and dust storm seasons. Our study introduces an ensemble strategy leveraging the strengths of both indirect and direct methods to estimate PM10-2.5 concentrations, which holds significant potential to support future epidemiological studies to address knowledge gaps in understanding the health effects of PM10-2.5.
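The two-stage combination above can be written as a one-line weighted average. The function name and the 0.5 weight are illustrative assumptions; the study tunes the combination against cross-validation rather than fixing a weight.

```python
def ensemble_pm_coarse(pm10_pred, pm25_pred, direct_pred, w_indirect=0.5):
    """Combine the indirect PM10-2.5 estimate (difference of separate
    PM10 and PM2.5 model predictions) with the direct model's prediction
    via a weighted average. The weight is an assumed placeholder."""
    indirect = pm10_pred - pm25_pred
    return w_indirect * indirect + (1 - w_indirect) * direct_pred
```

With a PM10 prediction of 80 ug/m3, a PM2.5 prediction of 30 ug/m3, and a direct prediction of 46 ug/m3, equal weighting gives 48 ug/m3.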
Subjects
Machine Learning , Particulate Matter , China , Air Pollutants , Environmental Monitoring/methods , Theoretical Models
ABSTRACT
Physiotherapy plays a crucial role in the rehabilitation of damaged or defective organs due to injuries or illnesses, often requiring long-term supervision by a physiotherapist in clinical settings or at home. AI-based support systems have been developed to enhance the precision and effectiveness of physiotherapy, particularly during the COVID-19 pandemic. These systems, which include game-based or tele-rehabilitation monitoring using camera-based optical systems such as Vicon and Microsoft Kinect, face challenges such as privacy concerns, occlusion, and sensitivity to environmental light. Non-optical sensor alternatives, such as Inertial Measurement Units (IMUs), Wi-Fi, ultrasound sensors, and ultra-wideband (UWB) radar, have emerged to address these issues. Although IMUs are portable and cost-effective, they suffer from drawbacks such as drift over time, limited range, and susceptibility to magnetic interference. In this study, a single UWB radar was used to recognize five therapeutic upper-limb exercises performed by 34 male volunteers in a real environment. A novel feature fusion approach was developed to extract distinguishing features for these exercises. Various machine learning methods were applied, with the EnsembleRRGraBoost ensemble method achieving the highest recognition accuracy of 99.45%. The performance of the EnsembleRRGraBoost model was further validated using five-fold cross-validation, maintaining its high accuracy.
Subjects
COVID-19 , Machine Learning , Radar , Humans , Male , SARS-CoV-2 , Exercise Therapy/methods , Algorithms , Adult
ABSTRACT
The healthcare industry has been reshaped by the Internet of Medical Things (IoMT), which transmits patient data from diverse devices to healthcare staff and on to cloud-based servers for analysis, yielding efficient and accurate diagnoses. However, IoMT technology is accompanied by security risks and vulnerabilities, such as the violation and exposure of patients' sensitive and confidential data. Further, because communication is wireless, network traffic is prone to interception and data-alteration attacks, which could cause unwanted outcomes. The advocated scheme provides a robust Intrusion Detection System (IDS) for IoMT networks. It leverages a honeypot to divert attackers away from critical systems, reducing the attack surface. Additionally, the IDS employs an ensemble method combining Logistic Regression and K-Nearest Neighbor algorithms, harnessing the strengths of both to improve attack detection accuracy and robustness. This work analyzes the impact, performance, accuracy, and precision of the model on two IoMT-related datasets containing multiple attack types such as Man-In-The-Middle (MITM), Data Injection, and Distributed Denial of Service (DDoS). The results showed that the proposed ensemble method was effective in detecting intrusion attempts and classifying them as attacks or normal network traffic, with a high accuracy of 92.5% for the first dataset and 99.54% for the second, and a precision of 96.74% for the first dataset and 99.228% for the second.
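A Logistic Regression + K-Nearest Neighbor ensemble of the kind described can be sketched with scikit-learn's soft-voting combiner. The synthetic two-cluster data below stands in for real IoMT traffic features, and the use of `VotingClassifier` with these two estimators is an assumption about the combination strategy, not the paper's exact implementation.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Toy stand-in for IoMT traffic features: two well-separated clusters
# (class 0 = normal traffic, class 1 = attack traffic).
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(5, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)

ids = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("knn", KNeighborsClassifier(n_neighbors=5))],
    voting="soft",  # average class probabilities from both models
)
ids.fit(X, y)
accuracy = ids.score(X, y)
```

Soft voting averages the two models' predicted probabilities, so a confident KNN neighborhood can outvote an uncertain linear boundary and vice versa.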
Subjects
Algorithms , Computer Security , Delivery of Health Care , Internet of Things , Humans , Wireless Technology , Cloud Computing , Confidentiality
ABSTRACT
Advancements in molecular biology have revolutionized our understanding of complex diseases, with Alzheimer's disease being a prime example. Single-cell sequencing, currently the most suitable technology, facilitates profoundly detailed disease analysis at the cellular level. Prior research has established that the pathology of Alzheimer's disease varies across different brain regions and cell types. In parallel, only machine learning has the capacity to address the myriad challenges presented by such studies, where the integration of large-scale data and numerous experiments is required to extract meaningful knowledge. Our methodology utilizes single-cell RNA sequencing data from healthy and Alzheimer's disease (AD) samples, focused on the cortex and hippocampus regions in mice. We designed three distinct case studies and implemented an ensemble feature selection approach through machine learning, also performing an analysis of distinct age-related datasets to unravel age-specific effects, showing differential gene expression patterns within each condition. Important evidence was reported, such as enrichment in central nervous system development and regulation of oligodendrocyte differentiation between the hippocampus and cortex of 6-month-old AD mice as well as regulation of epinephrine secretion and dendritic spine morphogenesis in 15-month-old AD mice. Our outcomes from all three of our case studies illustrate the capacity of machine learning strategies when applied to single-cell data, revealing critical insights into Alzheimer's disease.
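Ensemble feature selection of the kind applied here is often implemented as voting across individual selectors. The sketch below is a generic illustration under assumed thresholds (`top_k`, `min_votes`); it is not the study's specific pipeline.

```python
from collections import Counter

def ensemble_select(rankings, top_k=3, min_votes=2):
    """Ensemble feature selection by voting: keep a gene if it appears
    in the top-k of at least `min_votes` of the individual selectors'
    rankings. Both thresholds are illustrative assumptions."""
    votes = Counter(g for ranking in rankings for g in ranking[:top_k])
    return sorted(g for g, v in votes.items() if v >= min_votes)
```

With three selectors each ranking genes differently, only genes that multiple selectors agree on survive, which is the stabilizing effect ensemble selection aims for on noisy single-cell data.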
ABSTRACT
Computational methods based on whole-genome linked-reads and short reads have been successful in genome assembly and detection of structural variants (SVs). Numerous variant callers that rely on linked-read and short-read data can detect genetic variations, including SVs. A shortcoming of existing tools is a propensity to overestimate SVs, especially deletions. Exploiting the advantages of linked-read and short-read sequencing technologies would thus benefit from an additional step to effectively identify and eliminate false positive large deletions. Here, we introduce a novel tool, AquilaDeepFilter, which automatically filters genome-wide false positive large deletions. Our approach transforms sequencing data into an image and then relies on convolutional neural networks to improve the classification of candidate deletions. The input data take into account multiple alignment signals, including read depth, split reads and discordant read pairs. We tested the performance of AquilaDeepFilter on five linked-read and short-read libraries sequenced from the well-studied NA24385 sample, validated against the Genome in a Bottle benchmark. To demonstrate the filtering ability of AquilaDeepFilter, we used the SV calls from three upstream SV detection tools, Aquila, Aquila_stLFR and Delly, as the baseline. We showed that AquilaDeepFilter increased precision while preserving the recall of all three tools. The overall F1-score improved by an average of 20% on linked-read data and by an average of 15% on short-read data. AquilaDeepFilter also compared favorably to existing deep learning-based methods for SV filtering, such as DeepSVFilter. AquilaDeepFilter is thus an effective SV refinement framework that can improve SV calling for both linked-read and short-read data.
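The "sequencing data as an image" encoding can be sketched by stacking per-base alignment signals into channels. The function name, the fixed-height broadcast, and the max-normalization are all assumptions made for illustration; the published encoding may differ in detail.

```python
import numpy as np

def deletion_to_image(depth, split, discordant, height=32):
    """Stack three per-base alignment signals (read depth, split reads,
    discordant pairs) over a candidate deletion into a 3-channel image
    suitable for a CNN classifier. Height and normalization are assumed."""
    channels = []
    for signal in (depth, split, discordant):
        signal = np.asarray(signal, dtype=float)
        signal = signal / (signal.max() + 1e-9)     # scale each channel to [0, 1)
        channels.append(np.tile(signal, (height, 1)))  # broadcast 1-D track to 2-D
    return np.stack(channels, axis=-1)               # (height, width, 3)
```

A CNN then sees the deletion candidate as an image whose channel patterns (e.g. a depth drop coinciding with split-read spikes) distinguish true deletions from false positives.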
Subjects
Deep Learning , Human Genome , Base Sequence , High-Throughput Nucleotide Sequencing/methods , Humans , Sequence Analysis , DNA Sequence Analysis/methods
ABSTRACT
This paper discusses a semantic segmentation framework and shows its application in agricultural intelligence, such as providing environmental awareness for agricultural robots to work autonomously and efficiently. We propose an ensemble framework based on the bagging strategy and the UNet network, using RGB and HSV color spaces. We evaluated the framework on our self-built dataset (Maize) and a public dataset (Sugar Beets). Then, we compared it with UNet-based methods (single RGB and single HSV), DeepLab V3+, and SegNet. Experimental results show that our ensemble framework can synthesize the advantages of each color space and obtain the best IoUs (0.8276 and 0.6972) on the datasets (Maize and Sugar Beets), respectively. In addition, including our framework, the UNet-based methods have faster speed and a smaller parameter space than DeepLab V3+ and SegNet, which are more suitable for deployment in resource-constrained environments such as mobile robots.
ABSTRACT
In terms of electric vehicles (EVs), electric kickboards are crucial elements of smart transportation networks for short-distance travel that is risk-free, economical, and environmentally friendly. Forecasting daily demand can improve the local service provider's access to information and help them manage their short-term supply more effectively. This study developed a forecasting model using real-time data and weather information from Jeju Island, South Korea. Cluster analysis of electric kickboard rental patterns is a component of the forecasting process. Initially, we could not achieve satisfactory results because of the small amount of training data, since a solid prediction result requires far more data. For the subsequent experiments, we therefore created synthetic time-series data using a generative adversarial network (GAN) approach and combined the synthetic data with the original data. The outcomes show that GAN-based synthetic data generation can enhance prediction accuracy. We employ an ensemble model, a weighted combination of several base regression models into one meta-regressor, to improve on the results achievable with any single regressor. To anticipate daily demand, we build the ensemble by merging three separate base machine learning algorithms: CatBoost, Random Forest (RF), and Extreme Gradient Boosting (XGBoost). The effectiveness of the suggested strategies was assessed using several evaluation indicators. The forecasting outcomes demonstrate that mixing synthetic data with the original data improves the robustness of daily demand forecasting and outperforms other models on the suggested assessment measures. The outcomes further show that applying ensemble techniques can reasonably increase the forecasting model's accuracy for daily electric kickboard demand.
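A weighted-average regression ensemble of the kind described can be sketched as follows. To keep the sketch self-contained, scikit-learn regressors stand in for CatBoost/XGBoost, the data is synthetic, and the weights are assumed rather than tuned.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (300, 3))                  # stand-in for weather/rental features
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 0.1, 300)  # stand-in daily demand

# Base regressors: gradient boosting and ridge stand in for CatBoost/XGBoost.
bases = [RandomForestRegressor(n_estimators=50, random_state=0),
         GradientBoostingRegressor(random_state=0),
         Ridge()]
for model in bases:
    model.fit(X, y)

weights = np.array([0.4, 0.4, 0.2])              # illustrative, not tuned
pred = sum(w * m.predict(X) for w, m in zip(weights, bases))
rmse = float(np.sqrt(np.mean((pred - y) ** 2)))
```

In the study the weights belong to a meta-regressor fitted on held-out predictions; fixing them here simply demonstrates the combination mechanics.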
Subjects
Algorithms , Neural Networks (Computer) , Weather , Forecasting , Random Forest Algorithm
ABSTRACT
BACKGROUND: Injuries caused by road traffic accidents (RTAs) are classified under the International Classification of Diseases-10 as 'S00-T99' and represent imbalanced samples, with a mortality rate of only 1.2% among all RTA victims. To predict the characteristics of external causes of RTA injuries and mortality, we compared performance across different correction and classification techniques for imbalanced samples. METHODS: The present study extracted and utilized data spanning a 5-year period (2013-2017) from the Korean National Hospital Discharge In-depth Injury Survey (KNHDS), a national-level survey conducted by the Korea Disease Control and Prevention Agency. A total of eight variables were used in the prediction, covering patient, accident, and injury/disease characteristics. As the data were imbalanced, a sample consisting of only severe injuries was constructed and compared against the total sample. Preprocessing was performed in view of the characteristics of the samples: because they contained many variables with different units, the samples were standardized first. Among ensemble classification techniques, the present study utilized Random Forest, Extra-Trees, and XGBoost. Four different over- and under-sampling techniques were used to compare the performance of the algorithms on accuracy, precision, recall, F1, and MCC. RESULTS: Among the prediction techniques, XGBoost had the best performance. While the synthetic minority oversampling technique (SMOTE), a type of oversampling, also demonstrated a certain level of performance, under-sampling was the most effective. Overall, prediction by the XGBoost model on samples using SMOTE produced the best results. CONCLUSION: This study presented an empirical comparison of the validity of sampling techniques and classification algorithms that affect the accuracy of models on imbalanced samples by combining the two kinds of techniques.
The findings could be used as reference data in classification analyses of imbalanced data in the medical field.
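The core of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the reference implementation (which the study would have used via a standard library).

```python
import numpy as np

def smote_like(minority, n_new, k=3, seed=0):
    """Generate synthetic minority-class samples by linear interpolation
    between a random minority point and one of its k nearest minority
    neighbours -- the central SMOTE mechanism, simplified."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        dists = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                        # interpolation fraction
        synthetic.append(x + lam * (minority[j] - x))
    return np.array(synthetic)
```

Because each synthetic point is a convex combination of two real minority points, the new samples stay inside the minority class's region of feature space rather than duplicating existing rows.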
Subjects
Traffic Accidents , Algorithms , Humans , Republic of Korea/epidemiology
ABSTRACT
This paper introduces a novel approach to increase the spatiotemporal resolution of an arbitrary environmental variable. This is achieved by utilizing machine learning algorithms to construct a satellite-like image at any given time moment, based on the measurements from IoT sensors. The target variables are calculated by an ensemble of regression models. The observed area is gridded and partitioned into Voronoi cells based on the IoT sensors whose measurements are available at the considered time. The pixels in each cell have a separate regression model that takes into account the measurements of the central and neighboring IoT sensors. The proposed approach was used to assess NO2 data obtained from the Sentinel-5 Precursor satellite and IoT ground sensors. The approach was tested with three different machine learning algorithms: 1-nearest neighbor, linear regression, and a feed-forward neural network. The highest accuracy was achieved by the prediction models built with the feed-forward neural network, with an RMSE of 15.49 × 10⁻⁶ mol/m².
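The Voronoi gridding step amounts to assigning each grid pixel to its nearest available sensor. A minimal sketch, assuming Euclidean distance and planar coordinates:

```python
import numpy as np

def voronoi_cells(grid_points, sensor_points):
    """Assign each grid pixel to its nearest IoT sensor, i.e. a Voronoi
    partition of the gridded area under the Euclidean metric. Per-cell
    regression models would then be trained separately (not shown)."""
    # Pairwise distances: (n_pixels, n_sensors)
    d = np.linalg.norm(grid_points[:, None, :] - sensor_points[None, :, :], axis=2)
    return d.argmin(axis=1)
```

Each returned index identifies the sensor whose cell the pixel falls in, which determines the regression model (and the neighboring sensors) used for that pixel.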
Subjects
Machine Learning , Nitrogen Dioxide , Algorithms , Linear Models , Neural Networks (Computer)
ABSTRACT
Posttranscriptional crosstalk and communication between RNAs yield large regulatory competing endogenous RNA (ceRNA) networks via shared microRNAs (miRNAs), as well as miRNA synergistic networks. The ceRNA crosstalk represents a novel layer of gene regulation that controls both physiological and pathological processes such as development and complex diseases. The rapidly expanding catalogue of ceRNA regulation has provided evidence for exploitation as a general model to predict the ceRNAs in silico. In this article, we first reviewed the current progress of RNA-RNA crosstalk in human complex diseases. Then, the widely used computational methods for modeling ceRNA-ceRNA interaction networks are further summarized into five types: two types of global ceRNA regulation prediction methods and three types of context-specific prediction methods, which are based on miRNA-messenger RNA regulation alone, or by integrating heterogeneous data, respectively. To provide guidance in the computational prediction of ceRNA-ceRNA interactions, we finally performed a comparative study of different combinations of miRNA-target methods as well as five types of ceRNA identification methods by using literature-curated ceRNA regulation and gene perturbation. The results revealed that integration of different miRNA-target prediction methods and context-specific miRNA/gene expression profiles increased the performance for identifying ceRNA regulation. Moreover, different computational methods were complementary in identifying ceRNA regulation and captured different functional parts of similar pathways. We believe that the application of these computational techniques provides valuable functional insights into ceRNA regulation and is a crucial step for informing subsequent functional validation studies.
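The simplest shared-miRNA criterion behind global ceRNA prediction can be expressed as a set-overlap score. The Jaccard form below is an illustrative choice; published pipelines typically pair a shared-miRNA test (often hypergeometric) with expression correlation.

```python
def shared_mirna_score(mirnas_a, mirnas_b):
    """Jaccard overlap of the miRNA regulator sets of two RNAs: a basic
    shared-miRNA criterion for nominating candidate ceRNA pairs."""
    a, b = set(mirnas_a), set(mirnas_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```

Two transcripts regulated by largely the same miRNAs score near 1 and become ceRNA candidates; transcripts with disjoint regulator sets score 0 and are discarded.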
Subjects
Computational Biology/methods , MicroRNAs/genetics , Messenger RNA/genetics , Nucleic Acid Databases/statistics & numerical data , Disease/genetics , Gene Regulatory Networks , Humans , MicroRNAs/metabolism , Genetic Models , Neoplasms/genetics , Neoplasms/metabolism , PTEN Phosphohydrolase/genetics , Post-Transcriptional RNA Processing , Messenger RNA/metabolism
ABSTRACT
BACKGROUND: Vitiligo is an acquired pigmentary skin disorder characterized by depigmented macules and patches that brings many challenges for the patients who suffer from it. Several scoring methods based on morphometry and colorimetry have been proposed for vitiligo severity assessment, but all suffer from considerable inter- and intra-observer variation in estimating the depigmented area. All of these assessment methods require accurate segmentation of skin images for lesion detection and localization. Segmenting vitiligo skin lesions is challenging because of illumination variation, the different shapes and sizes of lesions, vague lesion boundaries, skin hairs, and vignette effects. Manual image segmentation is a tedious and time-consuming task; automatic segmentation methods for lesion detection are therefore required. MATERIALS AND METHODS: In this study, a novel unsupervised stack ensemble of deep and conventional image segmentation (SEDCIS) methods is proposed for localizing vitiligo lesions in skin images. Unsupervised segmentation methods do not require prior manual segmentation of vitiligo lesions, which is tedious and time-consuming and subject to intra- and inter-observer variation. RESULTS: Our collected dataset includes 877 images with a resolution of 5760 × 3840 pixels taken from 21 patients suffering from vitiligo. Experimental results show that SEDCIS outperforms the compared methods with an accuracy of 97%, sensitivity of 98%, specificity of 96%, area overlap of 94%, and Dice index of 97%. CONCLUSION: The proposed method can segment vitiligo lesions with highly reasonable performance and can be used for assessing vitiligo lesion surface area.
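One common way to ensemble several segmentation methods is a pixel-wise majority vote over their binary masks. This is a generic sketch of that idea, assuming equal weighting of the stacked methods; it is not the published SEDCIS pipeline.

```python
import numpy as np

def stack_ensemble_masks(masks):
    """Pixel-wise majority vote over binary lesion masks produced by
    several segmentation methods: a pixel is labelled lesion when at
    least half of the methods agree. Equal weights are assumed."""
    stack = np.stack(masks)                     # (n_methods, H, W)
    return (stack.mean(axis=0) >= 0.5).astype(np.uint8)
```

Majority voting suppresses spurious regions that only a single method hallucinates (e.g. from hairs or vignetting) while keeping regions most methods agree are depigmented.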
Subjects
Pigmentation Disorders , Vitiligo , Humans , Computer-Assisted Image Processing , Research Design , Skin , Vitiligo/diagnostic imaging
ABSTRACT
BACKGROUND: Under the influence of chemotherapy regimens, clinical staging, immunologic expression and other factors, the survival rates of patients with diffuse large B-cell lymphoma (DLBCL) differ. Accurate prediction of mortality hazards is key to precision medicine, helping clinicians make optimal therapeutic decisions to extend the survival times of individual patients with DLBCL. We have therefore developed a model to predict the mortality hazard of DLBCL patients within 2 years of treatment. METHODS: We evaluated 406 patients with DLBCL and collected 17 variables from each patient. The predictive variables were selected by the Cox model, the logistic model and the random forest algorithm. Five classifiers were chosen as the base models for ensemble learning: naïve Bayes, logistic regression, random forest, support vector machine and feedforward neural network. We first calibrated the biased outputs from the five base models using probability calibration methods (shape-restricted polynomial regression, Platt scaling and isotonic regression). Then, we aggregated the outputs from the various base models to predict the 2-year mortality of DLBCL patients using three strategies (stacking, simple averaging and weighted averaging). Finally, we assessed model performance over 300 hold-out tests. RESULTS: Gender, stage, IPI, KPS and rituximab were significant factors for predicting the deaths of DLBCL patients within 2 years of treatment. The stacking model whose base models were first calibrated by shape-restricted polynomial regression performed best (AUC = 0.820, ECE = 8.983, MCE = 21.265) among all methods. In contrast, the performance of the stacking model without probability calibration was inferior (AUC = 0.806, ECE = 9.866, MCE = 24.850). In the simple averaging and weighted averaging models, the prediction error of the ensemble model also decreased with probability calibration.
CONCLUSIONS: Among all the methods compared, the proposed model has the lowest prediction error when predicting the 2-year mortality of DLBCL patients. These promising results may indicate that our modeling strategy of applying probability calibration to ensemble learning is successful.
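The calibrate-then-aggregate pattern can be sketched with scikit-learn: calibrate each base classifier (Platt scaling corresponds to `method="sigmoid"`), then simple-average the calibrated probabilities. The synthetic data and the three base models are stand-ins, not the DLBCL cohort or the paper's exact model set, and shape-restricted polynomial calibration is not available in scikit-learn, so Platt scaling is shown instead.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Synthetic two-class data standing in for patient features.
X = np.vstack([rng.normal(0, 1, (150, 5)), rng.normal(2, 1, (150, 5))])
y = np.array([0] * 150 + [1] * 150)

bases = [GaussianNB(), LogisticRegression(max_iter=1000),
         RandomForestClassifier(n_estimators=50, random_state=0)]
probs = []
for model in bases:
    # Platt scaling: fit a sigmoid on cross-validated base-model scores.
    calibrated = CalibratedClassifierCV(model, method="sigmoid", cv=5)
    calibrated.fit(X, y)
    probs.append(calibrated.predict_proba(X)[:, 1])

avg = np.mean(probs, axis=0)          # simple averaging of calibrated outputs
acc = float(np.mean((avg > 0.5) == y))
```

Calibrating before averaging matters because averaging raw, differently-biased scores (e.g. naïve Bayes's overconfident probabilities) mixes incomparable quantities.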
Subjects
Diffuse Large B-Cell Lymphoma , Bayes Theorem , Calibration , Humans , Logistic Models , Diffuse Large B-Cell Lymphoma/drug therapy , Prognosis
ABSTRACT
With rapid urbanization, awareness of environmental pollution is growing rapidly and, accordingly, interest in environmental sensors that measure atmospheric and indoor air quality is increasing. Because these IoT-based environmental sensors are sensitive and reliability matters, it is essential to deal with missing values, which are one of the causes of reliability problems. Two characteristics can be used to impute missing values in environmental sensor data: the time dependency of a single variable and the correlation between multiple variables. Existing imputation methods, however, have used only one of these characteristics; no prior work has used both. In this work, we introduce a new ensemble imputation method that reflects both. First, the cases in which missing values frequently occur were divided into four types and generated as experimental data: communication errors (aperiodic, periodic) and sensor errors (rapid change, measurement range). To compare existing methods with the proposed method, five widely used univariate imputation methods and five widely used multivariate imputation methods were each used as single models to predict missing values for the four cases. The values predicted by the single models were then fed to the ensemble method. Among ensemble methods, weighted averaging and stacking were used to derive the final predicted values and replace the missing values. Finally, the predicted values, substituted into the original data, were evaluated by comparing the mean absolute error (MAE) and the root mean square error (RMSE). The proposed ensemble method generally performed better than the single methods. In addition, this method simultaneously considers the correlation between variables and time dependence, which are characteristics that must be considered for environmental sensors.
As a result, our proposed ensemble technique can contribute to the replacement of the missing values generated by environmental sensors, which can help to increase the reliability of environmental sensor data.
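The weighted-averaging variant of the approach can be sketched in miniature: one estimate from the time dependency of the gap's own series, one from a correlated co-located variable, combined by an assumed weight. The function and the 0.5 weight are illustrative, not the paper's tuned ensemble.

```python
import numpy as np

def ensemble_impute(series, correlated, idx, w_time=0.5):
    """Fill series[idx] by averaging a univariate estimate (linear
    interpolation between the neighbouring time points) with a
    multivariate estimate (the value of a correlated sensor variable
    at the same time). The weight is an assumed placeholder."""
    t_est = (series[idx - 1] + series[idx + 1]) / 2.0  # time dependency
    m_est = correlated[idx]                            # cross-variable correlation
    return w_time * t_est + (1 - w_time) * m_est
```

A full implementation would produce one estimate per single model (five univariate, five multivariate) and learn the combination weights or a stacking meta-model.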
Subjects
Research Design , Reproducibility of Results
ABSTRACT
Acting as a virtual sensor network for monitoring household appliance energy use, non-intrusive load monitoring is emerging as the technical basis for refined electricity analysis as well as home energy management. Aiming for robust and reliable monitoring, ensemble approaches have long been anticipated for load disaggregation, but the obstacles of design difficulty and computational inefficiency remain. To address this, an ensemble design integrating multi-heterogeneity is proposed for non-intrusive energy use disaggregation in this paper. Firstly, the idea of utilizing a heterogeneous design is presented, and the corresponding ensemble framework for load disaggregation is established. Then, a sparse coding model is allocated to the individual classifiers, and the combined classifier is diversified by introducing different distance and similarity measures without consideration of sparsity, forming mutually heterogeneous classifiers. Lastly, a multiple-evaluations-based decision process is fine-tuned following the interactions of multi-heterogeneous committees and finally deployed as the decision maker. Through verification on both a low-voltage network simulator and a field measurement dataset, the proposed approach is demonstrated to robustly enhance load disaggregation performance. By appropriately introducing the heterogeneous design into the ensemble approach, load monitoring improvements are observed with reduced computational burden, which stimulates research interest in investigating valid ensemble strategies for practical non-intrusive load monitoring implementations.
ABSTRACT
As a pivotal technological foundation for smart home implementation, non-intrusive load monitoring is emerging as a widely recognized technology to replace sensor or socket networks for detailed household appliance monitoring. In this paper, a probability-model-framed ensemble method is proposed with the target of robust appliance monitoring. Firstly, the non-intrusive load disaggregation-oriented ensemble architecture is presented. Then, a dictionary learning model is used to formulate each individual classifier, while the sparse coding-based approach provides multiple candidate solutions under a greedy mechanism. Furthermore, a fully probabilistic model is established for the combined classifier, where the candidate solutions are all labelled with probability scores and evaluated via two-stage decision-making. The proposed method is tested on both a low-voltage network simulator platform and field measurement datasets, and the results show that the proposed ensemble method consistently enhances the performance of non-intrusive load disaggregation. Besides, the proposed approach shows high flexibility and scalability in classification model selection. By initializing the architecture and approach of ensemble-method-based NILM, this work plays a pioneering role in using ensemble methods to improve the robustness and reliability of non-intrusive appliance monitoring.
Subjects
Reproducibility of Results , Probability
ABSTRACT
The early detection of melanoma is the most efficient way to reduce its mortality rate. Dermatologists achieve this task with the help of dermoscopy, a non-invasive tool allowing the visualization of patterns of skin lesions. Computer-aided diagnosis (CAD) systems developed on dermoscopic images are needed to assist dermatologists. These systems rely mainly on multiclass classification approaches. However, the multiclass classification of skin lesions by an automated system remains a challenging task. Decomposing a multiclass problem into a binary problem can reduce the complexity of the initial problem and increase the overall performance. This paper proposes a CAD system to classify dermoscopic images into three diagnosis classes: melanoma, nevi, and seborrheic keratosis. We introduce a novel ensemble scheme of convolutional neural networks (CNNs), inspired by decomposition and ensemble methods, to improve the performance of the CAD system. Unlike conventional ensemble methods, we use a directed acyclic graph to aggregate binary CNNs for the melanoma detection task. On the ISIC 2018 public dataset, our method achieves the best balanced accuracy (76.6%) among multiclass CNNs, an ensemble of multiclass CNNs with classical aggregation methods, and other related works. Our results reveal that the directed acyclic graph is a meaningful approach to develop a reliable and robust automated diagnosis system for the multiclass classification of dermoscopic images.
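The directed acyclic graph aggregation can be sketched as a two-node decision routine over binary classifier outputs. The probabilities below stand in for binary CNN outputs, and the node ordering and 0.5 thresholds are illustrative assumptions.

```python
def dag_classify(p_mel_vs_rest, p_nev_vs_sk):
    """Route a sample through a two-node DAG of binary classifiers:
    node 1 decides melanoma vs. the rest; only non-melanoma samples
    reach node 2, which decides nevus vs. seborrheic keratosis."""
    if p_mel_vs_rest > 0.5:
        return "melanoma"
    return "nevus" if p_nev_vs_sk > 0.5 else "seborrheic keratosis"
```

Unlike one-vs-all voting, each binary model in the DAG only ever sees the subproblem it was trained for, which is the decomposition benefit the paper exploits.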