Results 1-20 of 66
1.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34585231

ABSTRACT

MOTIVATION: Discovering long noncoding RNA (lncRNA)-disease associations is a fundamental and critical part of understanding disease etiology and pathogenesis. However, only a few lncRNA-disease associations have been identified because the required biological experiments are time-consuming and expensive. As a result, an efficient computational method for identifying potential lncRNA-disease associations is of great importance and urgently needed. With their ability to exploit node features and relationships in a network, graph-based learning models have been widely applied to such biomolecular association prediction tasks. However, the capability of these methods to comprehensively fuse node features, heterogeneous topological structures and semantic information remains far from optimal, or even satisfactory. Moreover, there are still limitations in modeling complex associations between lncRNAs and diseases. RESULTS: In this paper, we develop a novel heterogeneous graph attention network framework based on meta-paths for predicting lncRNA-disease associations, denoted HGATLDA. First, we construct a heterogeneous network by incorporating the lncRNA and disease feature structural graphs and the lncRNA-disease topological structural graph. Then, from this heterogeneous graph, we construct multiple metapath-based subgraphs and use a graph attention network to learn node embeddings from the neighbors in these homogeneous and heterogeneous subgraphs. Next, we apply an attention mechanism to adaptively assign weights to the metapath-based subgraphs and capture richer semantic information. In addition, we incorporate neural inductive matrix completion to reconstruct lncRNA-disease associations, which captures complicated associations between lncRNAs and diseases. Moreover, we incorporate cost-sensitive learning into the loss function to tackle the common class-imbalance problem in lncRNA-disease association prediction. Finally, extensive experimental results demonstrate the effectiveness of our proposed framework.
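The semantic-level fusion step described above can be illustrated with a small sketch: node embeddings learned from several metapath-based subgraphs are combined through attention weights. This is a minimal NumPy sketch of the general idea only, not the authors' HGATLDA implementation; the projection matrix, scoring form and array shapes are assumptions.

```python
# Minimal sketch of attention-weighted fusion of metapath-based embeddings.
# Shapes, the shared projection W and the semantic attention vector q are
# illustrative assumptions, not HGATLDA's actual parameters.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_metapath_embeddings(Z_list, W, q):
    """Z_list: list of (n_nodes, d) embeddings, one per metapath-based subgraph."""
    scores = [np.tanh(Z @ W).mean(axis=0) @ q for Z in Z_list]  # one score per metapath
    beta = softmax(np.array(scores))                            # semantic attention weights
    fused = sum(b * Z for b, Z in zip(beta, Z_list))            # weighted combination
    return fused, beta

rng = np.random.default_rng(0)
n_nodes, d = 100, 16
Z_list = [rng.normal(size=(n_nodes, d)) for _ in range(3)]      # three metapath views
W, q = rng.normal(size=(d, d)), rng.normal(size=d)
fused, beta = fuse_metapath_embeddings(Z_list, W, q)
print(beta.round(3), fused.shape)
```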


Subject(s)
RNA, Long Noncoding; Computational Biology/methods; Neural Networks, Computer; RNA, Long Noncoding/genetics
2.
J Med Internet Res ; 25: e48244, 2023 12 22.
Article in English | MEDLINE | ID: mdl-38133922

ABSTRACT

BACKGROUND: Cardiac arrest (CA) is the leading cause of death in critically ill patients. Clinical research has shown that early identification of CA reduces mortality. Algorithms capable of predicting CA with high sensitivity have been developed using multivariate time series data. However, these algorithms suffer from a high rate of false alarms, and their results are not clinically interpretable. OBJECTIVE: We propose an ensemble approach using multiresolution statistical features and cosine similarity-based features for the timely prediction of CA. Furthermore, this approach provides clinically interpretable results that can be adopted by clinicians. METHODS: Patients were retrospectively analyzed using data from the Medical Information Mart for Intensive Care-IV database and the eICU Collaborative Research Database. Based on the multivariate vital signs of a 24-hour time window for adults diagnosed with heart failure, we extracted multiresolution statistical and cosine similarity-based features. These features were used to construct and develop gradient boosting decision trees. Because CA events are far rarer than non-CA cases, the resulting data were imbalanced; therefore, we adopted cost-sensitive learning as a solution. Then, 10-fold cross-validation was performed to check the consistency of the model performance, and the Shapley additive explanation algorithm was used to capture the overall interpretability of the proposed model. Next, external validation using the eICU Collaborative Research Database was performed to check the generalization ability. RESULTS: The proposed method yielded an overall area under the receiver operating characteristic curve (AUROC) of 0.86 and an area under the precision-recall curve (AUPRC) of 0.58. In terms of timely prediction, the proposed model achieved an AUROC above 0.80 for predicting CA events up to 6 hours in advance. The proposed method simultaneously improved precision and sensitivity to increase the AUPRC, which reduced the number of false alarms while maintaining high sensitivity. This result indicates that the predictive performance of the proposed model is superior to that of the models reported in previous studies. Next, we demonstrated the effect of feature importance on the clinical interpretability of the proposed method and contrasted the effects between the non-CA and CA groups. Finally, external validation was performed using the eICU Collaborative Research Database, and an AUROC of 0.74 and an AUPRC of 0.44 were obtained in a general intensive care unit population. CONCLUSIONS: The proposed framework can provide clinicians with more accurate CA prediction results and reduce false alarm rates, as shown through internal and external validation. In addition, clinically interpretable prediction results can facilitate clinicians' understanding. Furthermore, the similarity of vital sign changes can provide insights into temporal pattern changes during CA prediction in patients with heart failure-related diagnoses. Therefore, our system is sufficiently feasible for routine clinical use. In addition, the proposed CA prediction system can be developed into and verified as a clinically mature application in the digital health field in the future.
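The cost-sensitive gradient boosting step can be sketched as below. This is a generic per-sample-weight illustration on synthetic data; the feature set, the 10:1 weight and the model settings are assumptions, not the study's pipeline.

```python
# Sketch of cost-sensitive gradient boosting for a rare positive class via
# per-sample weights; features and the 10:1 weight are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 10))              # stand-in for multiresolution vital-sign features
y = (rng.random(2000) < 0.05).astype(int)    # ~5% positive (cardiac arrest) events

weights = np.where(y == 1, 10.0, 1.0)        # heavier cost for missing a positive case
clf = GradientBoostingClassifier().fit(X, y, sample_weight=weights)
proba = clf.predict_proba(X)[:, 1]
print("training AUPRC:", round(average_precision_score(y, proba), 3))
```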


Subject(s)
Heart Arrest; Heart Failure; Adult; Humans; Artificial Intelligence; Retrospective Studies; Heart Arrest/diagnosis; Heart Arrest/therapy; Heart Failure/diagnosis; Hospitals
3.
Sensors (Basel) ; 23(5)2023 Feb 27.
Article in English | MEDLINE | ID: mdl-36904815

ABSTRACT

Owing to the remarkable development of deep learning algorithms, defect detection techniques based on deep neural networks have been extensively applied in industrial production. Most existing surface defect detection models assign equal costs to classification errors among different defect categories and do not strictly distinguish between them. However, different errors can carry a great discrepancy in decision risk or classification cost, producing a cost-sensitive issue that is crucial to the manufacturing process. To address this engineering challenge, we propose a novel supervised classification cost-sensitive learning method (SCCS) and apply it to improve YOLOv5 as CS-YOLOv5, in which the classification loss function of object detection is reconstructed according to a new cost-sensitive learning criterion explained by a label-cost vector selection method. In this way, the classification risk information from a cost matrix is directly introduced into the detection model and fully exploited in training. As a result, the developed approach can make low-risk classification decisions for defect detection and is applicable for direct cost-sensitive learning based on a cost matrix. Using two datasets of a painting surface and a hot-rolled steel strip surface, our CS-YOLOv5 model not only outperforms the original version with respect to cost under different positive classes, coefficients, and weight ratios, but also maintains effective detection performance measured by mAP and F1 scores.
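The core mechanism, turning a cost matrix into the classification loss, can be sketched as a generic cost-sensitive cross-entropy. This is not the CS-YOLOv5 loss itself; the cost values and the expected-cost formulation are assumptions.

```python
# Sketch of a cost-sensitive classification loss built from a cost matrix:
# the loss is the expected misclassification cost under the predicted class
# probabilities. Cost values are illustrative assumptions.
import torch
import torch.nn.functional as F

cost = torch.tensor([[0.0, 1.0, 5.0],
                     [1.0, 0.0, 2.0],
                     [8.0, 2.0, 0.0]])       # cost[i, j]: predict class j when true class is i

def cost_sensitive_loss(logits, targets):
    probs = F.softmax(logits, dim=1)
    label_costs = cost[targets]              # label-cost vector for each sample
    return (probs * label_costs).sum(dim=1).mean()

logits = torch.randn(4, 3, requires_grad=True)
targets = torch.tensor([0, 2, 1, 2])
loss = cost_sensitive_loss(logits, targets)
loss.backward()                              # gradients flow back to the classification head
print(float(loss))
```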

4.
Sensors (Basel) ; 23(23)2023 Nov 28.
Article in English | MEDLINE | ID: mdl-38067837

ABSTRACT

In this work, cost-sensitive decision support was developed. Using the Batch Data Analytics (BDA) methods of the batch data structure and feature accommodation, batch process property and sensor data can be accommodated: the batch data structure organises the batch processes' data, and the feature accommodation approach derives statistics from the time series, thereby aligning the time series with the other features. Three machine learning classifiers were implemented for comparison: Logistic Regression (LR), Random Forest Classifier (RFC), and Support Vector Machine (SVM). Low-probability predictions can be filtered out by leveraging the classifiers' probability estimates, so the decision support involves a trade-off between accuracy and coverage. Cost-sensitive learning was used to implement a cost matrix, which further aggregates the accuracy-coverage trade-off into cost metrics. Two scenarios were also implemented for accommodating out-of-coverage batches: in one scenario the batch is discarded, and in the other it is processed. The Random Forest classifier was shown to outperform the other classifiers and, compared to the baseline scenario, had a relative cost of 26%. This synergy of methods provides cost-aware decision support for analysing the intricate workings of a multiprocess batch data system.
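The accuracy-coverage trade-off and its aggregation into a cost metric can be sketched as follows; the classifier, the probability threshold and the cost values are illustrative assumptions, not the batch-process study's figures.

```python
# Sketch of cost-aware decision support: predictions below a probability
# threshold fall outside coverage and incur a separate handling cost.
# Threshold and cost values are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, weights=[0.8], random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)
confidence = clf.predict_proba(X).max(axis=1)
pred = clf.predict(X)

THRESHOLD = 0.8
covered = confidence >= THRESHOLD
C_WRONG, C_RIGHT, C_OUT = 10.0, 0.0, 2.0     # assumed cost matrix entries
total_cost = np.where(covered,
                      np.where(pred == y, C_RIGHT, C_WRONG),
                      C_OUT).sum()
print(f"coverage={covered.mean():.2f}, total cost={total_cost:.0f}")
```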

5.
BMC Med Inform Decis Mak ; 22(1): 36, 2022 02 10.
Article in English | MEDLINE | ID: mdl-35139846

ABSTRACT

BACKGROUND: Early detection and prediction of type 2 diabetes mellitus incidence from baseline measurements could reduce associated complications in the future. The low incidence rate of diabetes compared with non-diabetes makes accurate prediction of the minority diabetes class more challenging. METHODS: The performance of a deep neural network (DNN), extreme gradient boosting (XGBoost), and random forest (RF) in predicting the minority diabetes class was compared on Tehran Lipid and Glucose Study (TLGS) cohort data. The impacts of threshold moving, cost-sensitive learning, and over- and under-sampling strategies as solutions to class imbalance were compared for improving algorithm performance. RESULTS: The DNN, with the highest accuracy in predicting diabetes (54.8%), outperformed XGBoost and RF in terms of AUROC, g-mean, and f1-measure on the original imbalanced data. Changing the threshold based on the maximum of the f1-measure improved the g-mean and f1-measure in all three algorithms. Repeated edited nearest neighbors (RENN) under-sampling for the DNN and cost-sensitive learning for the tree-based algorithms were the best solutions to tackle the imbalance issue. RENN increased the ROC and precision-recall AUCs, g-mean and f1-measure from 0.857, 0.603, 0.713, 0.575 to 0.862, 0.608, 0.773, 0.583, respectively, for the DNN. Weighting improved the g-mean and f1-measure from 0.667, 0.554 to 0.776, 0.588 in XGBoost, and from 0.659, 0.543 to 0.775, 0.566 in RF, respectively. The ROC and precision-recall AUCs in RF also increased from 0.840, 0.578 to 0.846, 0.591, respectively. CONCLUSION: The g-mean showed the largest increase under all imbalance solutions. Weighting and threshold moving are efficient strategies and, compared with resampling methods, are faster solutions for handling class imbalance. Among the sampling strategies, under-sampling methods performed better than the others.
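Threshold moving based on the maximum of the f1-measure, one of the remedies compared above, can be sketched as follows; the model and data are synthetic stand-ins for the TLGS cohort.

```python
# Sketch of threshold moving: pick the decision threshold that maximises F1
# on the predicted probabilities. Data and model are synthetic stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=3000, weights=[0.93], random_state=2)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

thresholds = np.linspace(0.05, 0.95, 91)
f1_scores = [f1_score(y, (proba >= t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(f1_scores))]
print(f"best threshold={best:.2f}, F1={max(f1_scores):.3f}")
```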


Subject(s)
Diabetes Mellitus; Machine Learning; Algorithms; Humans; Iran; Neural Networks, Computer
6.
Sensors (Basel) ; 22(18)2022 Sep 07.
Article in English | MEDLINE | ID: mdl-36146110

ABSTRACT

To address the class imbalance in the wind turbine blade bolt operation-monitoring dataset, a fault detection method for wind turbine blade bolts based on a Gaussian Mixture Model-Synthetic Minority Oversampling Technique-Gaussian Mixture Model (GSG) scheme combined with Cost-Sensitive LightGBM (CS-LightGBM) was proposed. Since fault samples of blade bolts are difficult to obtain, the GSG oversampling method was constructed to increase the number of fault samples in the blade bolt dataset. The method obtains the optimal number of clusters through the BIC criterion and uses a GMM based on that number of clusters to optimally cluster the fault samples in the dataset. According to the density distribution of fault samples across clusters, new fault samples are synthesized with SMOTE within each cluster, which retains the distribution characteristics of the original fault-class samples. Then, a GMM with the same initial cluster centers is used to cluster the augmented fault-class samples, and synthetic fault samples that are not assigned to their corresponding clusters are removed. Finally, the synthetic training set was used to train the CS-LightGBM fault detection model, and the hyperparameters of CS-LightGBM were optimized by a Bayesian optimization algorithm to obtain the optimal fault detection model. The experimental results show that, compared with six models including SMOTE-LightGBM, CS-LightGBM and K-means-SMOTE-LightGBM, the proposed fault detection model is superior to the comparison methods in false alarm rate, missed alarm rate and F1-score. The method can effectively detect faults in large wind turbine blade bolts.
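A simplified sketch of the GSG idea (cluster the minority class with a GMM, synthesize SMOTE-style samples within each cluster, then fit a cost-sensitive LightGBM) is shown below; the cluster count, synthesis amount and class weights are assumptions, and GSG's re-clustering and duplicate-removal step is omitted.

```python
# Simplified sketch of GMM-guided minority oversampling plus a cost-sensitive
# LightGBM fit. Cluster count, synthesis amount and class weights are assumed;
# the GSG re-clustering/removal step is omitted.
import numpy as np
from sklearn.mixture import GaussianMixture
from lightgbm import LGBMClassifier

rng = np.random.default_rng(3)
X_majority = rng.normal(0, 1, size=(950, 6))          # healthy-bolt samples
X_minority = rng.normal(2, 1, size=(50, 6))           # scarce fault samples

gmm = GaussianMixture(n_components=3, random_state=0).fit(X_minority)
labels = gmm.predict(X_minority)
synthetic = []
for c in np.unique(labels):
    cluster = X_minority[labels == c]
    for _ in range(60):
        a, b = cluster[rng.integers(len(cluster), size=2)]
        synthetic.append(a + rng.random() * (b - a))   # SMOTE-style interpolation inside the cluster
X_synthetic = np.array(synthetic)

X = np.vstack([X_majority, X_minority, X_synthetic])
y = np.r_[np.zeros(len(X_majority)), np.ones(len(X_minority) + len(X_synthetic))].astype(int)
clf = LGBMClassifier(class_weight={0: 1.0, 1: 5.0}).fit(X, y)   # cost-sensitive weighting
print(clf.predict(X[:5]))
```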

7.
Sensors (Basel) ; 22(11)2022 May 27.
Article in English | MEDLINE | ID: mdl-35684694

ABSTRACT

Arrhythmia detection algorithms based on deep learning are attracting considerable interest due to their vital role in the diagnosis of cardiac abnormalities. Despite this interest, deep feature representation for ECG is still challenging and intriguing due to the inter-patient variability of the ECG's morphological characteristics. The aim of this study was to learn a balanced deep feature representation that incorporates both the short-term and long-term morphological characteristics of ECG beats. For efficient feature extraction, we designed a temporal transition module that uses convolutional layers with different kernel sizes to capture a wide range of morphological patterns. Imbalanced data are a key issue in developing an efficient and generalized model for arrhythmia detection as they cause over-fitting to minority class samples (abnormal beats) of primary interest. To mitigate the imbalanced data issue, we proposed a novel, cost-sensitive loss function that ensures a balanced deep representation of class samples by assigning effective weights to each class. The cost-sensitive loss function dynamically alters class weights for every batch based on class distribution and model performance. The proposed method acquired an overall accuracy of 99.81% for intra-patient classification and 96.36% for the inter-patient classification of heartbeats. The experimental results reveal that the proposed approach learned a balanced representation of ECG beats by mitigating the issue of imbalanced data and achieved an improved classification performance as compared to other studies.
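The dynamic class weighting described above can be sketched as a loss that recomputes weights from each batch's class distribution; the inverse-frequency rule used here is an assumption, not the authors' exact weighting.

```python
# Sketch of a cost-sensitive loss whose class weights are recomputed per batch
# from the batch's class distribution (inverse-frequency weighting assumed).
import torch
import torch.nn.functional as F

def batch_weighted_cross_entropy(logits, targets, n_classes):
    counts = torch.bincount(targets, minlength=n_classes).float()
    weights = counts.sum() / (n_classes * counts.clamp(min=1.0))   # rarer classes get larger weights
    return F.cross_entropy(logits, targets, weight=weights)

logits = torch.randn(32, 5, requires_grad=True)        # e.g. 5 heartbeat classes
targets = torch.randint(0, 5, (32,))
loss = batch_weighted_cross_entropy(logits, targets, n_classes=5)
loss.backward()
print(float(loss))
```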


Subject(s)
Electrocardiography; Neural Networks, Computer; Algorithms; Arrhythmias, Cardiac/diagnosis; Electrocardiography/methods; Heart Rate; Humans
8.
Entropy (Basel) ; 24(2)2022 Feb 08.
Article in English | MEDLINE | ID: mdl-35205547

ABSTRACT

Early diagnosis of cancer is beneficial in formulating the best treatment plan; it can improve the survival rate and the quality of patients' lives. However, the imaging examinations and needle biopsies usually used not only struggle to diagnose tumors effectively at an early stage, but also do great harm to the human body. Since changes in a patient's health status cause changes in blood protein indexes, diagnosing cancer from changes in blood indexes at an early stage would not only make it convenient to track and monitor the course of treatment, but would also reduce patients' pain and costs. In this paper, 39 serum protein markers were taken as research objects. The differences in the entropies of serum protein marker sequences across different types of patients were analyzed, and on this basis a cost-sensitive analysis model was established to improve the accuracy of cancer recognition. The results showed that there were significant differences in entropy between different cancer patients, and the complexity of serum protein markers in normal people was higher than that in cancer patients. Although the dataset was rather imbalanced, containing 897 instances (799 normal instances, 44 liver cancer instances, and 54 ovarian cancer instances), the accuracy of our model still reached 95.21%. Other evaluation indicators were also stable and satisfactory; precision, recall, F1 and AUC reached 0.807, 0.833, 0.819 and 0.92, respectively. This study has theoretical and practical significance for cancer prediction and clinical application and can also provide a research basis for intelligent medical treatment.
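The entropy computation underlying the analysis can be sketched as the Shannon entropy of a discretised marker sequence; the binning and the per-subject layout are assumptions, since the abstract does not specify them.

```python
# Sketch of Shannon entropy for a serum-marker sequence after histogram
# discretisation; the bin count and data layout are assumptions.
import numpy as np

def shannon_entropy(values, bins=10):
    hist, _ = np.histogram(values, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(4)
markers = rng.normal(size=39)        # stand-in for one subject's 39 serum protein markers
print(round(shannon_entropy(markers), 3))
```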

9.
BMC Med Inform Decis Mak ; 21(1): 73, 2021 02 25.
Article in English | MEDLINE | ID: mdl-33632225

ABSTRACT

BACKGROUND: Heart disease is the primary cause of morbidity and mortality in the world. It involves numerous problems and symptoms. The diagnosis of heart disease is difficult because there are many factors to analyze; moreover, the misclassification cost can be very high. METHODS: A cost-sensitive ensemble method was proposed to improve the efficiency of diagnosis and reduce the misclassification cost. The proposed method contains five heterogeneous classifiers: random forest, logistic regression, support vector machine, extreme learning machine and k-nearest neighbor. A t-test was used to investigate whether the performance of the ensemble was better than that of the individual classifiers and to assess the contribution of the Relief algorithm. RESULTS: The best performance was achieved by the proposed method according to ten-fold cross-validation. The statistical tests demonstrated that the performance of the proposed ensemble was significantly superior to that of the individual classifiers, and that the efficiency of classification was distinctly improved by the Relief algorithm. CONCLUSIONS: The proposed ensemble obtained significantly better results than the individual classifiers and previous studies, which implies that it can be used as a promising alternative tool in medical decision making for heart disease diagnosis.
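The cost-sensitive combination of heterogeneous classifiers can be sketched as below: averaged class probabilities are turned into a decision that minimises expected misclassification cost. The member models, the cost values and the averaging rule are illustrative assumptions, not the paper's exact ensemble.

```python
# Sketch of a cost-sensitive heterogeneous ensemble: average member
# probabilities, then predict the class with the lowest expected cost.
# Members, costs and the combination rule are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, random_state=5)
members = [RandomForestClassifier(random_state=0),
           LogisticRegression(max_iter=1000),
           KNeighborsClassifier()]
proba = np.mean([m.fit(X, y).predict_proba(X) for m in members], axis=0)

cost = np.array([[0.0, 1.0],       # cost[true, predicted]: missing disease costs 5x a false alarm
                 [5.0, 0.0]])
expected_cost = proba @ cost       # expected cost of predicting each class
pred = expected_cost.argmin(axis=1)
print("predicted positive rate:", round(pred.mean(), 3))
```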


Subject(s)
Algorithms; Heart Diseases; Cluster Analysis; Heart Diseases/diagnosis; Humans; Logistic Models; Support Vector Machine
10.
BMC Bioinformatics ; 20(Suppl 25): 681, 2019 Dec 24.
Article in English | MEDLINE | ID: mdl-31874599

ABSTRACT

BACKGROUND: Cost-sensitive algorithms are an effective strategy for solving imbalanced classification problems. However, the misclassification costs are usually determined empirically based on user expertise, which leads to unstable performance of cost-sensitive classification. Therefore, an efficient and accurate method is needed to calculate the optimal cost weights. RESULTS: In this paper, two approaches are proposed to search for the optimal cost weights, targeting the highest weighted classification accuracy (WCA): grid searching for the optimal cost weights, and function fitting. Comparisons are made between the two approaches. In the experiments, we classify imbalanced gene expression data using an extreme learning machine to test the cost weights obtained by the two approaches. CONCLUSIONS: Comprehensive experimental results show that the function fitting method is generally more efficient and can find the optimal cost weights with acceptable WCA.
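Grid searching the misclassification cost weight, one of the two approaches compared above, can be sketched as follows; the WCA surrogate used here (balanced accuracy), the classifier and the grid are assumptions.

```python
# Sketch of grid searching a minority-class cost weight for the highest
# weighted classification accuracy; balanced accuracy is used as a WCA
# surrogate, and the grid and model are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=6)
best_weight, best_wca = None, -1.0
for w in np.linspace(1, 20, 20):                       # candidate minority-class cost weights
    clf = LogisticRegression(max_iter=1000, class_weight={0: 1.0, 1: w}).fit(X, y)
    wca = balanced_accuracy_score(y, clf.predict(X))
    if wca > best_wca:
        best_weight, best_wca = w, wca
print(f"best cost weight={best_weight:.1f}, WCA={best_wca:.3f}")
```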


Subject(s)
Algorithms; Gene Expression; Colonic Neoplasms/genetics; Colonic Neoplasms/metabolism; Humans; Leukemia/genetics; Leukemia/metabolism
11.
Sensors (Basel) ; 19(4)2019 Feb 14.
Article in English | MEDLINE | ID: mdl-30769813

ABSTRACT

Significant progress has been achieved in the past few years on the challenging task of pedestrian detection. Nevertheless, a major bottleneck of existing state-of-the-art approaches lies in the sharp drop in performance as the resolution of the detected targets decreases. For the boosting-based detectors that are popular in the pedestrian detection literature, a possible cause of this drop is that, during boosting, low-resolution samples, which are usually harder to detect because of their missing details, are treated as equally important as high-resolution samples; this results in false negatives, since such samples are more easily rejected in the early stages and can hardly be recovered in the later stages. To address this problem, we propose in this paper a robust multi-resolution detection approach with a novel group cost-sensitive boosting algorithm, which is derived from the standard AdaBoost algorithm to explore different costs for different resolution groups of samples in the boosting process and to place greater emphasis on low-resolution groups in order to better handle the detection of multi-resolution targets. The effectiveness of the proposed approach is evaluated on the Caltech pedestrian benchmark and the KAIST (Korea Advanced Institute of Science and Technology) multispectral pedestrian benchmark, and validated by its promising performance on different resolution-specific test sets of both benchmarks.
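One crude way to approximate the emphasis on low-resolution groups is to give those samples larger initial boosting weights, as sketched below. The actual group cost-sensitive algorithm modifies the AdaBoost weight update itself, so this is only an approximation, and the group assignment and weights are assumptions.

```python
# Rough approximation of group emphasis in boosting: low-resolution samples
# get larger initial weights. The paper's algorithm changes the boosting
# update itself; the group membership and weights here are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(7)
X, y = make_classification(n_samples=1000, random_state=7)
low_resolution = rng.random(1000) < 0.3            # assumed membership in the low-resolution group
weights = np.where(low_resolution, 3.0, 1.0)
weights = weights / weights.sum()

clf = AdaBoostClassifier(random_state=0).fit(X, y, sample_weight=weights)
print("training accuracy:", round(clf.score(X, y), 3))
```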

12.
J Med Syst ; 43(8): 251, 2019 Jun 28.
Article in English | MEDLINE | ID: mdl-31254110

ABSTRACT

With the development of theories and technologies in medical imaging, most tumors can be detected at an early stage. However, the nature of ovarian cysts is difficult to judge accurately, so many patients with benign nodules still undergo Fine Needle Aspiration (FNA) biopsies or surgeries, increasing patients' physical pain and mental pressure as well as unnecessary health care costs. Therefore, we present an image diagnosis system for classifying ovarian cysts in color ultrasound images, which applies image features that fuse high-level features from a deep learning network with low-level features from a texture descriptor. Firstly, the ultrasound images are enhanced to improve the quality of the training dataset, and rotation-invariant uniform local binary pattern (ULBP) features are extracted from each image as the low-level texture features. Then the high-level deep features extracted by a fine-tuned GoogLeNet neural network and the low-level ULBP features are normalized and cascaded into one fusion feature that represents both the semantic context and the texture patterns distributed in the image. Finally, the fusion features are input to a cost-sensitive Random Forest classifier to classify the images as "malignant" or "benign". The high-level features extracted by the deep neural network from the medical ultrasound image reflect the visual features of the lesion region, while the low-level texture features describe the edges, direction and distribution of intensities. Experimental results indicate that the combination of the two types of features can describe the differences between the lesion regions and other regions, and between the lesion regions of malignant and benign ovarian cysts.
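The fusion-plus-cost-sensitive-classification step can be sketched as below: normalised high-level and low-level feature vectors are cascaded and fed to a class-weighted random forest. The feature arrays are random stand-ins for GoogLeNet and ULBP descriptors, and the class weights are assumptions.

```python
# Sketch of cascading normalised deep and texture features and fitting a
# cost-sensitive (class-weighted) random forest. Feature arrays are random
# stand-ins; dimensions and weights are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
deep_features = rng.normal(size=(300, 1024))   # stand-in for fine-tuned GoogLeNet features
ulbp_features = rng.normal(size=(300, 59))     # stand-in for rotation-invariant ULBP histograms
y = (rng.random(300) < 0.3).astype(int)        # 1 = malignant, 0 = benign (synthetic labels)

fused = np.hstack([StandardScaler().fit_transform(deep_features),
                   StandardScaler().fit_transform(ulbp_features)])
clf = RandomForestClassifier(class_weight={0: 1.0, 1: 3.0}, random_state=0).fit(fused, y)
print(fused.shape, round(clf.score(fused, y), 3))
```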


Subject(s)
Deep Learning; Early Detection of Cancer; Neural Networks, Computer; Ovarian Neoplasms/diagnosis; Ultrasonography/methods; Diagnosis, Computer-Assisted; Female; Humans
13.
J Comput Aided Mol Des ; 32(5): 583-590, 2018 05.
Article in English | MEDLINE | ID: mdl-29626291

ABSTRACT

Cheminformatics datasets used in classification problems, especially those related to biological or physicochemical properties, are often imbalanced. This presents a major challenge in the development of in silico prediction models, as traditional machine learning algorithms are known to work best on balanced datasets. The class imbalance introduces a bias in the performance of these algorithms due to their preference for the majority class. Here, we present a comparison of the performance of seven different meta-classifiers in their ability to handle imbalanced datasets, with Random Forest used as the base classifier. Four different datasets that are directly (cholestasis) or indirectly (via inhibition of organic anion transporting polypeptides 1B1 and 1B3) related to liver toxicity were chosen for this purpose. The imbalance ratio in these datasets ranges between 4:1 and 20:1 for negative and positive classes, respectively. Three different sets of molecular descriptors were used for model development, and their performance was assessed in 10-fold cross-validation and on an independent validation set. Stratified Bagging, MetaCost and CostSensitiveClassifier were found to be the best performing among all the methods. While MetaCost and CostSensitiveClassifier provided better sensitivity values, Stratified Bagging resulted in high balanced accuracies.
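Stratified bagging, one of the better-performing meta-classifiers above, can be sketched as follows: each bag keeps all minority samples and draws an equally sized subset of the majority class. The bag count and base model are assumptions (the study used Weka meta-classifiers around Random Forest).

```python
# Sketch of stratified bagging: each bag = all minority samples + an equal-size
# random draw of majority samples; averaged probabilities form the prediction.
# Bag count and base model are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=9)
rng = np.random.default_rng(9)
minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]

bag_probas = []
for seed in range(10):
    draw = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    idx = np.concatenate([minority_idx, draw])
    clf = RandomForestClassifier(random_state=seed).fit(X[idx], y[idx])
    bag_probas.append(clf.predict_proba(X)[:, 1])
print("averaged minority-class probability (first 5):", np.mean(bag_probas, axis=0)[:5].round(2))
```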


Subject(s)
Computer Simulation; Datasets as Topic; Liver/drug effects; Algorithms; Animals; Cholestasis/chemically induced; Humans; Liver/metabolism; Machine Learning; Organic Anion Transport Protein 1/antagonists & inhibitors
14.
Entropy (Basel) ; 20(4)2018 Apr 05.
Article in English | MEDLINE | ID: mdl-33265345

ABSTRACT

Aim: Currently, identification of multiple sclerosis (MS) by human experts may come across the problem of "normal-appearing white matter", which causes low sensitivity. Methods: In this study, we present a computer vision-based approach to identify MS automatically. The proposed method first extracts a fractional Fourier entropy map from a specified brain image. Afterwards, it sends the features to a multilayer perceptron trained by a proposed improved, parameter-free Jaya algorithm. We used cost-sensitive learning to handle the imbalanced data problem. Results: The 10 × 10-fold cross-validation showed that our method yielded a sensitivity of 97.40 ± 0.60%, a specificity of 97.39 ± 0.65%, and an accuracy of 97.39 ± 0.59%. Conclusions: Experiments validated that the proposed improved Jaya algorithm performs better than the plain Jaya algorithm and other recent bio-inspired algorithms in terms of classification performance and training speed. In addition, our method is superior to four state-of-the-art MS identification approaches.

15.
Biomed Eng Online ; 16(1): 132, 2017 Nov 21.
Article in English | MEDLINE | ID: mdl-29157240

ABSTRACT

BACKGROUND: Ocular images play an essential role in ophthalmological diagnoses. Imbalanced datasets are an inevitable issue in automated ocular disease diagnosis; the scarcity of positive samples always tends to result in the misdiagnosis of severe patients during the classification task. Exploring an effective computer-aided diagnostic method to deal with imbalanced ophthalmological datasets is crucial. METHODS: In this paper, we develop an effective cost-sensitive deep residual convolutional neural network (CS-ResCNN) classifier to diagnose ophthalmic diseases using retro-illumination images. First, the regions of interest (crystalline lens) are automatically identified via twice-applied Canny detection and Hough transformation. Then, the localized zones are fed into the CS-ResCNN to extract high-level features for subsequent use in automatic diagnosis. Second, the impacts of cost factors on the CS-ResCNN are further analyzed using a grid-search procedure to verify that our proposed system is robust and efficient. RESULTS: Qualitative analyses and quantitative experimental results demonstrate that our proposed method outperforms other conventional approaches and offers exceptional mean accuracy (92.24%), specificity (93.19%), sensitivity (89.66%) and AUC (97.11%) results. Moreover, the sensitivity of the CS-ResCNN is enhanced by over 13.6% compared to the native CNN method. CONCLUSION: Our study provides a practical strategy for addressing imbalanced ophthalmological datasets and has the potential to be applied to other medical images. The developed and deployed CS-ResCNN could serve as computer-aided diagnosis software for ophthalmologists in clinical applications.


Subject(s)
Cost-Benefit Analysis; Diagnosis, Computer-Assisted/economics; Diagnostic Imaging; Eye Diseases/diagnostic imaging; Image Processing, Computer-Assisted/methods; Neural Networks, Computer; Automation; Software
16.
Biom J ; 59(5): 948-966, 2017 Sep.
Article in English | MEDLINE | ID: mdl-28626952

ABSTRACT

The classification of a population by a specific trait is a major task in medicine, for example when groups of patients with specific diseases are identified in a diagnostic setting, but also when, in predictive medicine, a group of patients is classified into specific disease severity classes that might profit from different treatments. When the sizes of those subgroups become small, for example in rare diseases, imbalances between the classes are more the rule than the exception and make statistical classification problematic: the error rate of the minority class is high, since many observations are classified as belonging to the majority class, while the error rate of the majority class is low. This case study aims to investigate class imbalance for Random Forests and Powered Partial Least Squares Discriminant Analysis (PPLS-DA) and to evaluate the performance of these classifiers when they are combined with methods to compensate for imbalance (sampling methods, cost-sensitive learning approaches). We evaluate all approaches with a scoring system that takes the classification results into consideration. The case study is based on one high-dimensional multiplex autoimmune assay dataset describing immune response to antigens and consisting of two classes of patients: Rheumatoid Arthritis (RA) and Systemic Lupus Erythematosus (SLE). Datasets with varying degrees of imbalance are created by successively reducing the class of RA patients. Our results indicate a possible benefit of cost-sensitive learning approaches for Random Forests. Although further research is needed to verify our findings by investigating other datasets or large-scale simulation studies, we claim that this work has the potential to increase practitioners' awareness of the class imbalance problem, and it stresses the importance of considering methods to compensate for class imbalance.


Subject(s)
Biometry/methods; Algorithms; Arthritis, Rheumatoid/diagnosis; Biological Assay/standards; Computer Simulation; Discriminant Analysis; Humans; Lupus Erythematosus, Systemic/diagnosis
17.
BMC Bioinformatics ; 17(Suppl 18): 472, 2016 Dec 15.
Article in English | MEDLINE | ID: mdl-28105913

ABSTRACT

BACKGROUND: This work presents a machine learning strategy to increase sensitivity in tandem mass spectrometry (MS/MS) data analysis for peptide/protein identification. MS/MS yields thousands of spectra in a single run, which are then interpreted by software. Most of these computer programs use a protein database to match peptide sequences to the observed spectra. The peptide-spectrum matches (PSMs) must also be assessed by computational tools, since manual evaluation is not practicable. The target-decoy database strategy is largely used for error estimation in PSM assessment. However, in general, that strategy does not account for sensitivity. RESULTS: In a previous study, we proposed the method MUMAL, which applies an artificial neural network to effectively generate a model that classifies PSMs using decoy hits, with increased sensitivity. Nevertheless, the present approach shows that sensitivity can be further improved with the use of a cost matrix associated with the learning algorithm. We also demonstrate that using a threshold selector algorithm for probability adjustment leads to more coherent probability values assigned to the PSMs. Our new approach, termed MUMAL2, provides a two-fold contribution to shotgun proteomics. First, the increase in the number of correctly interpreted spectra at the peptide level augments the chance of identifying more proteins. Second, the more appropriate PSM probability values produced by the threshold selector algorithm benefit the protein inference stage performed by programs that take probabilities into account, such as ProteinProphet. Our experiments demonstrate that MUMAL2 reached around a 15% improvement in sensitivity compared to the best current method. Furthermore, the area under the ROC curve obtained was 0.93, demonstrating that the probabilities generated by our model are in fact appropriate. Finally, Venn diagrams comparing MUMAL2 with the best current method show that the number of exclusive peptides found by our method was nearly 4-fold higher, which directly impacts proteome coverage. CONCLUSIONS: The inclusion of a cost matrix and a probability threshold selector algorithm in the learning task further improves the target-decoy database analysis for identifying peptides, which optimally contributes to the challenging task of protein-level identification, resulting in a powerful computational tool for shotgun proteomics.


Subject(s)
Neural Networks, Computer; Proteomics/methods; Algorithms; Databases, Protein/economics; Peptides/chemistry; Probability; Proteome/chemistry; Proteomics/economics; Software; Tandem Mass Spectrometry/methods
18.
Mol Divers ; 20(1): 93-109, 2016 Feb.
Article in English | MEDLINE | ID: mdl-26643659

ABSTRACT

In many absorption, distribution, metabolism, and excretion (ADME) modeling problems, imbalanced data can negatively affect the classification performance of machine learning algorithms. Solutions for handling imbalanced datasets have been proposed, but their application to ADME modeling tasks is underexplored. In this paper, various strategies, including cost-sensitive learning and resampling methods, were studied to tackle the moderate imbalance problem of a large Caco-2 cell permeability database. Simple physicochemical molecular descriptors were utilized for data modeling. Support vector machine classifiers were constructed and compared using multiple comparison tests. Results showed that the models developed on the basis of resampling strategies displayed better performance than the cost-sensitive classification models, especially in the case of oversampled data, where misclassification rates for the minority class were 0.11 and 0.14 for the training and test set, respectively. A consensus model with an enhanced applicability domain was subsequently constructed and showed improved performance. This model was used to predict a set of randomly selected high-permeability reference drugs according to the biopharmaceutics classification system. Overall, this study provides a comparison of numerous rebalancing strategies and demonstrates the effectiveness of oversampling methods in dealing with imbalanced permeability data problems.


Subject(s)
Models, Biological; Caco-2 Cells; Databases, Factual; Humans; Machine Learning; Permeability; Support Vector Machine
19.
J Dairy Sci ; 98(6): 3717-28, 2015 Jun.
Article in English | MEDLINE | ID: mdl-25841967

ABSTRACT

The common practice on most commercial dairy farms is to inseminate all cows that are eligible for breeding, while ignoring (or absorbing) the costs associated with semen and labor directed toward low-fertility cows that are unlikely to conceive. Modern analytical methods, such as machine learning algorithms, can be applied to cow-specific explanatory variables for the purpose of computing probabilities of success or failure associated with upcoming insemination events. Lift chart analysis can identify subsets of high fertility cows that are likely to conceive and are therefore appropriate targets for insemination (e.g., with conventional artificial insemination semen or expensive sex-enhanced semen), as well as subsets of low-fertility cows that are unlikely to conceive and should therefore be passed over at that point in time. Although such a strategy might be economically viable, the management, environmental, and financial conditions on one farm might differ widely from conditions on the next, and hence the reproductive management recommendations derived from such a tool may be suboptimal for specific farms. When coupled with cost-sensitive evaluation of misclassified and correctly classified insemination events, the strategy can be a potentially powerful tool for optimizing the reproductive management of individual farms. In the present study, lift chart analysis and cost-sensitive evaluation were applied to a data set consisting of 54,806 insemination events of primiparous Holstein cows on 26 Wisconsin farms, as well as a data set with 17,197 insemination events of primiparous Holstein cows on 3 Wisconsin farms, where the latter had more detailed information regarding health events of individual cows. In the first data set, the gains in profit associated with limiting inseminations to subsets of 79 to 97% of the most fertile eligible cows ranged from $0.44 to $2.18 per eligible cow in a monthly breeding period, depending on days in milk at breeding and milk yield relative to contemporaries. In the second data set, the strategy of inseminating only a subset consisting of 59% of the most fertile cows conferred a gain in profit of $5.21 per eligible cow in a monthly breeding period. These results suggest that, when used with a powerful classification algorithm, lift chart analysis and cost-sensitive evaluation of correctly classified and misclassified insemination events can enhance the performance and profitability of reproductive management programs on commercial dairy farms.
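The lift-chart-with-costs idea can be sketched as follows: cows are ranked by predicted conception probability and only the top fraction is inseminated, with the cut-off chosen to maximise expected profit. The probabilities, the value of a conception and the cost per insemination are illustrative assumptions, not the study's farm-specific figures.

```python
# Sketch of lift-chart style selection with cost-sensitive evaluation: rank
# cows by predicted conception probability and breed only the most profitable
# top fraction. Probabilities and dollar figures are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(10)
p_conceive = rng.beta(2, 4, size=1000)            # predicted probability per eligible cow
order = np.argsort(-p_conceive)                   # best candidates first

VALUE_PER_CONCEPTION = 275.0                      # assumed value of a pregnancy ($)
COST_PER_INSEMINATION = 25.0                      # assumed semen + labor cost ($)
profits = []
for k in range(1, len(p_conceive) + 1):
    expected_conceptions = p_conceive[order[:k]].sum()
    profits.append(expected_conceptions * VALUE_PER_CONCEPTION - k * COST_PER_INSEMINATION)
best_k = int(np.argmax(profits)) + 1
print(f"breed top {100 * best_k / len(p_conceive):.0f}% of cows, expected profit ${max(profits):,.0f}")
```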


Subject(s)
Insemination, Artificial/veterinary; Reproduction/physiology; Algorithms; Animals; Breeding; Cattle; Costs and Cost Analysis; Dairying/methods; Female; Fertility; Fertilization; Male; Milk/economics; Parity; Pregnancy; Semen; Wisconsin
20.
Sci Rep ; 14(1): 18625, 2024 08 11.
Article in English | MEDLINE | ID: mdl-39128903

ABSTRACT

The COVID-19 pandemic has imposed significant challenges on global health, emphasizing the persistent threat of large-scale infectious diseases in the future. This study addresses the need to enhance pooled-testing efficiency for large populations. The common approach in pooled testing involves consolidating multiple test samples into a single tube to detect positivity efficiently at a lower cost. However, what is the optimal number of samples to group together in order to minimize costs? For example, allocating ten individuals per group may not be the most cost-effective strategy. In response, this paper introduces the hierarchical quotient space, an extension of fuzzy equivalence relations, as a method to optimize group allocations. We propose a cost-sensitive multi-granularity intelligent decision model to further minimize testing costs. This model considers both testing and collection costs, aiming to achieve the lowest total cost through optimal grouping at a single layer. Building upon this foundation, two multi-granularity models are proposed that explore hierarchical group optimization. The experimental simulations were conducted using MATLAB R2022a on a desktop with an Intel i5-10500 CPU and 8 GB of RAM, considering scenarios with a fixed number of individuals and a fixed positive probability. The main findings from our simulations demonstrate that the proposed models significantly enhance the efficiency of pooled testing and reduce its overall costs. For example, testing costs were reduced by nearly half when the optimal grouping strategy was applied, compared to the traditional method of grouping ten individuals. Additionally, the multi-granularity approach further optimized the hierarchical groupings, leading to substantial cost savings and improved testing efficiency.
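The single-layer version of the grouping question has a classical baseline, Dorfman two-stage pooled testing, which already shows why ten individuals per group is not always optimal. The sketch below computes the expected number of tests per person as a function of group size; it ignores collection costs and the multi-granularity hierarchy, so it is a baseline rather than the paper's model, and the positive probability is an assumed value.

```python
# Baseline Dorfman pooled-testing calculation: expected tests per person as a
# function of group size n, for an assumed positive probability p.
import numpy as np

def expected_tests_per_person(n, p):
    """One pooled test per group of n, plus n individual retests if the pool is positive."""
    return 1.0 / n + 1.0 - (1.0 - p) ** n

p = 0.01                                    # assumed prevalence
sizes = np.arange(2, 51)
costs = [expected_tests_per_person(n, p) for n in sizes]
best = int(sizes[int(np.argmin(costs))])
print(f"optimal group size={best}, expected tests per person={min(costs):.3f}")
```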


Subject(s)
COVID-19; Cost-Benefit Analysis; Humans; COVID-19/epidemiology; COVID-19/diagnosis; COVID-19/economics; COVID-19/virology; SARS-CoV-2/isolation & purification; COVID-19 Testing/methods; COVID-19 Testing/economics; Pandemics/economics; Decision Support Techniques