Results 1 - 20 of 96
1.
Brief Bioinform ; 24(3)2023 05 19.
Article in English | MEDLINE | ID: mdl-37039664

ABSTRACT

Single-cell ribonucleic acid sequencing (scRNA-seq) enables the quantification of gene expression at the transcriptomic level with single-cell resolution, enhancing our understanding of cellular heterogeneity. However, the excessive missing values present in scRNA-seq data hinder downstream analysis. While numerous imputation methods have been proposed to recover scRNA-seq data, high imputation performance often comes with low or no interpretability. Here, we present IGSimpute, an accurate and interpretable imputation method for recovering missing values in scRNA-seq data with an interpretable instance-wise gene selection layer (GSL). IGSimpute outperforms 12 other state-of-the-art imputation methods on 13 out of 17 datasets from different scRNA-seq technologies, with the lowest mean squared error as the chosen benchmark metric. We demonstrate that IGSimpute can give unbiased estimates of the missing values compared to other methods, regardless of whether the average gene expression values are small or large. Clustering results of imputed profiles show that IGSimpute offers statistically significant improvement over other imputation methods. By taking the heart-and-aorta and the limb muscle tissues as examples, we show that IGSimpute can also denoise gene expression profiles by removing outlier entries with unexpectedly high expression values via the instance-wise GSL. We also show that genes selected by the instance-wise GSL could indicate the age of B cells from bladder fat tissue of the Tabula Muris Senis atlas. IGSimpute can impute one million cells in 64 min and is thus applicable to large datasets.


Subjects
Single-Cell Gene Expression Analysis , Software , RNA Sequence Analysis/methods , Single-Cell Analysis/methods , Gene Expression Profiling , Transcriptome , Cluster Analysis
2.
Methods ; 227: 17-26, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38705502

ABSTRACT

Messenger RNA (mRNA) is vital for post-transcriptional gene regulation, acting as the direct template for protein synthesis. However, existing methods for predicting mRNA subcellular localization leave room for improvement. Notably, few existing algorithms can annotate mRNA sequences with multiple localizations. In this work, we propose mRNA-CLA, an innovative multi-label subcellular localization prediction framework for mRNA, leveraging a deep learning approach with a multi-head self-attention mechanism. The framework employs a multi-scale convolutional layer to extract sequence features across different regions and uses a self-attention mechanism explicitly designed for each sequence. Paired with Position Weight Matrices (PWMs) derived from the convolutional neural network layers, our model offers interpretability in the analysis. In particular, we perform a base-level analysis of mRNA sequences from diverse subcellular localizations to determine the nucleotide specificity corresponding to each site. Our evaluations demonstrate that the mRNA-CLA model substantially outperforms existing methods and tools.


Subjects
Deep Learning , Messenger RNA , Messenger RNA/genetics , Messenger RNA/metabolism , Computational Biology/methods , Computer Neural Networks , Humans , Algorithms
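The position weight matrices mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a set of equal-length RNA subsequences is already available (for example, the windows that most strongly activate one convolutional filter); how mRNA-CLA actually derives its PWMs is not described in this record. The function names, the pseudocount, and the uniform background frequency are all illustrative assumptions.

```python
import math
from collections import Counter

ALPHABET = "ACGU"  # RNA alphabet; an assumption for this sketch


def position_weight_matrix(sequences, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from equal-length aligned sequences.

    Returns one {base: log2(freq / background)} dict per position.
    The pseudocount and uniform background are illustrative choices.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    total = len(sequences) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(length):
        counts = Counter(s[pos] for s in sequences)
        column = {}
        for base in ALPHABET:
            freq = (counts[base] + pseudocount) / total
            column[base] = math.log2(freq / background)
        pwm.append(column)
    return pwm


def score(pwm, sequence):
    """Sum the per-position log-odds of a sequence under the PWM."""
    return sum(column[base] for column, base in zip(pwm, sequence))
```

A sequence that matches the dominant base at each position receives a higher score than one that does not, which is what makes the matrix readable as a motif.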
3.
Brief Bioinform ; 23(6)2022 11 19.
Article in English | MEDLINE | ID: mdl-36209437

ABSTRACT

Long non-coding RNA (lncRNA) plays important roles in a series of biological processes. The transcription of lncRNA is regulated by its promoter. Hence, accurate identification of lncRNA promoters will be helpful for understanding their regulatory mechanisms. Since experimental techniques remain time consuming for genome-wide promoter identification, developing computational tools to identify promoters is necessary. However, only a few computational methods have been proposed for lncRNA promoter prediction, and their performance still has room for improvement. In the present work, a convolutional neural network based model, called DeepLncPro, was proposed to identify lncRNA promoters in human and mouse. Comparative results demonstrated that DeepLncPro was superior to both state-of-the-art machine learning methods and existing models for identifying lncRNA promoters. Furthermore, DeepLncPro has the ability to extract and analyze transcription factor binding motifs from lncRNAs, which makes it an interpretable model. These results indicate that DeepLncPro can serve as a powerful tool for identifying lncRNA promoters. An open-source tool for DeepLncPro was provided at https://github.com/zhangtian-yang/DeepLncPro.


Subjects
Long Noncoding RNA , Humans , Animals , Mice , Long Noncoding RNA/genetics , Long Noncoding RNA/metabolism , Computational Biology/methods , Computer Neural Networks , Genetic Promoter Regions , Algorithms
4.
J Magn Reson Imaging ; 2024 Oct 01.
Article in English | MEDLINE | ID: mdl-39353848

ABSTRACT

BACKGROUND: Automated approaches may allow for fast, reproducible clinical assessment of cardiovascular diseases from MRI. PURPOSE: To develop an MRI-based deep learning (DL) disease classification algorithm to distinguish among normal subjects (NORM), patients with dilated cardiomyopathy (DCM), hypertrophic cardiomyopathy (HCM), and ischemic heart disease (IHD). STUDY TYPE: Retrospective. POPULATION: A total of 1337 subjects (55% female), comprising normal subjects (N = 568), and patients with DCM (N = 151), HCM (N = 177), and IHD (N = 441). FIELD STRENGTH/SEQUENCE: Balanced steady-state free precession cine sequence at 1.5/3.0 T. ASSESSMENT: Bi-ventricular morphological and functional features and global and segmental left ventricular strain features were automatically extracted from short- and long-axis cine images. Variational autoencoder models were trained on the extracted features and compared against consensus disease label provided by two expert readers (13 and 14 years of experience). Adding unlabeled, normal data to the training was explored to increase specificity of NORM class. STATISTICAL TESTS: Tenfold cross-validation for model development; mean, standard deviation (SD) for measurements; classification metrics: area under the curve (AUC), confusion matrix, accuracy, specificity, precision, recall; 95% confidence intervals; Mann-Whitney U test for significance. RESULTS: AUCs of 0.952 for NORM, 0.881 for DCM, 0.908 for HCM, and 0.856 for IHD and overall accuracy of 0.778 were obtained, with specificity of 0.908 for the NORM class using both SAX and LAX features. Longitudinal strain features slightly improved classification metrics by 0.001 to 0.03 points, except for HCM-AUC. Differences in accuracy, metrics for NORM class and HCM-AUC were statistically significant. Cotraining using unlabeled data increased the specificity for the NORM class to 0.961. 
DATA CONCLUSION: Cardiac function features automatically extracted from cine MRI have potential to be used for disease classification, especially for normal-abnormal classification. Feature analyses showed that strain features were important for disease labeling. Cotraining using unlabeled data may help to increase specificity for normal-abnormal classification. LEVEL OF EVIDENCE: 3 TECHNICAL EFFICACY: Stage 1.
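
The per-class metrics listed under STATISTICAL TESTS above (specificity, precision, recall) and the overall accuracy can all be read off a multi-class confusion matrix. The following is a minimal sketch under that assumption; the function names and the row/column convention are illustrative and are not taken from the study.

```python
def one_vs_rest_metrics(confusion, labels, positive):
    """Per-class metrics from a square confusion matrix.

    confusion[i][j] is the number of cases whose true class is labels[i]
    and whose predicted class is labels[j] (a convention assumed here).
    """
    k = labels.index(positive)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                 # true class k, predicted otherwise
    fp = sum(row[k] for row in confusion) - tp  # predicted k, true class otherwise
    tn = total - tp - fn - fp
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
    }


def overall_accuracy(confusion):
    """Fraction of cases on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total
```

For a class such as NORM, high specificity means that few abnormal subjects are labeled normal, which is why the abstract reports it separately from the overall accuracy.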

5.
Methods ; 217: 1-9, 2023 09.
Article in English | MEDLINE | ID: mdl-37321525

ABSTRACT

Drug combination therapies are common practice in the treatment of cancer, but not all combinations result in synergy. As traditional screening approaches are restricted in their ability to uncover synergistic drug combinations, computer-aided medicine is becoming increasingly prevalent in this field. In this work, a predictive model of potential interactions between drugs, named MPFFPSDC, is presented, which can maintain the symmetry of drug inputs and eliminate inconsistencies in predictive results caused by different drug input orders or positions. The experimental results show that MPFFPSDC outperforms comparative models in major performance indicators and exhibits better generalization for independent data. Furthermore, the case study demonstrates that our model can capture molecular substructures that contribute to the synergistic effect of two drugs. These results indicate that MPFFPSDC not only offers strong predictive performance, but also has good model interpretability that may provide new insights for the study of drug interaction mechanisms and the development of new drugs.


Subjects
Neoplasms , Humans , Drug Synergism , Drug Combinations , Combination Drug Therapy , Neoplasms/drug therapy , Drug Interactions
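The abstract above does not say how MPFFPSDC keeps its predictions independent of drug order. One common way to obtain that property, shown here purely as an assumption, is to combine the two drug feature vectors with operations that are themselves symmetric (element-wise sum and product) before any downstream model sees them. The function name is hypothetical.

```python
def symmetric_pair_features(drug_a, drug_b):
    """Combine two equal-length feature vectors so that input order is irrelevant.

    Element-wise sum and product are both commutative, so swapping the
    two drugs yields exactly the same combined vector.
    """
    if len(drug_a) != len(drug_b):
        raise ValueError("feature vectors must have the same length")
    summed = [a + b for a, b in zip(drug_a, drug_b)]
    product = [a * b for a, b in zip(drug_a, drug_b)]
    return summed + product
```

Any model fed only with such a combined vector returns the same prediction for (A, B) and (B, A), which removes the inconsistency the abstract refers to.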
6.
Cereb Cortex ; 33(10): 5817-5828, 2023 05 09.
Article in English | MEDLINE | ID: mdl-36843049

ABSTRACT

Deep learning has become an effective tool for classifying biological sex based on functional magnetic resonance imaging (fMRI). However, research on what features within the brain are most relevant to this classification is still lacking. Model interpretability has become a powerful way to understand "black box" deep-learning models, and select features within the input data that are most relevant to the correct classification. However, very little work has been done employing these methods to understand the relationship between the temporal dimension of functional imaging signals and the classification of biological sex. Consequently, less attention has been paid to rectifying problems and limitations associated with feature explanation models, e.g. underspecification and instability. In this work, we first provide a methodology to limit the impact of underspecification on the stability of the measured feature importance. Then, using intrinsic connectivity networks from fMRI data, we provide a deep exploration of sex differences among functional brain networks. We report numerous conclusions, including activity differences in the visual and cognitive domains and major connectivity differences.


Subjects
Brain , Magnetic Resonance Imaging , Humans , Female , Male , Magnetic Resonance Imaging/methods , Brain Mapping/methods , Head
7.
Can J Microbiol ; 70(10): 446-460, 2024 Oct 01.
Article in English | MEDLINE | ID: mdl-39079170

ABSTRACT

With antimicrobial resistance (AMR) rapidly evolving in pathogens, quick and accurate identification of genetic determinants of phenotypic resistance is essential for improving surveillance, stewardship, and clinical mitigation. Machine learning (ML) models show promise for AMR prediction in diagnostics but require a deep understanding of internal processes to use effectively. Our study utilised AMR gene, pangenomic, and predicted plasmid features from 647 Enterococcus faecium and Enterococcus faecalis genomes across the One Health continuum, along with corresponding resistance phenotypes, to develop interpretive ML classifiers. Vancomycin resistance could be predicted with 99% accuracy with AMR gene features, 98% with pangenome features, and 96% with plasmid clusters. Top pangenome features overlapped with the resistance genes of the vanA operon, which are often laterally transmitted via plasmids. Doxycycline resistance prediction achieved approximately 92% accuracy with pangenome features, with the top feature being elements of Tn916 conjugative transposon, a tet(M) carrier. Erythromycin resistance prediction models achieved about 90% accuracy, but top features were negatively correlated with resistance due to the confounding effect of population structure. This work demonstrates the importance of reviewing ML models' features to discern biological relevance even when achieving high-performance metrics. Our workflow offers the potential to propose hypotheses for experimental testing, enhancing the understanding of AMR mechanisms, which are crucial for combating the AMR crisis.


Subjects
Anti-Bacterial Agents , Bacterial Drug Resistance , Enterococcus faecalis , Enterococcus faecium , Bacterial Genome , Machine Learning , Plasmids , Enterococcus faecalis/genetics , Enterococcus faecalis/drug effects , Enterococcus faecium/genetics , Enterococcus faecium/drug effects , Anti-Bacterial Agents/pharmacology , Bacterial Drug Resistance/genetics , Plasmids/genetics , Humans , Microbial Sensitivity Tests , Gram-Positive Bacterial Infections/microbiology , Bacterial Proteins/genetics
8.
J Biopharm Stat ; : 1-14, 2024 Jun 11.
Article in English | MEDLINE | ID: mdl-38860696

ABSTRACT

Accurate prediction of a rare and clinically important event following study treatment has been crucial in drug development. For instance, the rarity of an adverse event is often commensurate with the seriousness of medical consequences, and delayed detection of the rare adverse event can pose significant or even life-threatening health risks to patients. In this machine learning case study, we demonstrate, with an example originating from a real clinical trial setting, how to define and solve the rare clinical event prediction problem using machine learning in the pharmaceutical industry. The unique contributions of this work include the proposal of a six-step investigation framework that facilitates communication with non-technical stakeholders and the interpretation of model performance in terms of practical consequences in the context of patient screening for a future clinical trial. In terms of machine learning methodology, for splitting the data into training and test sets, we adapt the rare-event stratified split approach (from scikit-learn) to further account for group splitting, so that multiple records of a patient are kept together. To handle imbalanced data due to rare events in model training, the cost-sensitive learning approach is employed to give more weight to the minority class, and precision together with recall is used to capture prediction performance instead of the raw accuracy rate. Finally, we demonstrate how to apply the state-of-the-art SHAP values to identify important risk factors to improve model interpretability.
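
The two methodological steps described above, a stratified split that keeps all records of one patient together and cost-sensitive class weights, can be sketched in plain Python. This is not the authors' code: the function names are hypothetical, patients are stratified here by whether they ever had the event (an assumption), and the weight formula is the n_samples / (n_classes * class_count) heuristic that scikit-learn uses for class_weight="balanced".

```python
import random
from collections import defaultdict


def stratified_group_split(patient_ids, labels, test_fraction=0.25, seed=0):
    """Split record indices so that no patient appears in both sets.

    Patients are stratified by whether any of their records is an event (1).
    Each non-empty stratum sends at least one patient to the test set.
    """
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)
    positives = {p for p, idxs in by_patient.items() if any(labels[i] for i in idxs)}
    negatives = set(by_patient) - positives

    rng = random.Random(seed)
    test_patients = set()
    for stratum in (positives, negatives):
        ordered = sorted(stratum)  # fixed order so the seed is reproducible
        rng.shuffle(ordered)
        n_test = max(1, round(len(ordered) * test_fraction)) if ordered else 0
        test_patients.update(ordered[:n_test])

    train = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train, test


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = defaultdict(int)
    for y in labels:
        counts[y] += 1
    n = len(labels)
    return {c: n / (len(counts) * k) for c, k in counts.items()}
```

With a rare event, the minority class receives a much larger weight than the majority class, so a misclassified event costs more during training. In practice, scikit-learn's StratifiedGroupKFold offers a library implementation of a grouped, stratified split.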

9.
J Appl Toxicol ; 44(6): 892-907, 2024 06.
Article in English | MEDLINE | ID: mdl-38329145

ABSTRACT

The accurate identification of chemicals with ocular toxicity is of paramount importance in health hazard assessment. In contemporary chemical toxicology, there is a growing emphasis on refining, reducing, and replacing animal testing in safety evaluations. Therefore, the development of robust computational tools is crucial for regulatory applications. The performance of predictive models is heavily reliant on the quality and quantity of data. In this investigation, we amalgamated the most extensive dataset (4901 compounds) sourced from governmental GHS-compliant databases and literature to develop binary classification models of chemical ocular toxicity. We employed 12 molecular representations in conjunction with six machine learning algorithms and two deep learning algorithms to create a series of binary classification models. The findings indicated that the deep learning method GCN outperformed the machine learning models in cross-validation, achieving an AUC of 0.915. However, the top-performing machine learning model (RF-Descriptor) demonstrated excellent performance with an AUC of 0.869 on the test set and was therefore selected as the best model. To enhance model interpretability, we applied the SHAP method and attention-weight analysis. The two approaches offered visual depictions of the relevance of key descriptors and substructures in predicting ocular toxicity of chemicals. Thus, we successfully struck a delicate balance between data quality and model interpretability, rendering our model valuable for predicting and comprehending potential ocular-toxic compounds in the early stages of drug discovery.


Subjects
Computer Simulation , Deep Learning , Machine Learning , Humans , Eye/drug effects , Factual Databases , Animals , Algorithms
10.
Sensors (Basel) ; 24(16)2024 Aug 12.
Article in English | MEDLINE | ID: mdl-39204919

ABSTRACT

With the rapid advancement of the Internet of Things, network security has garnered increasing attention from researchers. Applying deep learning (DL) has significantly enhanced the performance of Network Intrusion Detection Systems (NIDSs). However, due to its complexity and "black box" problem, deploying DL-based NIDS models in practical scenarios poses several challenges, including model interpretability and being lightweight. Feature selection (FS) in DL models plays a crucial role in minimizing model parameters and decreasing computational overheads while enhancing NIDS performance. Hence, selecting effective features remains a pivotal concern for NIDSs. In light of this, this paper proposes an interpretable feature selection method for encrypted traffic intrusion detection based on SHAP and causality principles. This approach utilizes the results of model interpretation for feature selection to reduce feature count while ensuring model reliability. We evaluate and validate our proposed method on two public network traffic datasets, CICIDS2017 and NSL-KDD, employing both a CNN and a random forest (RF). Experimental results demonstrate superior performance achieved by our proposed method.

11.
J Environ Manage ; 366: 121921, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39053375

ABSTRACT

Machine learning models are often viewed as black boxes in landslide susceptibility assessment, lacking an analysis of how input features predict outcomes. This makes it challenging to understand the mechanisms and key factors behind landslides. To enhance the interpretability of machine learning models in wide-area landslide susceptibility assessments, this study uses the Shapley method to explore the contributions of feature factors from local, global, and spatial perspectives. Landslide susceptibility assessments were conducted using random forest (RF), support vector machine (SVM), and eXtreme Gradient Boosting (XGBoost) models, focusing on the geologically complex Sichuan-Tibet region. Initially, the study reveals the contributions of specific key feature factors to landslides from a local perspective. It then examines the overall impact of interactions among feature factors on landslide occurrence globally. Finally, it unveils the spatial distribution patterns of the contributions of various feature factors to landslide occurrence. The analysis indicates the following: (1) The XGBoost model excels in landslide susceptibility assessment, achieving accuracy, precision, recall, F1-score, and AUC values of 0.7815, 0.7858, 0.7962, 0.7910, and 0.86, respectively; (2) The Shapley method identifies the leading factors for landslides in the Sichuan-Tibet region as Elevation (3000-4000 m), PGA (1-2 g), NDVI (<0.5), and distance to rivers (<3 km); (3) Using the Shapley method, the study explains the contributions, interaction mechanisms, and spatial distribution patterns of landslide susceptibility feature factors across local, global, and spatial perspectives. These findings offer new avenues and methods for the in-depth exploration and scientific prediction of landslide risks.


Subjects
Landslides , Tibet , Machine Learning , Support Vector Machine , China
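The Shapley-value attribution that several abstracts in this list rely on (via SHAP) rests on a simple definition: a feature's contribution is its marginal effect on the model output, averaged over every order in which features could be added. The sketch below computes this exactly for a handful of features. It is a didactic illustration, not the SHAP library's algorithm, and the value function passed in is assumed to be supplied by the caller.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values by averaging marginal contributions over all orderings.

    value_fn takes a frozenset of feature names and returns a number.
    The cost grows factorially with the number of features, so this is
    only practical for a few features.
    """
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = frozenset()
        for f in order:
            with_f = coalition | {f}
            totals[f] += value_fn(with_f) - value_fn(coalition)
            coalition = with_f
    return {f: t / len(orderings) for f, t in totals.items()}
```

The resulting values always sum to the difference between the output with all features and the output with none, and an interaction between two features is shared equally between them.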
12.
BMC Bioinformatics ; 24(1): 434, 2023 Nov 15.
Article in English | MEDLINE | ID: mdl-37968615

ABSTRACT

BACKGROUND: In the field of biology and medicine, the interpretability and accuracy are both important when designing predictive models. The interpretability of many machine learning models such as neural networks is still a challenge. Recently, many researchers utilized prior information such as biological pathways to develop neural networks-based methods, so as to provide some insights and interpretability for the models. However, the prior biological knowledge may be incomplete and there still exists some unknown information to be explored. RESULTS: We proposed a novel method, named PathExpSurv, to gain an insight into the black-box model of neural network for cancer survival analysis. We demonstrated that PathExpSurv could not only incorporate the known prior information into the model, but also explore the unknown possible expansion to the existing pathways. We performed downstream analyses based on the expanded pathways and successfully identified some key genes associated with the diseases and original pathways. CONCLUSIONS: Our proposed PathExpSurv is a novel, effective and interpretable method for survival analysis. It has great utility and value in medical diagnosis and offers a promising framework for biological research.


Subjects
Knowledge , Medicine , Machine Learning , Survival Analysis , Genetic Association Studies
13.
Neuroimage ; 276: 120209, 2023 08 01.
Article in English | MEDLINE | ID: mdl-37269957

ABSTRACT

Electroencephalography (EEG)-based brain-computer interfaces (BCIs) pose a challenge for decoding due to their low spatial resolution and signal-to-noise ratio. Typically, EEG-based recognition of activities and states involves the use of prior neuroscience knowledge to generate quantitative EEG features, which may limit BCI performance. Although neural network-based methods can effectively extract features, they often encounter issues such as poor generalization across datasets, high predicting volatility, and low model interpretability. To address these limitations, we propose a novel lightweight multi-dimensional attention network, called LMDA-Net. By incorporating two novel attention modules designed specifically for EEG signals, the channel attention module and the depth attention module, LMDA-Net is able to effectively integrate features from multiple dimensions, resulting in improved classification performance across various BCI tasks. LMDA-Net was evaluated on four high-impact public datasets, including motor imagery (MI) and P300-Speller, and was compared with other representative models. The experimental results demonstrate that LMDA-Net outperforms other representative methods in terms of classification accuracy and predicting volatility, achieving the highest accuracy in all datasets within 300 training epochs. Ablation experiments further confirm the effectiveness of the channel attention module and the depth attention module. To facilitate an in-depth understanding of the features extracted by LMDA-Net, we propose class-specific neural network feature interpretability algorithms that are suitable for evoked responses and endogenous activities. By mapping the output of the specific layer of LMDA-Net to the time or spatial domain through class activation maps, the resulting feature visualizations can provide interpretable analysis and establish connections with EEG time-spatial analysis in neuroscience. 
In summary, LMDA-Net shows great potential as a general decoding model for various EEG tasks.


Subjects
Brain-Computer Interfaces , Humans , Computer Neural Networks , Algorithms , Electroencephalography/methods , Psychological Generalization , Imagination/physiology
14.
Brief Bioinform ; 22(2): 2126-2140, 2021 03 22.
Article in English | MEDLINE | ID: mdl-32363397

ABSTRACT

Promoters are short consensus sequences of DNA, which are responsible for transcription activation or the repression of all genes. There are many types of promoters in bacteria with important roles in initiating gene transcription. Therefore, solving promoter-identification problems has important implications for improving the understanding of their functions. To this end, computational methods targeting promoter classification have been established; however, their performance remains unsatisfactory. In this study, we present a novel stacked-ensemble approach (termed SELECTOR) for identifying both promoters and their respective classification. SELECTOR combines the composition of k-spaced nucleic acid pairs, parallel correlation pseudo-dinucleotide composition, position-specific trinucleotide propensity based on single-strand, and DNA strand features, and uses five popular tree-based ensemble learning algorithms to build a stacked model. Both 5-fold cross-validation tests using benchmark datasets and independent tests using the newly collected independent test dataset showed that SELECTOR outperformed state-of-the-art methods in both general and specific types of promoter prediction in Escherichia coli. Furthermore, this novel framework provides essential interpretations that aid understanding of model success by leveraging the powerful Shapley Additive exPlanation algorithm, thereby highlighting the most important features relevant for predicting both general and specific types of promoters and overcoming the limitations of existing 'Black-box' approaches that are unable to reveal causal relationships from large amounts of initially encoded features.


Subjects
Escherichia coli/genetics , Machine Learning , Genetic Promoter Regions , Datasets as Topic , Bacterial Genes , Reproducibility of Results
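Of the encodings named in the abstract above, the composition of k-spaced nucleic acid pairs is the simplest to illustrate: for each gap size, count how often every ordered pair of nucleotides occurs with exactly that many positions between them, then normalize by the number of windows. The sketch below follows that common definition; the function name, the feature-key format, and the default maximum gap are assumptions, and SELECTOR's exact settings are not given in this record.

```python
from itertools import product


def cksnap(sequence, max_gap=2, alphabet="ACGT"):
    """Composition of k-spaced nucleic acid pairs for gaps 0..max_gap.

    Returns a dict keyed like "AC.gap1" holding the frequency of the pair
    A...C separated by one intervening position. Pairs containing
    characters outside the alphabet are ignored.
    """
    pairs = ["".join(p) for p in product(alphabet, repeat=2)]
    features = {}
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(pairs, 0)
        n_windows = len(sequence) - gap - 1
        for i in range(n_windows):
            pair = sequence[i] + sequence[i + gap + 1]
            if pair in counts:
                counts[pair] += 1
        for pair in pairs:
            features[f"{pair}.gap{gap}"] = counts[pair] / n_windows if n_windows > 0 else 0.0
    return features
```

Each gap contributes 16 features for DNA, so the vector length is 16 * (max_gap + 1) regardless of sequence length, which makes the encoding suitable as input to tree-based learners.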
15.
Brief Bioinform ; 22(4)2021 07 20.
Article in English | MEDLINE | ID: mdl-33152766

ABSTRACT

Origins of replication sites (ORIs), which refer to the initiation locations of genomic DNA replication, play essential roles in the DNA replication process. Detecting the distribution of ORIs at genome scale is one of the key steps toward an in-depth understanding of their regulatory mechanisms. In this study, we present a novel machine learning-based approach called Stack-ORI, encompassing 10 cell-specific prediction models for identifying ORIs from four different eukaryotic species (Homo sapiens, Mus musculus, Drosophila melanogaster and Arabidopsis thaliana). For each cell-specific model, we employed 12 feature encoding schemes that cover nucleic acid composition, position-specific and physicochemical property information. The optimal feature set was identified for each encoding individually and used to develop the respective baseline model with the eXtreme Gradient Boosting (XGBoost) classifier. Subsequently, the predicted scores of the 12 baseline models were integrated as a novel feature vector to train XGBoost and develop the final model. Extensive experimental results show that Stack-ORI achieves significantly better performance than its baseline models on both training and independent datasets. Interestingly, Stack-ORI consistently outperforms existing predictors in all cell-specific models, not only on training data but also on independent tests. Moreover, our novel approach provides necessary interpretations that help in understanding model success by leveraging the powerful SHapley Additive exPlanation algorithm, thus underlining the most important feature encoding schemes for predicting cell-specific ORIs.


Subjects
Nucleic Acid Databases , Genetic Models , Replication Origin , Support Vector Machine , Genetic Transcription , Animals , Drosophila melanogaster , Humans , Mice
16.
Ecotoxicol Environ Saf ; 259: 115052, 2023 Jul 01.
Article in English | MEDLINE | ID: mdl-37224784

ABSTRACT

Owing to the rapid development of big data technology, the use of machine learning methods to identify soil pollution of potentially contaminated sites (PCS) at regional scales and in different industries has become a research hot spot. However, due to the difficulty in obtaining key indexes of site pollution sources and pathways, current methods have problems such as low accuracy of model predictions and insufficient scientific basis. In this study, we collected the environmental data of 199 PCS in 6 typical industries involving heavy metal and organic pollution. Then, 21 indexes based on basic information, potential for pollution from product and raw material, pollution control level, and migration capacity of soil pollutants were used to establish the soil pollution identification index system. We fused the original indexes into a new feature subset with 11 indexes through the method of consolidation calculation. The new feature subset was then used to train machine learning models of random forest (RF), support vector machine (SVM), and multilayer perceptron (MLP), and tested to determine whether it improved the accuracy and precision of soil pollution identification models. The results of correlation analysis showed that the four new indexes created by feature fusion have a correlation with soil pollution similar to that of the original indexes. The accuracies and precisions of the three machine learning models trained on the new feature subset were 67.4%-72.9% and 72.0%-74.7%, which were 2.1%-2.5% and 0.3%-5.7% higher than those of the models trained on the original indexes, respectively. When the PCS were divided into typical heavy metal and organic pollution sites according to the enterprise industries, the accuracy of the models trained on the two datasets for identifying soil heavy metal and organic pollution was significantly improved to approximately 80%.
Owing to the imbalance in positive and negative samples in the prediction of soil organic pollution, the precisions of soil organic pollution identification models were 58%-72.5%, which were significantly lower than their accuracies. According to the factor analysis based on the model interpretability of SHAP, most of the indexes of basic information, potential for pollution from product and raw material, and pollution control level had different degrees of impact on soil pollution. However, the indexes of migration capacity of soil pollutants had the least effect in the classification task of soil pollution identification of PCS. Among the indexes, traces of soil pollution, industrial utilization years/start-up time, pollution control risk scores and enterprise scale had the greatest effects on soil pollution, with mean SHAP values of 0.17-0.36, which reflected their contribution to soil pollution and could help to optimize the current index scoring of the technical regulation for identifying site soil pollution. This study provides a new technical method to identify soil pollution based on big data and machine learning methods, in addition to providing a reference and scientific basis for environmental management and soil pollution control of PCS.


Subjects
Heavy Metals , Soil Pollutants , Environmental Monitoring/methods , Environmental Pollution/analysis , Heavy Metals/analysis , Machine Learning , Soil Pollutants/analysis , Soil
17.
J Arthroplasty ; 38(10): 1967-1972, 2023 10.
Article in English | MEDLINE | ID: mdl-37315634

ABSTRACT

BACKGROUND: Existing machine learning models that predicted prolonged lengths of stay (LOS) following primary total hip arthroplasty (THA) were limited by the small training volume and exclusion of important patient factors. This study aimed to develop machine learning models using a national-scale data set and examine their performance in predicting prolonged LOS following THA. METHODS: A total of 246,265 THAs were analyzed from a large database. Prolonged LOS was defined as exceeding the 75th percentile of all LOSs in the cohort. Candidate predictors of prolonged LOS were selected by recursive feature elimination and used to construct four machine learning models-artificial neural network, random forest, histogram-based gradient boosting, and k-nearest neighbor. The model performance was assessed by discrimination, calibration, and utility. RESULTS: All models exhibited excellent performance in discrimination (area under the receiver operating characteristic curve [AUC] = 0.72 to 0.74) and calibration (slope: 0.83 to 1.18, intercept: -0.01 to 0.11, Brier score: 0.185 to 0.192) during both training and testing sessions. The artificial neural network was the best performer with an AUC of 0.73, calibration slope of 0.99, calibration intercept of -0.01, and Brier score of 0.185. All models showed great utility by producing higher net benefits than the default treatment strategies in the decision curve analyses. Age, laboratory tests, and surgical variables were the strongest predictors of prolonged LOS. CONCLUSION: The excellent prediction performance of machine learning models demonstrated their capacity to identify patients prone to prolonged LOS. Many factors contributing to prolonged LOS can be optimized to minimize hospital stay for high-risk patients.


Subjects
Hip Arthroplasty , Humans , Machine Learning , Computer Neural Networks , Patients , ROC Curve
18.
BMC Endocr Disord ; 22(1): 214, 2022 Aug 26.
Article in English | MEDLINE | ID: mdl-36028865

ABSTRACT

OBJECTIVE: The internal workings of machine learning algorithms are complex, and such algorithms are regarded as low-interpretability "black box" models, making it difficult for domain experts to understand and trust them. The study uses metabolic syndrome (MetS) as the entry point to analyze and evaluate the application value of model interpretability methods for predictive models that are difficult to interpret. METHODS: The study collects data from a chain of health examination institutions in Urumqi from 2017 to 2019 and retains 39,134 records after preprocessing such as deletion and imputation. Recursive feature elimination (RFE) is used for feature selection to reduce redundancy; MetS risk prediction models (logistic regression, random forest, XGBoost) are built on a feature subset, and accuracy, sensitivity, specificity, Youden index, and AUROC are used to evaluate classification performance; post-hoc model-agnostic interpretation methods (variable importance, LIME) are used to interpret the results of the predictive model. RESULTS: Eighteen physical examination indicators are selected by RFE, which reduces redundancy in the physical examination data. Random forest and XGBoost models have higher accuracy, sensitivity, specificity, Youden index, and AUROC values than logistic regression. XGBoost models have higher sensitivity, Youden index, and AUROC values than random forest. The study uses variable importance, LIME, and PDP for global and local interpretation of the optimal MetS risk prediction model (XGBoost); the different interpretation methods offer different insights into the model results, are flexible with respect to model choice, and can visualize the process and reasons behind the model's decisions.
The interpretable risk prediction model in this study can help to identify risk factors associated with MetS; the results show that, in addition to traditional risk factors such as overweight and obesity, hyperglycemia, hypertension, and dyslipidemia, MetS is also associated with other factors, including age, creatinine, uric acid, and alkaline phosphatase. CONCLUSION: Applying model interpretability methods to a black box model not only keeps model application flexible but also compensates for the model's lack of interpretability. Model interpretability methods can be used as a novel means of identifying variables that are more likely to be good predictors.
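The evaluation metrics listed in the abstract above (accuracy, sensitivity, specificity, Youden index) all follow from the four cells of a confusion matrix. The sketch below shows one way to compute them; the function name and the toy labels are hypothetical and do not come from the study.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity, and Youden index for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    sensitivity = tp / (tp + fn)  # share of true cases that are detected
    specificity = tn / (tn + fp)  # share of non-cases correctly ruled out
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Youden index J = sensitivity + specificity - 1, ranging from -1 to 1.
        "youden": sensitivity + specificity - 1,
    }


# Hypothetical labels: 1 = MetS present, 0 = absent (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
```

The sketch assumes both classes are present in `y_true`; otherwise the sensitivity or specificity denominator would be zero.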


Subjects
Metabolic Syndrome , Algorithms , Humans , Logistic Models , Machine Learning , Risk Factors
19.
Graefes Arch Clin Exp Ophthalmol ; 260(8): 2461-2473, 2022 Aug.
Article in English | MEDLINE | ID: mdl-35122132

ABSTRACT

PURPOSE: Neovascular age-related macular degeneration (nAMD) is a major global cause of blindness. Whilst anti-vascular endothelial growth factor (anti-VEGF) treatment is effective, response varies considerably between individuals. Thus, patients face substantial uncertainty regarding their future ability to perform daily tasks. In this study, we evaluate the performance of an automated machine learning (AutoML) model which predicts visual acuity (VA) outcomes in patients receiving treatment for nAMD, in comparison to a manually coded model built using the same dataset. Furthermore, we evaluate model performance across ethnic groups and analyse how the models reach their predictions. METHODS: Binary classification models were trained to predict whether patients' VA would be 'Above' or 'Below' a score of 70 one year after initiating treatment, measured using the Early Treatment Diabetic Retinopathy Study (ETDRS) chart. The AutoML model was built using the Google Cloud Platform, whilst the bespoke model was trained using an XGBoost framework. Models were compared and analysed using the What-if Tool (WIT), a novel model-agnostic interpretability tool. RESULTS: Our study included 1631 eyes from patients attending Moorfields Eye Hospital. The AutoML model (area under the curve [AUC], 0.849) achieved a highly similar performance to the XGBoost model (AUC, 0.847). Using the WIT, we found that the models over-predicted negative outcomes in Asian patients and performed worse in those with an ethnic category of Other. Baseline VA, age and ethnicity were the most important determinants of model predictions. Partial dependence plot analysis revealed a sigmoidal relationship between baseline VA and the probability of an outcome of 'Above'. CONCLUSION: We have described and validated an AutoML-WIT pipeline which enables clinicians with minimal coding skills to match the performance of a state-of-the-art algorithm and obtain explainable predictions.
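The AUC values compared in the abstract above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes that quantity by direct pairwise comparison; the function name and the toy scores are assumptions for illustration and are unrelated to the study data or to either model.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0  # positive case ranked above negative case
            elif pos == neg:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities of an 'Above' outcome (illustrative only).
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
auc = roc_auc(y_true, scores)
```

The pairwise loop is quadratic in the number of cases, which is fine for a small example; rank-based formulas give the same value more efficiently on large cohorts.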


Subjects
Macular Degeneration , Wet Macular Degeneration , Angiogenesis Inhibitors/therapeutic use , Humans , Intravitreal Injections , Machine Learning , Macular Degeneration/drug therapy , Ranibizumab/therapeutic use , Retrospective Studies , Treatment Outcome , Vascular Endothelial Growth Factor A , Visual Acuity , Wet Macular Degeneration/diagnosis , Wet Macular Degeneration/drug therapy
20.
BMC Med Inform Decis Mak ; 22(1): 343, 2022 Dec 29.
Article in English | MEDLINE | ID: mdl-36581881

ABSTRACT

BACKGROUND: We aimed to develop an early warning system for real-time sepsis prediction in the ICU using machine learning methods, with tools for interpretative analysis of the predictions. In particular, we focus on deploying the system in a target medical center with few historical samples. METHODS: Light Gradient Boosting Machine (LightGBM) and multilayer perceptron (MLP) models were trained on the Medical Information Mart for Intensive Care (MIMIC-III) dataset and then fine-tuned on the private Historical Database of local Ruijin Hospital (HDRJH) using transfer learning. Shapley Additive Explanations (SHAP) analysis was employed to characterize feature importance in the predictions. Finally, the performance of the sepsis prediction system was evaluated in a real-world study in the ICU of the target Ruijin Hospital. RESULTS: The datasets comprised 6891 patients from MIMIC-III, 453 from HDRJH, and 67 from Ruijin real-world data. The areas under the receiver operating characteristic curve (AUCs) for the LightGBM and MLP models derived from MIMIC-III were 0.98 to 0.98 and 0.95 to 0.96, respectively, on the MIMIC-III dataset and, in comparison, 0.82 to 0.86 and 0.84 to 0.87, respectively, on HDRJH, from 1 to 5 h before onset. After transfer learning and ensemble learning, the AUCs of the final ensemble model improved to 0.94 to 0.94 on HDRJH and to 0.86 to 0.9 in the real-world study in the ICU of the target Ruijin Hospital. In addition, the SHAP analysis illustrated the importance of age, antibiotics, net balance, and ventilation for sepsis prediction, making the model interpretable. CONCLUSIONS: Our machine learning model allows accurate real-time prediction of sepsis up to 5 h before onset. Transfer learning can effectively improve the feasibility of deploying the prediction model in the target cohort and improve model performance in external validation.
SHAP analysis indicates that the role of antibiotic usage and fluid management needs further investigation. We argue that our system and methodology have the potential to improve ICU management by helping medical practitioners identify patients at risk of sepsis and prepare for timely diagnosis and intervention. TRIAL REGISTRATION: NCT05088850 (retrospectively registered).
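SHAP attributions, as used in the abstract above, are based on Shapley values: each feature's average marginal contribution to a prediction over all orderings in which features can be switched from a baseline to their observed values. The sketch below computes exact Shapley values for a tiny hypothetical scoring function. It is not the SHAP library, which relies on faster approximations and a background dataset, and the function names, toy model, and inputs are all assumptions.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature for one prediction.

    Features are switched from their baseline value to their observed value
    one at a time, and marginal changes are averaged over every ordering.
    """
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]
            updated = predict(current)
            totals[i] += updated - previous  # marginal contribution of feature i
            previous = updated
    return [total / len(orderings) for total in totals]


def toy_risk(v):
    """Hypothetical score with one interaction term (not the study's model)."""
    return 2.0 * v[0] + 3.0 * v[1] + v[0] * v[2]


x = [1.0, 1.0, 1.0]         # hypothetical observed feature values
baseline = [0.0, 0.0, 0.0]  # hypothetical reference values
phi = shapley_values(toy_risk, x, baseline)
```

The attributions sum to the difference between the prediction at `x` and the prediction at `baseline`, and the interaction term is split evenly between the two features involved. Enumerating all orderings is only practical for a handful of features.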


Subjects
Intensive Care Units , Sepsis , Humans , Critical Care , Sepsis/diagnosis , Area Under Curve , Factual Databases