Results 1 - 20 of 88
1.
Fundam Res ; 4(4): 972-978, 2024 Jul.
Article in English | MEDLINE | ID: mdl-39156569

ABSTRACT

With the soaring generation of hazardous waste (HW) during industrialization and urbanization, HW illegal dumping continues to be an intractable global issue. Particularly in developing regions with lax regulations, it has become a major source of soil and groundwater contamination. One dominant challenge for HW illegal dumping supervision is the invisibility of dumping sites, which makes HW illegal dumping difficult to detect and thereby causes long-term adverse impacts on the environment. How to use the limited historical supervision records to screen for potential dumping sites across a whole region is a key challenge to be addressed. In this study, a novel machine learning model based on the positive-unlabeled (PU) learning algorithm was proposed to resolve this problem through an ensemble method that iteratively mines the features of the limited historical cases. Validation of the random forest-based PU model showed that the predicted top 30% of high-risk areas covered 68.1% of newly reported cases in the studied region, indicating the reliability of the model prediction. This novel framework is also promising for other environmental management scenarios that must deal with numerous unknown samples based on limited prior experience.
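
The validation figure in this abstract (the top 30% of predicted high-risk areas covering 68.1% of newly reported cases) is a capture-rate metric. The following Python sketch shows one way such a figure could be computed. The function name `capture_rate`, the area scores, and the case labels are hypothetical illustrations, not the authors' code or data.

```python
def capture_rate(scores, is_new_case, top_fraction=0.30):
    """Share of new cases that fall inside the top-scoring fraction of areas."""
    if len(scores) != len(is_new_case):
        raise ValueError("scores and labels must have the same length")
    total_cases = sum(is_new_case)
    if total_cases == 0:
        raise ValueError("no new cases to evaluate")
    # Rank area indices from highest to lowest predicted risk.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Number of areas in the top fraction (at least one area).
    n_top = max(1, round(len(scores) * top_fraction))
    captured = sum(is_new_case[i] for i in ranked[:n_top])
    return captured / total_cases


# Hypothetical toy data: 10 areas, 3 newly reported cases.
scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.4, 0.05, 0.6, 0.15]
new_cases = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
rate = capture_rate(scores, new_cases, top_fraction=0.30)
print(f"capture rate in top 30%: {rate:.3f}")
```

In this toy case two of the three new cases sit in the three highest-scoring areas, so the capture rate is two thirds.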

2.
Sci Rep ; 14(1): 18852, 2024 08 14.
Article in English | MEDLINE | ID: mdl-39143135

ABSTRACT

The controversy surrounding whether serum total cholesterol is a risk factor for the graded progression of knee osteoarthritis (KOA) has prompted this study to develop an authentic prediction model using a machine learning (ML) algorithm. The objective was to investigate whether serum total cholesterol plays a significant role in the progression of KOA. This cross-sectional study utilized data from the public database DRYAD. LASSO regression was employed to identify risk factors associated with the graded progression of KOA. Additionally, six ML algorithms were utilized in conjunction with clinical features and relevant variables to construct a prediction model. The significance and ranking of variables were carefully analyzed. The variables incorporated in the model include JBS3, Diabetes, Hypertension, HDL, TC, BMI, SES, and AGE. Serum total cholesterol emerged as a significant risk factor for the graded progression of KOA in all six ML algorithms used for importance ranking. The XGBoost algorithm was selected based on its combined best performance on the training and validation sets. The ML algorithm enables predictive modeling of risk factors for the progression of the KOA K-L classification and confirms that serum total cholesterol is an important risk factor for the progression of KOA.


Subject(s)
Cholesterol , Disease Progression , Machine Learning , Osteoarthritis, Knee , Humans , Cholesterol/blood , Osteoarthritis, Knee/blood , Male , Female , Risk Factors , Middle Aged , Cross-Sectional Studies , Aged , Algorithms
3.
J Environ Manage ; 368: 122107, 2024 Aug 09.
Article in English | MEDLINE | ID: mdl-39126840

ABSTRACT

In China, population growth and aging have partially negated the public health benefits of air pollution control measures, underscoring the ongoing need for precise PM2.5 monitoring and mapping. Despite its prevalence, the satellite-derived Aerosol Optical Depth (AOD) method for estimating PM2.5 concentrations often encounters significant spatial data gaps. Additionally, current research still needs better representation of PM2.5 spatiotemporal heterogeneity. Addressing these challenges, we developed a two-stage model employing the Extreme Gradient Boosting (XGBoost) algorithm. By incorporating improved spatiotemporal factors, we achieved high-precision and full-coverage daily 1-km PM2.5 mappings across China for the year 2020 without utilizing AOD products. Specifically, Model 1 develops improved temporal encodings and a terrain classification factor (DC), while Model 2 constructs an enhanced spatial autocorrelation term (Ps) by integrating observed and estimated values. Notably, Model 2 excelled in 10-fold sample-based cross-validation, achieving a coefficient of determination of 0.948, a mean absolute error of 3.792 µg/m³, a root mean square error of 7.144 µg/m³, and a mean relative error of 14.171%. Feature importance and Shapley Additive exPlanations (SHAP) analyses determined the relative importance of predictors in model training and outcome prediction, while correlation analysis identified strong links between improved temporal encodings, PM2.5 concentrations, and significant meteorological factors. Two-way Partial Dependence Plots (PDPs) further explored the interactions among these factors and their impact on PM2.5 levels. Compared to traditional methods, improved temporal encodings align more closely with seasonal variations and synergize more effectively with meteorological factors. Besides, the structured nature of DC aids in model training, while the improved Ps more effectively captures PM2.5's spatial autocorrelation, outperforming traditional Ps. 
Overall, this study effectively represents spatiotemporal information, thereby boosting model accuracy and enabling seamless large-scale PM2.5 estimations. It offers insight into the variables and models, with significant implications for future air pollution research.

4.
Front Genet ; 15: 1405032, 2024.
Article in English | MEDLINE | ID: mdl-39050251

ABSTRACT

Accurately predicting the binding affinities between Human Leukocyte Antigen (HLA) molecules and peptides is a crucial step in understanding the adaptive immune response. This knowledge can have important implications for the development of effective vaccines and the design of targeted immunotherapies. Existing sequence-based methods are insufficient to capture structural information. In addition, current methods lack model interpretability, which hinders the identification of the key binding amino acids between the two molecules. To address these limitations, we proposed an interpretable graph convolutional neural network (GCNN) based prediction method named GIHP. Considering the size difference between HLA molecules and short peptides, GIHP represents the HLA structure as an amino acid-level graph and the peptide SMILES string as an atom-level graph. For interpretation, we designed a novel visual explanation method, gradient weighted activation mapping (Grad-WAM), for identifying key binding residues. GIHP achieved better prediction accuracy than state-of-the-art methods across various datasets. According to current research findings, mutations in key HLA-peptide binding residues directly impact immunotherapy efficacy. Therefore, we tested whether the highlighted key residues can significantly distinguish immunotherapy patient groups, and verified that the identified functional residues successfully separate patient survival groups across breast, bladder, and pan-cancer datasets. The results demonstrate that GIHP improves the accuracy and interpretability of HLA-peptide prediction, and the findings of this study can be used to guide personalized cancer immunotherapy treatment. Code and datasets are publicly accessible at: https://github.com/sdustSu/GIHP.

5.
Endocrine ; 2024 Jun 10.
Article in English | MEDLINE | ID: mdl-38856840

ABSTRACT

OBJECTIVE: This study aimed to develop and evaluate machine-learning models for predicting the onset of overweight in adolescents aged 14 to 17, using easily collectible personal information. METHODS: This was a one-year prospective cohort study. Baseline data were collected through anthropometric measurements and questionnaires, and the incidence of overweight was determined one year later via anthropometric measurements. Predictive factors were selected through univariate analysis. Six machine-learning models were developed for predicting the onset of overweight. SHapley Additive exPlanations (SHAP) was used for global and local interpretation of the models. RESULTS: Out of 1,241 adolescents, 204 (16.4%) were identified as overweight after one year. Nineteen features were associated with overweight incidence in univariate analysis. Participants were randomly divided into a training group and a testing group in a 7:3 ratio. The Light Gradient Boosting Machine (LGBM) algorithm outperformed the other models, achieving the following metrics: Accuracy (0.956), Recall (0.812), Specificity (0.983), F1-score (0.855), AUC (0.961). Importance ranking revealed that a minimal set of the top 11 features can maintain the stability of model performance. CONCLUSIONS: The onset of overweight in adolescents was accurately predicted using easily collectible personal information. The LGBM-based model exhibited superior performance. The oversampling technique notably improved model performance. The model interpretation technique provided innovative strategies for managing adolescent overweight/obesity.
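
The accuracy, recall, specificity, and F1-score reported in this abstract all derive from a binary confusion matrix. The following sketch shows the standard definitions. The function name `classification_metrics` and the counts passed to it are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)            # also called sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for a test set of 200 participants.
m = classification_metrics(tp=40, fp=10, tn=140, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With an outcome prevalence near 16%, as in this cohort, accuracy can stay high even when recall is modest, which is why the abstract reports recall and specificity separately.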

6.
J Biomed Inform ; 154: 104652, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38718897

ABSTRACT

OBJECTIVES: Ischemic heart disease (IHD) is a significant contributor to global mortality and disability, imposing a substantial social and economic burden on individuals and healthcare systems. To enhance the efficient allocation of medical resources and ultimately benefit a larger population, accurate prediction of healthcare costs is crucial. METHODS: We developed an interpretable IHD hospitalization cost prediction model that integrates network analysis with machine learning. Specifically, our network-enhanced model extracts explainable features by leveraging a diagnosis-procedure concurrence network and advanced graph kernel techniques, facilitating the capture of intricate relationships between medical codes. RESULTS: The proposed model achieved an R2 of 0.804 ± 0.008 and a root mean square error (RMSE) of 17,076 ± 420 CNY on the temporal validation dataset, demonstrating comparable performance to the model employing less interpretable code embedding features (R2: 0.800 ± 0.008; RMSE: 17,279 ± 437 CNY) and the hybrid graph isomorphism network (R2: 0.802 ± 0.007; RMSE: 17,249 ± 387 CNY). The interpretation of the network-enhanced model assisted in pinpointing specific diagnoses and procedures associated with higher hospitalization costs, including acute kidney injury, permanent atrial fibrillation, intra-aortic balloon pump, and temporary pacemaker placement, among others. CONCLUSION: Our analysis results demonstrate that the proposed model strikes a balance between predictive accuracy and interpretability. It aids in identifying specific diagnoses and procedures associated with higher hospitalization costs, underscoring its potential to support intelligent management of IHD.
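
The two regression metrics quoted in this abstract, R2 and RMSE, follow from the residuals between observed and predicted costs. The following sketch uses the textbook definitions. The function name `r2_and_rmse` and the cost values are hypothetical and unrelated to the study's data.

```python
import math


def r2_and_rmse(y_true, y_pred):
    """Coefficient of determination and root mean square error."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared prediction errors.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of observations around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse


# Hypothetical hospitalization costs in CNY.
costs_true = [10000.0, 20000.0, 30000.0, 40000.0]
costs_pred = [12000.0, 18000.0, 33000.0, 39000.0]
r2, rmse = r2_and_rmse(costs_true, costs_pred)
print(f"R2 = {r2:.3f}, RMSE = {rmse:.0f} CNY")
```

RMSE is expressed in the unit of the target (CNY here), while R2 is unitless, so the two are usually reported together as in the abstract.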


Subject(s)
Hospitalization , Myocardial Ischemia , Humans , Myocardial Ischemia/diagnosis , Hospitalization/economics , Machine Learning , Algorithms , Health Care Costs/statistics & numerical data , Neural Networks, Computer
7.
Chin Med ; 19(1): 71, 2024 May 15.
Article in English | MEDLINE | ID: mdl-38750482

ABSTRACT

BACKGROUND: Traditional Chinese Medicine (TCM) defines constitutions that are relevant to corresponding diseases among people. As one of the common constitutions, Yin-deficiency constitution influences disease onset in a portion of the Chinese population. Therefore, accurate Yin-deficiency constitution identification is significant for disease prevention and treatment. METHODS: In this study, we separately recruited participants with Yin-deficiency constitution and with balanced constitution. The least absolute shrinkage and selection operator (LASSO) and logistic regression were used to analyze genetic predictors. Four machine learning models for Yin-deficiency constitution classification with multiple combined genetic indicators were integrated to analyze and identify the optimal model and features. The Shapley Additive exPlanations (SHAP) interpretation was developed for model explanation. RESULTS: The results showed that NFKBIA, BCL2A1 and CCL4 were the genetic indicators most strongly associated with Yin-deficiency constitution. The random forest with the three genetic predictors NFKBIA, BCL2A1 and CCL4 was the optimal model (area under the curve (AUC): 0.937 (95% CI 0.844-1.000); sensitivity: 0.870; specificity: 0.900). The SHAP method provided an intuitive explanation of the risk leading to individual predictions. CONCLUSION: We constructed a Yin-deficiency constitution classification model based on machine learning and explained it with the SHAP method, providing an objective Yin-deficiency constitution identification system in TCM and guidance for clinicians.

8.
Sci Rep ; 14(1): 7691, 2024 04 02.
Article in English | MEDLINE | ID: mdl-38565845

ABSTRACT

Spinal cord injury (SCI) is a prevalent and serious complication among patients with spinal tuberculosis (STB) that can lead to motor and sensory impairment and potentially paraplegia. This research aims to identify factors associated with SCI in STB patients and to develop a clinically significant predictive model. Clinical data from STB patients at a single hospital were collected and divided into training and validation sets. Univariate analysis was employed to screen clinical indicators in the training set. Multiple machine learning (ML) algorithms were utilized to establish predictive models. Model performance was evaluated and compared using receiver operating characteristic (ROC) curves, area under the curve (AUC), calibration curve analysis, decision curve analysis (DCA), and precision-recall (PR) curves. The optimal model was determined, and a prospective cohort from two other hospitals served as a testing set to assess its accuracy. Model interpretation and variable importance ranking were conducted using the DALEX R package. The model was deployed on the web by using the Shiny app. Ten clinical characteristics were utilized for the model. The random forest (RF) model emerged as the optimal choice based on the AUC, PRs, calibration curve analysis, and DCA, achieving a test set AUC of 0.816. Additionally, MONO was identified as the primary predictor of SCI in STB patients through variable importance ranking. The RF predictive model provides an efficient and swift approach for predicting SCI in STB patients.
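
The AUC used here for model comparison can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The following sketch computes it directly from that pairwise definition. The function name `roc_auc`, the labels, and the scores are hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predicted risks for six patients (1 = SCI, 0 = no SCI).
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.2, 0.5]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

Here seven of the nine positive/negative pairs are ordered correctly, giving an AUC of about 0.778.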


Subject(s)
Spinal Cord Injuries , Tuberculosis, Spinal , Humans , Prospective Studies , Tuberculosis, Spinal/complications , Spinal Cord Injuries/complications , Algorithms , Machine Learning , Retrospective Studies
9.
ISA Trans ; 148: 374-386, 2024 May.
Article in English | MEDLINE | ID: mdl-38664117

ABSTRACT

Accurate identification of the failure modes of Reinforced Concrete (RC) columns based on the design parameters of the structural members is critical for earthquake-resistant design and safety evaluation of existing structures. Existing identification methods have some problems, such as high cost, incomplete consideration of influencing factors, and low precision or recall in identifying shear or flexural-shear failure. In this paper, the main factors for the failure modes of RC columns are first analyzed and studied. Then, the problem of class imbalance in data samples is investigated. To identify the failure modes of RC columns, oversampling of data (BSB-FMC), model ensembling (RFB-FMC), cost-sensitive learning (CSB-FMC) and a fusion model of the three strategies (BSFCB-FMC) are proposed. Finally, the SHapley Additive exPlanations (SHAP) method is used to provide a better interpretation of the designed model. The results show that the developed strategies can improve the accuracy of identifying the failure modes of RC columns compared to models using a single Artificial Neural Network (ANN), Support Vector Machine (SVM), Random Forest (RF), or Adaptive Boosting (AdaBoost). The overall accuracy of the developed BSFCB-FMC model reaches 97%, and the precision and recall for the three failure modes are both above 90%. The designed model provides a solution for fast, accurate and cost-effective identification of the failure modes of RC columns.
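
Oversampling is one of the class-imbalance strategies named in this abstract. The following sketch shows plain random oversampling, which duplicates minority-class samples until every class matches the largest one. This is a generic baseline and is not claimed to be the technique behind BSB-FMC. The function name `random_oversample`, the one-feature samples, and the class labels are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        # Original samples of this class, used as the resampling pool.
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Hypothetical imbalanced data set: 4 / 2 / 1 samples per failure mode.
X = [[1.0], [1.1], [1.2], [1.3], [2.0], [2.1], [3.0]]
y = ["flexural"] * 4 + ["flexural-shear"] * 2 + ["shear"]
X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
print(balanced_counts)
```

Oversampling should be applied to the training split only, otherwise duplicated samples leak into the evaluation data and inflate the reported precision and recall.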

10.
Environ Sci Pollut Res Int ; 31(23): 33610-33622, 2024 May.
Article in English | MEDLINE | ID: mdl-38689043

ABSTRACT

Livestock manure is one of the most important pools of antibiotic resistance genes (ARGs) in the environment. Aerobic composting can effectively reduce the risk of antibiotic resistance spreading from livestock manure. Understanding the effect of aerobic composting process parameters on manure-sourced ARGs is important for controlling that risk. In this study, the effects of process parameters on ARGs during aerobic composting of pig manure were explored through data mining based on 191 valid data points collected from the literature. Machine learning (ML) models (XGBoost and Random Forest) were utilized to predict the rate of ARG changes during pig manure composting. The model evaluation index of the XGBoost model (R2 = 0.651) was higher than that of the Random Forest (R2 = 0.490), indicating that XGBoost had better prediction performance. Feature importance was further calculated for the XGBoost model, and the XGBoost black box model was interpreted by Shapley additive explanations analysis. Results indicated that the factors influencing ARG variation in pig manure ranked, in order, as thermophilic period, total composting period, composting real time, and thermophilic stage average temperature. The findings gave insight into the application of ML models to predict and decipher ARG changes during manure composting and provided suggestions for better composting manipulation and optimization of process parameters.


Subject(s)
Composting , Drug Resistance, Microbial , Machine Learning , Manure , Composting/methods , Animals , Swine , Drug Resistance, Microbial/genetics
11.
Artif Intell Med ; 149: 102785, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38462285

ABSTRACT

Early detection of acute kidney injury (AKI) may provide a crucial window of opportunity to prevent further injury, which helps improve clinical outcomes. This study aimed to develop a deep interpretable network for continuously predicting the 24-hour AKI risk in real time and to evaluate its performance internally and externally in critically ill patients. A total of 21,163 patients' electronic health records sourced from Beth Israel Deaconess Medical Center (BIDMC) were first included in building the model. Two external validation populations included 3025 patients from the Philips eICU Research Institute and 2625 patients from Zhongda Hospital Southeast University. A total of 152 intelligently engineered predictors were extracted on an hourly basis. The prediction model, referred to as DeepAKI, was designed on the basic framework of squeeze-and-excitation networks with embedded dilated causal convolution. The integrated gradients method was used to explain the prediction model. On the internal validation set (3175 [15%] patients from BIDMC) and the two external validation sets, DeepAKI obtained areas under the curve of 0.799 (95% CI 0.791-0.806), 0.763 (95% CI 0.755-0.771) and 0.676 (95% CI 0.668-0.684) for continuous AKI prediction, respectively. For model interpretability, clinically relevant variables that contributed most to the model prediction were identified, and individual explanations along the timeline were explored to show how AKI risk arose. The potential threats to the generalisability of deep learning-based models deployed across health systems in real-world settings were analyzed.
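
Dilated causal convolution, one building block named in this abstract, produces each output from the current and earlier time steps only, with a gap (the dilation) between the inputs it reads. The following sketch implements that single operation on a one-dimensional series. It is not the DeepAKI architecture. The function name `dilated_causal_conv1d`, the kernel weights, and the hourly values are hypothetical.

```python
def dilated_causal_conv1d(x, kernel, dilation=1):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], treating x as zero before t = 0."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # skip taps that would read before the start of the series
                acc += w * x[idx]
        out.append(acc)
    return out


# Hypothetical hourly measurements of one predictor.
hourly = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = dilated_causal_conv1d(hourly, kernel=[0.5, 0.5], dilation=2)
print(y)
```

Because no output depends on later inputs, a model built from such layers can be evaluated hour by hour as new data arrive, which matches the continuous prediction setting the abstract describes.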


Subject(s)
Acute Kidney Injury , Critical Illness , Humans , Risk Assessment , Risk Factors , Patients , Acute Kidney Injury/diagnosis , Acute Kidney Injury/etiology
12.
Heliyon ; 10(4): e26570, 2024 Feb 29.
Article in English | MEDLINE | ID: mdl-38420451

ABSTRACT

Background: Sepsis-associated acute kidney injury (SA-AKI) is a severe complication associated with poorer prognosis and increased mortality, particularly in elderly patients. Currently, there is a lack of accurate mortality risk prediction models for these patients in clinical practice. Objectives: This study aimed to develop and validate machine learning models for predicting in-hospital mortality risk in elderly patients with SA-AKI. Methods: Machine learning models were developed and validated using the public, high-quality Medical Information Mart for Intensive Care (MIMIC)-IV critically ill database. The recursive feature elimination (RFE) algorithm was employed for key feature selection. Eleven predictive models were compared, with the best one selected for further validation. Shapley Additive Explanations (SHAP) values were used for visualization and interpretation, making the machine learning models clinically interpretable. Results: There were 16,154 patients with SA-AKI in the MIMIC-IV database; 7,728 were excluded based on the exclusion criteria, and 8,426 SA-AKI patients were included in this study (median age: 77.0 years; female: 45%). They were randomly divided into a training cohort (5,934, 70%) and a validation cohort (2,492, 30%). Nine key features were selected by the RFE algorithm. The CatBoost model achieved the best performance, with an AUC of 0.844 in the training cohort and 0.804 in the validation cohort. SHAP values revealed that AKI stage, PaO2, and lactate were the top three most important features contributing to the CatBoost model. Conclusion: We developed a model capable of predicting the risk of in-hospital mortality in elderly patients with SA-AKI.
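
The cohort counts in this abstract can be cross-checked with simple arithmetic. The following sketch uses only the numbers stated in the abstract, and the variable names are illustrative.

```python
# Counts as reported in the abstract.
total_sa_aki = 16154
included = 8426
excluded = total_sa_aki - included

train, valid = 5934, 2492

# Shares of the included cohort, rounded to whole percentages.
train_share = round(100 * train / included)
valid_share = round(100 * valid / included)

print(f"excluded: {excluded}")
print(f"train + valid = {train + valid} (included: {included})")
print(f"split: {train_share}% / {valid_share}%")
```

The excluded count, the sum of the two cohorts, and the rounded 70/30 split all agree with the figures given in the abstract.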

13.
BMC Bioinformatics ; 25(1): 76, 2024 Feb 20.
Article in English | MEDLINE | ID: mdl-38378494

ABSTRACT

BACKGROUND: Genetic ancestry, inferred from genomic data, is a quantifiable biological parameter. While much of the human genome is identical across populations, it is estimated that as much as 0.4% of the genome can differ due to ancestry. This variation is primarily characterized by single nucleotide variants (SNVs), which are often unique to specific genetic populations. Knowledge of a patient's genetic ancestry can inform clinical decisions, from genetic testing and health screenings to medication dosages, based on ancestral disease predispositions. Nevertheless, the current reliance on self-reported ancestry can introduce subjectivity and exacerbate health disparities. While genomic sequencing data enables objective determination of a patient's genetic ancestry, existing approaches are limited to ancestry inference at the continental level. RESULTS: To address this challenge and create an objective, measurable metric of genetic ancestry, we present SNVstory, a method built upon three independent machine learning models for accurately inferring the sub-continental ancestry of individuals. We also introduce a novel method for simulating individual samples from aggregate allele frequencies from known populations. SNVstory includes a feature-importance scheme, unique among open-source ancestral tools, which allows the user to track the ancestral signal broadcast by a given gene or locus. We successfully evaluated SNVstory using a clinical exome sequencing dataset, comparing self-reported ethnicity and race to our inferred genetic ancestry, and demonstrated the capability of the algorithm to estimate ancestry from 36 different populations with high accuracy. CONCLUSIONS: SNVstory represents a significant advance in methods to assign genetic ancestry, opening the door to ancestry-informed care. SNVstory, an open-source model, is packaged as a Docker container for enhanced reliability and interoperability. It can be accessed from https://github.com/nch-igm/snvstory.


Subject(s)
Ethnicity , Genetics, Population , Humans , Reproducibility of Results , Gene Frequency , Ethnicity/genetics , Genetic Testing , Genome, Human , Polymorphism, Single Nucleotide
14.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38279650

ABSTRACT

As the application of large language models (LLMs) has broadened into the realm of biological predictions, leveraging their capacity for self-supervised learning to create feature representations of amino acid sequences, these models have set a new benchmark in tackling downstream challenges, such as subcellular localization. However, previous studies have primarily focused on either the structural design of models or differing strategies for fine-tuning, largely overlooking investigations into the nature of the features derived from LLMs. In this research, we propose different ESM2 representation extraction strategies, considering both the character type and position within the ESM2 input sequence. Using model dimensionality reduction, predictive analysis and interpretability techniques, we have illuminated potential associations between diverse feature types and specific subcellular localizations. In particular, predictions for the mitochondrion and Golgi apparatus favor features from segments closer to the N-terminus, and phosphorylation site-based features could mirror phosphorylation properties. We also evaluate the prediction performance and interpretability robustness of Random Forest and Deep Neural Networks with varied feature inputs. This work offers novel insights into maximizing LLMs' utility, understanding their mechanisms, and extracting biological domain knowledge. Furthermore, we have made the code, feature extraction API, and all relevant materials available at https://github.com/yujuan-zhang/feature-representation-for-LLMs.


Subject(s)
Computational Biology , Neural Networks, Computer , Computational Biology/methods , Amino Acid Sequence , Protein Transport
15.
Comput Biol Med ; 170: 108049, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38290319

ABSTRACT

Mammalian embryonic development is a complex process, characterized by intricate spatiotemporal dynamics and distinct chromatin preferences. However, the quick diversification in early embryogenesis leads to significant cellular diversity and sparsity in scRNA-seq data, posing challenges in accurately determining cell fate decisions. In this study, we introduce a chromatin region binning method, scChrBin, designed to identify chromatin regions that elucidate the dynamics of embryonic development and lineage differentiation. This method transforms scRNA-seq data into a chromatin-based matrix, leveraging genomic annotations. Our results showed that the scChrBin method achieves high accuracy, with 98.0% and 89.2% on two single-cell embryonic datasets, demonstrating its effectiveness in analyzing complex developmental processes. We also systematically and comprehensively analyze these key chromatin binning regions and their associated genes, focusing on their roles in lineage and stage development. The chromatin region binning perspective enables a comprehensive analysis of transcriptome data at the chromatin level, allowing us to unveil the dynamic expression of chromatin regions across temporal and spatial development. The tool is available as an application at https://github.com/liameihao/scChrBin.


Subject(s)
Chromatin , Embryonic Development , Animals , Female , Pregnancy , Chromatin/genetics , Embryonic Development/genetics , Cell Differentiation/genetics , Transcriptome , Genome , Gene Expression Profiling , Sequence Analysis, RNA , Mammals/genetics
16.
ChemMedChem ; 19(3): e202300586, 2024 02 01.
Article in English | MEDLINE | ID: mdl-37983655

ABSTRACT

The use of black box machine learning models whose decisions cannot be understood limits the acceptance of predictions in interdisciplinary research and camouflages artificial learning characteristics leading to predictions for other than anticipated reasons. Consequently, there is increasing interest in explainable artificial intelligence to rationalize predictions and uncover potential pitfalls. Among others, relevant approaches include feature attribution methods to identify molecular structures determining predictions and counterfactuals (CFs) or contrastive explanations. CFs are defined as variants of test instances with minimal modifications leading to opposing predictions. In medicinal chemistry, CFs have thus far only been little investigated although they are particularly intuitive from a chemical perspective. We introduce a new methodology for the systematic generation of CFs that is centered on well-defined structural analogues of test compounds. The approach is transparent, computationally straightforward, and shown to provide a wealth of CFs for test sets. The method is made freely available.
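
This abstract defines a counterfactual as a variant of a test instance with a minimal modification that leads to the opposing prediction. The following toy sketch illustrates that definition by enumerating single-position substitutions and keeping those that flip a classifier's output. It is not the authors' analogue-based methodology. The names `find_counterfactuals` and `toy_predict`, the substituent labels, and the prediction rule are hypothetical.

```python
def find_counterfactuals(instance, candidates, predict):
    """Return variants differing from `instance` at exactly one position that flip the prediction."""
    original = predict(instance)
    found = []
    for pos in range(len(instance)):
        for sub in candidates:
            if sub == instance[pos]:
                continue  # identical substitution is not a modification
            variant = instance[:pos] + (sub,) + instance[pos + 1:]
            if predict(variant) != original:
                found.append(variant)
    return found


def toy_predict(compound):
    # Hypothetical rule standing in for a trained model.
    return "active" if "OH" in compound else "inactive"


# A compound described by three substituent positions (toy representation).
compound = ("CH3", "OH", "Cl")
cfs = find_counterfactuals(compound, ["H", "F", "OH"], toy_predict)
print(cfs)
```

Only changes at the second position flip the toy prediction, which points to that position as the one the model's decision depends on.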


Subject(s)
Artificial Intelligence , Machine Learning , Chemistry, Pharmaceutical , Recombination, Genetic
17.
J Hazard Mater ; 465: 133092, 2024 Mar 05.
Article in English | MEDLINE | ID: mdl-38039812

ABSTRACT

Cancer remains a significant global health concern, with millions of deaths attributed to it annually. Environmental pollutants play a pivotal role in cancer etiology and contribute to the growing prevalence of this disease. The carcinogenic assessment of these pollutants is crucial for chemical health evaluation and environmental risk assessments. Traditional experimental methods are expensive and time-consuming, prompting the development of alternative approaches such as in silico methods. In this regard, deep learning (DL) has shown potential but lacks optimal performance and interpretability. This study introduces an interpretable DL model called CarcGC for chemical carcinogenicity prediction, utilizing a graph convolutional neural network (GCN) that employs molecular structural graphs as inputs. Compared to existing models, CarcGC demonstrated enhanced performance, with the area under the receiver operating characteristic curve (AUCROC) reaching 0.808 on the test set. Because air pollution is closely related to the incidence of lung cancers, we applied CarcGC to predict the potential carcinogenicity of chemicals listed in the United States Environmental Protection Agency's Hazardous Air Pollutants (HAPs) inventory, offering a foundation for environmental carcinogenicity screening. This study highlights the potential of artificially intelligent methods in carcinogenicity prediction and underscores the value of CarcGC interpretability in revealing the structural basis and molecular mechanisms underlying chemical carcinogenicity.


Subject(s)
Air Pollutants , Deep Learning , Environmental Pollutants , Neoplasms , United States , Humans , Carcinogens/chemistry
18.
Chemosphere ; 349: 140984, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38122944

ABSTRACT

The hydrated electron reaction rate constant (ke-aq) is an important parameter for determining reductive degradation efficiency and for mitigating the ecological risk of organic compounds (OCs). However, OC species morphology and the concentration of hydrated electrons (e-aq) in water vary with pH, complicating OC fate assessment. This study introduced the environmental variable of pH to develop models for ke-aq from 701 data points using 3 descriptor types: (i) molecular descriptors (MD), (ii) quantum chemical descriptors (QCD), and (iii) the combination of both (MD + QCD). Models were screened using 2 descriptor screening methods (MLR and RF) and 14 machine learning (ML) algorithms. The introduction of QCDs that characterized the electronic structure of OCs greatly improved the performance of models while requiring fewer descriptors. The optimal model MLR-XGBoost(MD + QCD), which included pH, achieved the most satisfactory prediction: R2tra = 0.988, Q2boot = 0.861, R2test = 0.875 and Q2test = 0.873. The mechanistic interpretation using the SHAP method further revealed that QCDs, polarizability, volume, and pH had a great influence on the reductive degradation of OCs by e-aq. Overall, the electrochemical parameters (QCDs, pH) related to the solvent and solute are significant and should be considered in any future ML modeling that assesses the fate of OCs in aquatic environments.


Subject(s)
Electrons , Quantitative Structure-Activity Relationship , Organic Chemicals/chemistry , Solutions , Hydrogen-Ion Concentration
19.
Proc Symp Appl Comput ; 2023: 614-617, 2023 Mar.
Article in English | MEDLINE | ID: mdl-38125287

ABSTRACT

Graph Attention Networks (GAT) have been extensively used to perform node-level classification on data that can be represented as a graph. However, few papers have investigated the effectiveness of using GAT on graph representations of patient similarity networks. This paper proposes Patient-GAT, a novel method to predict chronic health conditions. It first applies multi-modal data fusion to generate patient vector representations from imputed lab variables and other structured data. This representation is then used to construct a patient network by measuring patient similarity, and GAT is finally applied to the patient network for disease prediction. We demonstrated our framework by predicting sarcopenia using real-world EHRs obtained from the Indiana Network for Patient Care. We evaluated the performance of our system by comparing it to other baseline models, showing that our model outperforms other methods. In addition, we studied the contribution of the temporal representation of the lab data and discussed the interpretability of this model by analyzing the attention coefficients of the trained Patient-GAT model. Our code can be found on GitHub.
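The network-construction step described above can be sketched as follows. The abstract does not state which similarity measure or linking rule Patient-GAT uses, so cosine similarity and a k-nearest-neighbour rule are assumptions here, as are the function names and the toy patient vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two patient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def build_patient_graph(vectors, k=1):
    """Link each patient to its k most similar peers; returns undirected edges."""
    edges = set()
    for i, vec_i in enumerate(vectors):
        scored = sorted(
            ((cosine_similarity(vec_i, vec_j), j)
             for j, vec_j in enumerate(vectors) if j != i),
            reverse=True,
        )
        for _, j in scored[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

# Hypothetical fused patient vectors (e.g. normalised lab and structured features).
patients = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
edges = build_patient_graph(patients, k=1)
print(sorted(edges))
```

A graph attention layer would then operate on these edges, learning how much weight each patient gives to each neighbour.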

20.
Insights Imaging ; 14(1): 195, 2023 Nov 19.
Article in English | MEDLINE | ID: mdl-37980637

ABSTRACT

PURPOSE: Interpretability is essential for reliable convolutional neural network (CNN) image classifiers in radiological applications. We describe a weakly supervised segmentation model that learns to delineate the target object, trained with only image-level labels ("image contains object" or "image does not contain object"), presenting a different approach towards explainable object detectors for radiological imaging tasks. METHODS: A weakly supervised Unet architecture (WSUnet) was trained to learn lung tumour segmentation from image-level labelled data. WSUnet generates voxel probability maps with a Unet and then constructs an image-level prediction by global max-pooling, thereby facilitating image-level training. WSUnet's voxel-level predictions were compared to traditional model interpretation techniques (class activation mapping, integrated gradients and occlusion sensitivity) in CT data from three institutions (training/validation: n = 412; testing: n = 142). Methods were compared using voxel-level discrimination metrics and clinical value was assessed with a clinician preference survey on data from external institutions. RESULTS: Despite the absence of voxel-level labels in training, WSUnet's voxel-level predictions localised tumours precisely in both validation (precision: 0.77, 95% CI: [0.76-0.80]; dice: 0.43, 95% CI: [0.39-0.46]) and external testing (precision: 0.78, 95% CI: [0.76-0.81]; dice: 0.33, 95% CI: [0.32-0.35]). WSUnet's voxel-level discrimination outperformed the best comparator in validation (area under precision recall curve (AUPR): 0.55, 95% CI: [0.49-0.56] vs. 0.23, 95% CI: [0.21-0.25]) and testing (AUPR: 0.40, 95% CI: [0.38-0.41] vs. 0.36, 95% CI: [0.34-0.37]). Clinicians preferred WSUnet predictions in most instances (clinician preference rate: 0.72, 95% CI: [0.68-0.77]). CONCLUSION: Weakly supervised segmentation is a viable approach by which explainable object detection models may be developed for medical imaging.
CRITICAL RELEVANCE STATEMENT: WSUnet learns to segment images at voxel level, training only with image-level labels. A Unet backbone first generates a voxel-level probability map and then extracts the maximum voxel prediction as the image-level prediction. Thus, training uses only image-level annotations, reducing human workload. WSUnet's voxel-level predictions provide a causally verifiable explanation for its image-level prediction, improving interpretability. KEY POINTS: • Explainability and interpretability are essential for reliable medical image classifiers. • This study applies weakly supervised segmentation to generate explainable image classifiers. • The weakly supervised Unet inherently explains its image-level predictions at voxel level.
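The pooling idea described above can be sketched in a few lines. The voxel probability map, the binary cross-entropy loss, the 0.5 threshold, and the reference mask are assumptions for illustration; the abstract states only that the image-level prediction is the global maximum of the voxel-level map and that Dice was among the reported metrics.

```python
import math

def image_level_prediction(voxel_probs):
    """Global max-pooling: the image score is the highest voxel probability."""
    return max(p for row in voxel_probs for p in row)

def binary_cross_entropy(prob, label, eps=1e-7):
    """Loss on the pooled score, so only an image-level label is required."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def dice_score(pred_mask, true_mask):
    """Dice overlap between two binary masks of the same shape."""
    pred = [p for row in pred_mask for p in row]
    true = [t for row in true_mask for t in row]
    intersection = sum(p * t for p, t in zip(pred, true))
    total = sum(pred) + sum(true)
    return 2.0 * intersection / total if total else 1.0

# Hypothetical 2x2 slice of voxel probabilities produced by a segmentation net.
voxel_probs = [[0.10, 0.20],
               [0.05, 0.90]]
image_score = image_level_prediction(voxel_probs)
loss = binary_cross_entropy(image_score, label=1)  # image contains a tumour

# The voxel map can still be scored against a reference mask after training.
pred_mask = [[1 if p >= 0.5 else 0 for p in row] for row in voxel_probs]
true_mask = [[0, 0],
             [1, 1]]
dice = dice_score(pred_mask, true_mask)
print(image_score, loss, dice)
```

Because the gradient of a max flows only through the highest-scoring voxel, the network is pushed to place high probability on at least one voxel in positive images and on none in negative ones.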
